[BioC] VariantAnnotation locateVariants - meaning of precedesID and followsID

Tue Apr 17 17:20:35 CEST 2012

Hi,

I have a question concerning the meaning of precedesID and followsID in 
the result of the locateVariants function from the VariantAnnotation 
package.

The reference manual states:

precedesID: The id of the gene the query precedes. Only present for 
‘intergenic’ variants.
followsID: The id of the gene the query follows. Only present for 
‘intergenic’ variants.

Which I read as: The precedesID is the ID of the gene that follows the 
query i.e. the gene which is preceded by the query. The query in this 
case would be my snv position?!

I assume that the order of elements on the chromosome would be:
gene with ID followsID, query(SNV), gene with ID precedesID.

In the return value of locateVariants I find different scenarios, an 
example is given in the attached file, I show a table here:

pstrand pstart pend start end fstrand fstart fend
rs62637812 - 34611 36082 51803 51803 - 136698 140567
1:54586 + 69091 70009 54586 54586 - 136698 140567
1:145820 - 136698 140567 145820 145820 + 69091 70009
1:231454 + 322037 326939 231454 231454 + 69091 70009
1:333714 + 367659 368598 333714 333714 + 323892 328582
1:467095 + 763064 788903 467095 467095 + 367659 368598
1:715857 - 700245 714069 715857 715857 - 761586 762903
rs1044922 - 761586 762903 792263 792263 + 763064 788903
rs7545373 - 803451 812183 812284 812284 + 763064 788903
rs192553893 + 860530 871277 839873 839873 - 852953 854818
rs6673914 - 852953 854818 855075 855075 - 879583 893919
1:919987 - 910579 912022 919987 919987 + 901877 910485
rs2488989 + 948847 949920 934121 934121 - 934342 935553
rs3128114 - 934342 935553 935715 935715 + 901877 910485
rs9442388 + 955503 991500 951295 951295 + 948847 949920

I got the coordinates from the transcriptDb object I used for the call 
to locateVariants via the transcriptsBy(..., "gene") function.

I see cases with both genes upstream of the variant, or with their 
positions switched etc.
Could someone help me make sense of this, I can't see an obvious pattern 
explaining this.

Cheers,
Paul
-------------- next part --------------
> library(VariantAnnotation)
[...]
> library(BSgenome.Hsapiens.UCSC.hg19)
[...]
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
[...]
> subst <- c("RSID", "PREDICTION", "SCORE", "AACHANGE", "PROTEINID")
> tab <- TabixFile( "../my.vcf.gz" )
> seqlens <- seqlengths( BSgenome.Hsapiens.UCSC.hg19::Hsapiens )[1:24] 
> names(seqlens) <- paste( c(1:22,"X","Y") )
> which <- RangesList("1"=IRanges(50000, 1000000))
> svp = ScanVcfParam(which=which)
> vcf = readVcf(tab, "hg19", svp )
> rename <- paste("chr",seqlevels(vcf),sep="")
> names(rename) <- seqlevels(vcf)
> vcf <- renameSeqlevels(vcf, rename)
> txdb19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
> txdb19
TranscriptDb object:
| Db type: TranscriptDb
| Supporting package: GenomicFeatures
| Data source: UCSC
| Genome: hg19
| Genus and Species: Homo sapiens
| UCSC Table: knownGene
| Resource URL: http://genome.ucsc.edu/
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| miRBase build ID: GRCh37
| transcript_nrow: 80922
| exon_nrow: 286852
| cds_nrow: 235842
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2012-03-12 21:45:23 -0700 (Mon, 12 Mar 2012)
| GenomicFeatures version at creation time: 1.7.30
| RSQLite version at creation time: 0.11.1
| DBSCHEMAVERSION: 1.0
> rd = rowData(vcf)
> rd
GRanges with 1900 ranges and 1 elementMetadata col:
              seqnames           ranges strand   | paramRangeID
                 <Rle>        <IRanges>  <Rle>   |     <factor>
   rs62637812     chr1   [51803, 51803]      *   |         <NA>
   rs76402894     chr1   [51898, 51898]      *   |         <NA>
   rs78732933     chr1   [51928, 51928]      *   |         <NA>
   rs62637813     chr1   [52058, 52058]      *   |         <NA>
      1:52096     chr1   [52096, 52096]      *   |         <NA>
      1:54586     chr1   [54586, 54586]      *   |         <NA>
      1:54844     chr1   [54844, 54844]      *   |         <NA>
      1:58176     chr1   [58176, 58176]      *   |         <NA>
      1:58211     chr1   [58211, 58211]      *   |         <NA>
          ...      ...              ...    ... ...          ...
   rs56255212     chr1 [985449, 985449]      *   |         <NA>
     1:985450     chr1 [985450, 985450]      *   |         <NA>
    rs3121562     chr1 [991724, 991724]      *   |         <NA>
  rs113866178     chr1 [991931, 991931]      *   |         <NA>
    rs2710886     chr1 [992084, 992084]      *   |         <NA>
    rs2799074     chr1 [992094, 992094]      *   |         <NA>
     1:995121     chr1 [995121, 995121]      *   |         <NA>
     1:995125     chr1 [995125, 995125]      *   |         <NA>
   rs28397086     chr1 [997408, 997408]      *   |         <NA>
  ---
  seqlengths:
   chr1
     NA
> loc = locateVariants(vcf, txdb19, AllVariants())
> loc_df <- as.data.frame(loc)
> 
> txbg <- transcriptsBy(txdb19, "gene")
> 
> loc_df <- loc_df[ is.finite(loc_df$precedesID), c("seqnames","start","end","precedesID","followsID")]
> loc_df <- loc_df[!duplicated(loc_df$precedesID),]
> 
> prec <- txbg[ as.character(loc_df$precedesID) ]
> foll <- txbg[ as.character(loc_df$followsID) ]
> 
> for( i in 1:length(names(prec)) ){
+ np = names(prec)[i]
+ nf = names(foll)[i]
+ p = prec[[np]]
+ f = foll[[nf]]
+ p1 = p[1]
+ f1 = f[1]
+ #print( paste( loc_df$start[i], loc_df$end[i], "|", p1 at seqnames, p1 at strand, p1 at ranges@start, p1 at ranges@start + p1 at ranges@width, "vs.", f1 at seqnames, f1 at strand, f1 at ranges@start, f1 at ranges@start + f1 at ranges@width ) )
+ loc_df$pstrand[i] = as.character(p1 at strand)
+ loc_df$pstart[i] = as.numeric(p1 at ranges@start)
+ loc_df$pend[i] = as.numeric(p1 at ranges@start) + as.numeric(p1 at ranges@width)
+ loc_df$fstrand[i] = as.character(f1 at strand)
+ loc_df$fstart[i] = as.numeric(f1 at ranges@start)
+ loc_df$fend[i] = as.numeric(f1 at ranges@start) + as.numeric(f1 at ranges@width)
+ }
> 
> loc_df[, c("pstrand","pstart","pend","start","end","fstrand","fstart","fend") ]
            pstrand pstart   pend  start    end fstrand fstart   fend
rs62637812        -  34611  36082  51803  51803       - 136698 140567
1:54586           +  69091  70009  54586  54586       - 136698 140567
1:145820          - 136698 140567 145820 145820       +  69091  70009
1:231454          + 322037 326939 231454 231454       +  69091  70009
1:333714          + 367659 368598 333714 333714       + 323892 328582
1:467095          + 763064 788903 467095 467095       + 367659 368598
1:715857          - 700245 714069 715857 715857       - 761586 762903
rs1044922         - 761586 762903 792263 792263       + 763064 788903
rs7545373         - 803451 812183 812284 812284       + 763064 788903
rs192553893       + 860530 871277 839873 839873       - 852953 854818
rs6673914         - 852953 854818 855075 855075       - 879583 893919
1:919987          - 910579 912022 919987 919987       + 901877 910485
rs2488989         + 948847 949920 934121 934121       - 934342 935553
rs3128114         - 934342 935553 935715 935715       + 901877 910485
rs9442388         + 955503 991500 951295 951295       + 948847 949920

> sessionInfo()
R Under development (unstable) (2012-04-16 r59046)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1
 [2] GenomicFeatures_1.8.1                  
 [3] AnnotationDbi_1.18.0                   
 [4] Biobase_2.16.0                         
 [5] BSgenome.Hsapiens.UCSC.hg19_1.3.17     
 [6] BSgenome_1.24.0                        
 [7] VariantAnnotation_1.2.5                
 [8] Rsamtools_1.8.3                        
 [9] Biostrings_2.24.1                      
[10] GenomicRanges_1.8.3                    
[11] IRanges_1.14.2                         
[12] BiocGenerics_0.2.0                     

loaded via a namespace (and not attached):
 [1] biomaRt_2.12.0     bitops_1.0-4.1     DBI_0.2-5          grid_2.16.0       
 [5] lattice_0.20-6     Matrix_1.0-6       RCurl_1.91-1       RSQLite_0.11.1    
 [9] rtracklayer_1.16.1 snpStats_1.6.0     splines_2.16.0     stats4_2.16.0     
[13] survival_2.36-12   tools_2.16.0       XML_3.9-4          zlibbioc_1.2.0