[BioC] VariantAnnotation locateVariants - meaning of precedesID and followsID
Paul Theodor Pyl
paul.theodor.pyl at embl.de
Tue Apr 17 17:20:35 CEST 2012
Hi,
I have a question concerning the meaning of precedesID and followsID in
the result of the locateVariants function from the VariantAnnotation
package.
The reference manual states:
precedesID: The id of the gene the query precedes. Only present for
‘intergenic’ variants.
followsID: The id of the gene the query follows. Only present for
‘intergenic’ variants.
Which I read as: The precedesID is the ID of the gene that follows the
query i.e. the gene which is preceded by the query. The query in this
case would be my snv position?!
I assume that the order of elements on the chromosome would be:
gene with ID followsID, query(SNV), gene with ID precedesID.
In the return value of locateVariants I find different scenarios, an
example is given in the attached file, I show a table here:
pstrand pstart pend start end fstrand fstart fend
rs62637812 - 34611 36082 51803 51803 - 136698 140567
1:54586 + 69091 70009 54586 54586 - 136698 140567
1:145820 - 136698 140567 145820 145820 + 69091 70009
1:231454 + 322037 326939 231454 231454 + 69091 70009
1:333714 + 367659 368598 333714 333714 + 323892 328582
1:467095 + 763064 788903 467095 467095 + 367659 368598
1:715857 - 700245 714069 715857 715857 - 761586 762903
rs1044922 - 761586 762903 792263 792263 + 763064 788903
rs7545373 - 803451 812183 812284 812284 + 763064 788903
rs192553893 + 860530 871277 839873 839873 - 852953 854818
rs6673914 - 852953 854818 855075 855075 - 879583 893919
1:919987 - 910579 912022 919987 919987 + 901877 910485
rs2488989 + 948847 949920 934121 934121 - 934342 935553
rs3128114 - 934342 935553 935715 935715 + 901877 910485
rs9442388 + 955503 991500 951295 951295 + 948847 949920
I got the coordinates from the transcriptDb object I used for the call
to locateVariants via the transcriptsBy(..., "gene") function.
I see cases with both genes upstream of the variant, or with their
positions switched etc.
Could someone help me make sense of this, I can't see an obvious pattern
explaining this.
Cheers,
Paul
-------------- next part --------------
> library(VariantAnnotation)
[...]
> library(BSgenome.Hsapiens.UCSC.hg19)
[...]
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
[...]
> subst <- c("RSID", "PREDICTION", "SCORE", "AACHANGE", "PROTEINID")
> tab <- TabixFile( "../my.vcf.gz" )
> seqlens <- seqlengths( BSgenome.Hsapiens.UCSC.hg19::Hsapiens )[1:24]
> names(seqlens) <- paste( c(1:22,"X","Y") )
> which <- RangesList("1"=IRanges(50000, 1000000))
> svp = ScanVcfParam(which=which)
> vcf = readVcf(tab, "hg19", svp )
> rename <- paste("chr",seqlevels(vcf),sep="")
> names(rename) <- seqlevels(vcf)
> vcf <- renameSeqlevels(vcf, rename)
> txdb19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
> txdb19
TranscriptDb object:
| Db type: TranscriptDb
| Supporting package: GenomicFeatures
| Data source: UCSC
| Genome: hg19
| Genus and Species: Homo sapiens
| UCSC Table: knownGene
| Resource URL: http://genome.ucsc.edu/
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| miRBase build ID: GRCh37
| transcript_nrow: 80922
| exon_nrow: 286852
| cds_nrow: 235842
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2012-03-12 21:45:23 -0700 (Mon, 12 Mar 2012)
| GenomicFeatures version at creation time: 1.7.30
| RSQLite version at creation time: 0.11.1
| DBSCHEMAVERSION: 1.0
> rd = rowData(vcf)
> rd
GRanges with 1900 ranges and 1 elementMetadata col:
seqnames ranges strand | paramRangeID
<Rle> <IRanges> <Rle> | <factor>
rs62637812 chr1 [51803, 51803] * | <NA>
rs76402894 chr1 [51898, 51898] * | <NA>
rs78732933 chr1 [51928, 51928] * | <NA>
rs62637813 chr1 [52058, 52058] * | <NA>
1:52096 chr1 [52096, 52096] * | <NA>
1:54586 chr1 [54586, 54586] * | <NA>
1:54844 chr1 [54844, 54844] * | <NA>
1:58176 chr1 [58176, 58176] * | <NA>
1:58211 chr1 [58211, 58211] * | <NA>
... ... ... ... ... ...
rs56255212 chr1 [985449, 985449] * | <NA>
1:985450 chr1 [985450, 985450] * | <NA>
rs3121562 chr1 [991724, 991724] * | <NA>
rs113866178 chr1 [991931, 991931] * | <NA>
rs2710886 chr1 [992084, 992084] * | <NA>
rs2799074 chr1 [992094, 992094] * | <NA>
1:995121 chr1 [995121, 995121] * | <NA>
1:995125 chr1 [995125, 995125] * | <NA>
rs28397086 chr1 [997408, 997408] * | <NA>
---
seqlengths:
chr1
NA
> loc = locateVariants(vcf, txdb19, AllVariants())
> loc_df <- as.data.frame(loc)
>
> txbg <- transcriptsBy(txdb19, "gene")
>
> loc_df <- loc_df[ is.finite(loc_df$precedesID), c("seqnames","start","end","precedesID","followsID")]
> loc_df <- loc_df[!duplicated(loc_df$precedesID),]
>
> prec <- txbg[ as.character(loc_df$precedesID) ]
> foll <- txbg[ as.character(loc_df$followsID) ]
>
> for( i in 1:length(names(prec)) ){
+ np = names(prec)[i]
+ nf = names(foll)[i]
+ p = prec[[np]]
+ f = foll[[nf]]
+ p1 = p[1]
+ f1 = f[1]
+ #print( paste( loc_df$start[i], loc_df$end[i], "|", p1 at seqnames, p1 at strand, p1 at ranges@start, p1 at ranges@start + p1 at ranges@width, "vs.", f1 at seqnames, f1 at strand, f1 at ranges@start, f1 at ranges@start + f1 at ranges@width ) )
+ loc_df$pstrand[i] = as.character(p1 at strand)
+ loc_df$pstart[i] = as.numeric(p1 at ranges@start)
+ loc_df$pend[i] = as.numeric(p1 at ranges@start) + as.numeric(p1 at ranges@width)
+ loc_df$fstrand[i] = as.character(f1 at strand)
+ loc_df$fstart[i] = as.numeric(f1 at ranges@start)
+ loc_df$fend[i] = as.numeric(f1 at ranges@start) + as.numeric(f1 at ranges@width)
+ }
>
> loc_df[, c("pstrand","pstart","pend","start","end","fstrand","fstart","fend") ]
pstrand pstart pend start end fstrand fstart fend
rs62637812 - 34611 36082 51803 51803 - 136698 140567
1:54586 + 69091 70009 54586 54586 - 136698 140567
1:145820 - 136698 140567 145820 145820 + 69091 70009
1:231454 + 322037 326939 231454 231454 + 69091 70009
1:333714 + 367659 368598 333714 333714 + 323892 328582
1:467095 + 763064 788903 467095 467095 + 367659 368598
1:715857 - 700245 714069 715857 715857 - 761586 762903
rs1044922 - 761586 762903 792263 792263 + 763064 788903
rs7545373 - 803451 812183 812284 812284 + 763064 788903
rs192553893 + 860530 871277 839873 839873 - 852953 854818
rs6673914 - 852953 854818 855075 855075 - 879583 893919
1:919987 - 910579 912022 919987 919987 + 901877 910485
rs2488989 + 948847 949920 934121 934121 - 934342 935553
rs3128114 - 934342 935553 935715 935715 + 901877 910485
rs9442388 + 955503 991500 951295 951295 + 948847 949920
> sessionInfo()
R Under development (unstable) (2012-04-16 r59046)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1
[2] GenomicFeatures_1.8.1
[3] AnnotationDbi_1.18.0
[4] Biobase_2.16.0
[5] BSgenome.Hsapiens.UCSC.hg19_1.3.17
[6] BSgenome_1.24.0
[7] VariantAnnotation_1.2.5
[8] Rsamtools_1.8.3
[9] Biostrings_2.24.1
[10] GenomicRanges_1.8.3
[11] IRanges_1.14.2
[12] BiocGenerics_0.2.0
loaded via a namespace (and not attached):
[1] biomaRt_2.12.0 bitops_1.0-4.1 DBI_0.2-5 grid_2.16.0
[5] lattice_0.20-6 Matrix_1.0-6 RCurl_1.91-1 RSQLite_0.11.1
[9] rtracklayer_1.16.1 snpStats_1.6.0 splines_2.16.0 stats4_2.16.0
[13] survival_2.36-12 tools_2.16.0 XML_3.9-4 zlibbioc_1.2.0
More information about the Bioconductor
mailing list