[BioC] problems with strand in predictCoding
Martin Morgan
mtmorgan at fhcrc.org
Fri Apr 20 19:03:43 CEST 2012
On 04/20/2012 07:28 AM, Tim Triche, Jr. wrote:
> I meant to put this in my initial reply, but I firmly agree with Michael
> Lawrence.
>
> The SummarizedExperiment class is a great thing, and subsetting it by (say)
> my.SE[ a.GRanges, some.patients ] is unbelievably convenient, especially
> if it needs to be done repeatedly, or by items in a GRangesList. The
> strand issue is a nasty gotcha sometimes.
Hi Tim -- kind of a different thread here, but I wanted to say that yes,
we've heard you. A problem is the many-to-many mapping that's possible,
for instance for some idx, we really expect nrow(x[idx,]) ==
length(idx); there are additional complexities introduced by, e.g., a
GRangesList and in the end the convenience for some use cases didn't
seem to outweigh the ambiguity and complexity for the general case.
Martin
>
> just MHO though.
>
>
>
> On Fri, Apr 20, 2012 at 7:25 AM, Tim Triche, Jr.<tim.triche at gmail.com>wrote:
>
>>> a Bioc Inferno book
>>
>> This is one of the better ideas to be bandied about lately...
>>
>>
>>
>>
>> On Fri, Apr 20, 2012 at 5:43 AM, Michael Lawrence<
>> lawrence.michael at gene.com> wrote:
>>
>>> On Thu, Apr 19, 2012 at 6:46 PM, Jeremiah Degenhardt<
>>> degenhardt.jeremiah at gene.com> wrote:
>>>
>>>> Hello BioC,
>>>>
>>>> I am using the predictCoding function in the VariantAnnotation package
>>>> and have run into a few issues.
>>>>
>>>> The first issue that I found is a small one. While I can supply any
>>>> txdb object that I want to the predictCoding, the function seems to
>>>> still be looking for TxDb.Hsapiens.UCSC.hg19.knownGene. Running
>>>> predict coding without this package installed results in the following
>>>> error:
>>>>
>>>> Error in isCircular(TxDb.Hsapiens.UCSC.hg19.knownGene) :
>>>> error in evaluating the argument 'x' in selecting a method for
>>>> function 'isCircular': Error: object
>>>> 'TxDb.Hsapiens.UCSC.hg19.knownGene' not found
>>>>
>>>> The second set of issues are a little more subtle. It seems that
>>>> predictCoding is not properly handling the strand of the variants, at
>>>> least in my opinion.
>>>>
>>>> If I provide an unstranded GRanges with variants that overlap a gene
>>>> on the negative strand, the predictCoding function will not reverse
>>>> complement the varAllele resulting in an incorrect functional
>>>> consequence prediction. Here is some code to generate an example
>>>>
>>>> library(VariantAnnotation)
>>>> library(BSgenome.Hsapiens.UCSC.hg19)
>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>> txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene
>>>> GR<- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1),
>>>> variant = DNAStringSet("T"))
>>>> predictCoding(GR, txdb, Hsapiens, values(GR)$variant)
>>>>
>>>> GRanges with 4 ranges and 13 elementMetadata cols:
>>>> seqnames ranges strand | variant
>>>> varAllele
>>>> <Rle> <IRanges> <Rle> |<DNAStringSet>
>>>> <DNAStringSet>
>>>> [1] chr17 [63554271, 63554271] * | T
>>>> T
>>>> [2] chr17 [63554271, 63554271] * | T
>>>> T
>>>> [3] chr17 [63554271, 63554271] * | T
>>>> T
>>>> [4] chr17 [63554271, 63554271] * | T
>>>> T
>>>> cdsLoc proteinLoc queryID txID
>>>> cdsID
>>>> <IRanges> <CompressedIntegerList> <integer> <character>
>>>> <integer>
>>>> [1] [468, 468] 156 1 65536
>>>> 193561
>>>> [2] [468, 468] 156 1 65537
>>>> 193561
>>>> [3] [468, 468] 156 1 65538
>>>> 193561
>>>> [4] [468, 468] 156 1 65539
>>>> 193564
>>>> geneID consequence refCodon varCodon
>>>> refAA
>>>> <character> <factor> <DNAStringSet> <DNAStringSet>
>>>> <AAStringSet>
>>>> [1] 8313 synonymous TAC TAT
>>>> Y
>>>> [2] 8313 synonymous TAC TAT
>>>> Y
>>>> [3] 8313 synonymous TAC TAT
>>>> Y
>>>> [4] 8313 synonymous TAC TAT
>>>> Y
>>>> varAA
>>>> <AAStringSet>
>>>> [1] Y
>>>> [2] Y
>>>> [3] Y
>>>> [4] Y
>>>> ---
>>>> seqlengths:
>>>> chr17
>>>> NA
>>>>
>>>> Running this the function predicts this to be TAC> TAT, a synonymous
>>>> change. However, as the gene is on the negative strand the actual
>>>> result of this variant should be TAC> TAA or a stop mutation.
>>>>
>>>> If I instead give predictCoding a stranded GRanges to let the function
>>>> know that the base reported is with respect to the positive strand I
>>>> get this result:
>>>>
>>>>> GR<- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1),
>>>> strand="+", variant = DNAStringSet("T"))
>>>>> predictCoding(GR, txdb, Hsapiens, values(GR)$variant)
>>>> GRanges with 0 ranges and 0 elementMetadata cols:
>>>>
>>>> seqnames ranges strand
>>>> <Rle> <IRanges> <Rle>
>>>> ---
>>>> seqlengths:
>>>>
>>>>
>>>> Now instead of reverse complementing the varAllele, and giving me the
>>>> correct annotation, the function simply misses that the mutation
>>>> overlaps with any gene and returns an empty GRanges.
>>>>
>>>> This issue is not really a bug in predictCoding, but results from what
>>>> is (again in my opinion) an unfortunate treatment of the negative and
>>>> positive strand as separate entities in GRanges. From a biological
>>>> prospective this almost never makes sense and is not how I would
>>>> expect the functions in GRanges to behave. Just because a gene is
>>>> annotated on the negative strand does not mean that I don't want to
>>>> know it overlaps with a gene on the positive strand. Maybe there are
>>>> cases where this is the desired behavior, so it might make sense to
>>>> add this a switch you could turn on or off but in most cases this will
>>>> simply lead to erroneous results such as the one above.
>>>
>>>
>>> There is an "ignore.strand" argument to findOverlaps, so we have a switch.
>>> I have always thought that strand should be ignored by default in
>>> operations like overlap detection, and only considered as a "direction"
>>> rather than as separate in space. It's very useful for resize() and
>>> flank()
>>> to consider strand, but not so useful for findOverlaps. The
>>> ignore.strand=FALSE in those cases default would qualify for the eight
>>> circle if there were a Bioc Inferno book. It's only the default that I
>>> argue with though, having the capability to consider strand is useful.
>>>
>>>
>>>> I think the
>>>> most useful information that comes from knowing the strand is knowing
>>>> when you need to reverse complement a sequence for the underlying
>>>> position of interest.
>>>>
>>>> A third possible option here would be to call the function twice, once
>>>> with for the positive strand and once for the negative strand with the
>>>> variants reverse complemented and then merge the results. This
>>>> however, seems quite inefficient and not like something that I should
>>>> have to do.
>>>>
>>>> > From my perspective the proper behavior of predictCoding with respect
>>>> to strand would be to treat unstranded GRanges as positive strand as
>>>> this is the reference strand for things like the genome builds and vcf
>>>> files. Then the variants should overlap positive and negative strand
>>>> genes and should be reverse complemented for consequence prediction on
>>>> the negative strand genes. It should also allow overlap of negative
>>>> and positive stranded variants with both negative and positive
>>>> stranded genes, but properly reverse complement the variant in each
>>>> case to get the proper consequence.
>>>>
>>>> thanks in advance,
>>>>
>>>> Jeremiah
>>>>
>>>> here is my session info incase it is needed:
>>>>
>>>>> sessionInfo()
>>>> R Under development (unstable) (2012-04-11 r58985)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>> [7] LC_PAPER=C LC_NAME=C
>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>> other attached packages:
>>>> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1
>>>> [2] GenomicFeatures_1.8.1
>>>> [3] AnnotationDbi_1.18.0
>>>> [4] Biobase_2.16.0
>>>> [5] BSgenome.Hsapiens.UCSC.hg19_1.3.17
>>>> [6] BSgenome_1.24.0
>>>> [7] VariantAnnotation_1.2.5
>>>> [8] Rsamtools_1.8.1
>>>> [9] Biostrings_2.24.1
>>>> [10] GenomicRanges_1.8.3
>>>> [11] IRanges_1.14.2
>>>> [12] BiocGenerics_0.2.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] biomaRt_2.12.0 bitops_1.0-4.1 compiler_2.16.0
>>>> DBI_0.2-5
>>>> [5] grid_2.16.0 lattice_0.20-6 Matrix_1.0-6
>>>> RCurl_1.91-1
>>>> [9] RSQLite_0.11.1 rtracklayer_1.16.1 snpStats_1.6.0
>>>> splines_2.16.0
>>>> [13] stats4_2.16.0 survival_2.36-12 tools_2.16.0
>>>> XML_3.9-4
>>>> [17] zlibbioc_1.2.0
>>>>
>>>> --
>>>> Jeremiah Degenhardt, Ph.D.
>>>> Computational Biologist
>>>> Bioinformatics and Computational Biology
>>>> Genentech, Inc.
>>>> degenhardt.jeremiah at gene.com
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>>
>> --
>> *A model is a lie that helps you see the truth.*
>> *
>> *
>> Howard Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>
>>
>
>
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the Bioconductor
mailing list