[BioC] problems with strand in predictCoding

Jeremiah Degenhardt degenhardt.jeremiah at gene.com
Fri Apr 20 03:46:26 CEST 2012


Hello BioC,

I am using the predictCoding function in the VariantAnnotation package
and have run into a few issues.

The first issue that I found is a small one. While I can supply any
txdb object that I want to the predictCoding, the function seems to
still be looking for TxDb.Hsapiens.UCSC.hg19.knownGene. Running
predict coding without this package installed results in the following
error:

Error in isCircular(TxDb.Hsapiens.UCSC.hg19.knownGene) :
  error in evaluating the argument 'x' in selecting a method for
function 'isCircular': Error: object
'TxDb.Hsapiens.UCSC.hg19.knownGene' not found

The second set of issues are a little more subtle. It seems that
predictCoding is not properly handling the strand of the variants, at
least in my opinion.

If I provide an unstranded GRanges with variants that overlap a gene
on the negative strand, the predictCoding function will not reverse
complement the varAllele resulting in an incorrect functional
consequence prediction. Here is some code to generate an example

library(VariantAnnotation)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
GR <- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1),
variant = DNAStringSet("T"))
predictCoding(GR, txdb, Hsapiens, values(GR)$variant)

GRanges with 4 ranges and 13 elementMetadata cols:
      seqnames               ranges strand |        variant
varAllele
         <Rle>            <IRanges>  <Rle> | <DNAStringSet>
<DNAStringSet>
  [1]    chr17 [63554271, 63554271]      * |              T
  T
  [2]    chr17 [63554271, 63554271]      * |              T
  T
  [3]    chr17 [63554271, 63554271]      * |              T
  T
  [4]    chr17 [63554271, 63554271]      * |              T
  T
          cdsLoc              proteinLoc   queryID        txID
cdsID
       <IRanges> <CompressedIntegerList> <integer> <character>
<integer>
  [1] [468, 468]                     156         1       65536
193561
  [2] [468, 468]                     156         1       65537
193561
  [3] [468, 468]                     156         1       65538
193561
  [4] [468, 468]                     156         1       65539
193564
           geneID consequence       refCodon       varCodon
refAA
      <character>    <factor> <DNAStringSet> <DNAStringSet>
<AAStringSet>
  [1]        8313  synonymous            TAC            TAT
 Y
  [2]        8313  synonymous            TAC            TAT
 Y
  [3]        8313  synonymous            TAC            TAT
 Y
  [4]        8313  synonymous            TAC            TAT
 Y
              varAA
      <AAStringSet>
  [1]             Y
  [2]             Y
  [3]             Y
  [4]             Y
  ---
  seqlengths:
   chr17
      NA

Running this the function predicts this to be TAC> TAT, a synonymous
change. However, as the gene is on the negative strand the actual
result of this variant should be TAC> TAA or a stop mutation.

If I instead give predictCoding a stranded GRanges to let the function
know that the base reported is with respect to the positive strand I
get this result:

> GR <- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), strand="+", variant = DNAStringSet("T"))
> predictCoding(GR, txdb, Hsapiens, values(GR)$variant)
GRanges with 0 ranges and 0 elementMetadata cols:

   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  ---
  seqlengths:


Now instead of reverse complementing the varAllele, and giving me the
correct annotation, the function simply misses that the mutation
overlaps with any gene and returns an empty GRanges.

This issue is not really a bug in predictCoding, but results from what
is (again in my opinion) an unfortunate treatment of the negative and
positive strand as separate entities in GRanges. From a biological
prospective this almost never makes sense and is not how I would
expect the functions in GRanges to behave. Just because a gene is
annotated on the negative strand does not mean that I don't want to
know it overlaps with a gene on the positive strand. Maybe there are
cases where this is the desired behavior, so it might make sense to
add this a switch you could turn on or off but in most cases this will
simply lead to erroneous results such as the one above. I think the
most useful information that comes from knowing the strand is knowing
when you need to reverse complement a sequence for the underlying
position of interest.

A third possible option here would be to call the function twice, once
with for the positive strand and once for the negative strand with the
variants reverse complemented and then merge the results. This
however, seems quite inefficient and not like something that I should
have to do.

>From my perspective the proper behavior of predictCoding with respect
to strand would be to treat unstranded GRanges as positive strand as
this is the reference strand for things like the genome builds and vcf
files. Then the variants should overlap positive and negative strand
genes and should be reverse complemented for consequence prediction on
the negative strand genes. It should also allow overlap of negative
and positive stranded variants with both negative and positive
stranded genes, but properly reverse complement the variant in each
case to get the proper consequence.

thanks in advance,

Jeremiah

here is my session info incase it is needed:

> sessionInfo()
R Under development (unstable) (2012-04-11 r58985)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1
 [2] GenomicFeatures_1.8.1
 [3] AnnotationDbi_1.18.0
 [4] Biobase_2.16.0
 [5] BSgenome.Hsapiens.UCSC.hg19_1.3.17
 [6] BSgenome_1.24.0
 [7] VariantAnnotation_1.2.5
 [8] Rsamtools_1.8.1
 [9] Biostrings_2.24.1
[10] GenomicRanges_1.8.3
[11] IRanges_1.14.2
[12] BiocGenerics_0.2.0

loaded via a namespace (and not attached):
 [1] biomaRt_2.12.0     bitops_1.0-4.1     compiler_2.16.0
DBI_0.2-5
 [5] grid_2.16.0        lattice_0.20-6     Matrix_1.0-6
RCurl_1.91-1
 [9] RSQLite_0.11.1     rtracklayer_1.16.1 snpStats_1.6.0
splines_2.16.0
[13] stats4_2.16.0      survival_2.36-12   tools_2.16.0
XML_3.9-4
[17] zlibbioc_1.2.0

--
Jeremiah Degenhardt, Ph.D.
Computational Biologist
Bioinformatics and Computational Biology
Genentech, Inc.
degenhardt.jeremiah at gene.com



More information about the Bioconductor mailing list