[BioC] predictCoding() return empty Granges

Fri Oct 5 08:01:26 CEST 2012

Hi Rebecca,

On 10/04/2012 02:44 PM, sun wrote:
> After I successfully adjusted the chromosomes of VCF and TranscriptDb
> objects to match the names of the BSgenome object, I ran predictCoding(),
> it returned nothing except a warning message, see below,
>
>> coding <- predictCoding(vcf, txdb, seqSource=Athaliana)
> *Warning message:
> In .setActiveSubjectSeq(query, subject) :
>    circular sequence(s) in query 'chrM' ignored*
>> coding
> GRanges with 0 ranges and 1 metadata column:
>     seqnames    ranges strand |      GENEID
>        <Rle> <IRanges>  <Rle> | <character>
>    ---
>    seqlengths:
>
> Here are the values of  VCF, TranscriptDb and BSgenome Object,
>
>> vcf
> class: VCF
> dim: 551201 2
> genome: TAIR10
> exptData(1): header
> fixed(4): REF ALT QUAL FILTER
> info(18): DP DP4 ... PR VDB
> geno(6): GT GQ ... SP PL
> rownames(551201): Chr1:711 Chr1:956 ... ChrM:366919 ChrM:366920  *Qustion:
> After used renameSeqlevels() to change chr name of vcf (Chr to chr), the
> rownames here are still old chr name. not sure if this will be the problem
> for predictCoding(). *
> rowData values names(1): paramRangeID
> colnames(2): sample1_aln.sorted.bam sample2_aln.sorted.bam
> colData names(1): Samples
>
>> txdb
> TranscriptDb object:
> | Db type: TranscriptDb
> | Supporting package: GenomicFeatures
> | Data source: TAIR 10
> | Genus and Species: Arabodpsis
> | miRBase build ID: NA
> | transcript_nrow: 35386
> | exon_nrow: 207465
> | cds_nrow: 197160
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2012-09-28 16:43:28 -0700 (Fri, 28 Sep 2012)
> | GenomicFeatures version at creation time: 1.9.37
> | RSQLite version at creation time: 0.11.1
> | DBSCHEMAVERSION: 1.0
>
>> Athaliana
> Arabidopsis genome
> |
> | organism: Arabidopsis thaliana (Arabidopsis)
> | provider: TAIR
> | provider version: 04232008
> | release date: NA
> | release name: dumped from ADB: Mar/14/08
> |
> | sequences (see '?seqnames'):
> |   chr1  chr2  chr3  chr4  chr5  chrC  chrM
> |
> | (use the '$' or '[[' operator to access a given sequence)

We provide 2 BSgenome packages for Arabidopsis: a very old one

   BSgenome.Athaliana.TAIR.04232008

with chromosome names and lengths:

   > seqlengths(Athaliana)
       chr1     chr2     chr3     chr4     chr5     chrC     chrM
   30432563 19705359 23470805 18585042 26992728   154478   366924

and a somewhat less old one:

   BSgenome.Athaliana.TAIR.TAIR9

with chromosome names and lengths:

   > seqlengths(Athaliana)
       Chr1     Chr2     Chr3     Chr4     Chr5     ChrM     ChrC
   30427671 19698289 23459830 18585056 26975502   366924   154478

They correspond to 2 different assemblies of course.

2 things:

(1) The code I showed you in the previous thread for renaming
     the sequence names of your TranscriptDb object was assuming
     that you were using the less old BSgenome package:

       seqlevels(txdb) <- sub("^c", "C", seqlevels(txdb))

     However it seems that you are using the very old one (where
     chr names are "chr*"), and that your txdb chr names are
     "Chr*", so you would actually need to do something like:

       seqlevels(txdb) <- sub("^C", "c", seqlevels(txdb))

(2) However, and this is much more important than (1): it seems that
     your VCF and TranscriptDb objects are based on the TAIR10 assembly.
     So it makes no sense to use anything else but a BSgenome
     package that contains the exact same assembly. Otherwise it's
     very likely that most of the predictions returned by
     predictCoding() will be wrong!

Unfortunately at the moment we don't provide a BSgenome package for
TAIR10 (nobody requested one so far). I'll try to make one in the
next couple of weeks. I'll also remove
BSgenome.Athaliana.TAIR.04232008 from the devel branch (BioC 2.12).

Cheers,
H.

>
>> sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] BSgenome.Athaliana.TAIR.04232008_1.3.18
> BSgenome_1.26.0
> VariantAnnotation_1.4.0
>   [4] Rsamtools_1.10.1
> Biostrings_2.26.0
> GenomicFeatures_1.10.0
>   [7] AnnotationDbi_1.20.0
> Biobase_2.18.0
> GenomicRanges_1.10.0
> [10] IRanges_1.16.2
> BiocGenerics_0.4.0
> vimcom_0.9-2
> [13] setwidth_1.0-0
> colorout_0.9-9
>
> loaded via a namespace (and not attached):
>   [1] DBI_0.2-5          RCurl_1.95-0.1.2   RSQLite_0.11.2
> XML_3.95-0.1       biomaRt_2.14.0     bitops_1.0-4.1
>   [7] parallel_2.15.1    rtracklayer_1.18.0 stats4_2.15.1
> tools_2.15.1       zlibbioc_1.4.0
>
> any ideas? thank you.
>
> Rebecca
>
>
> On Thu, Oct 4, 2012 at 1:18 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>> Hi Rebecca,
>>
>>
>> On 10/04/2012 12:10 PM, sun wrote:
>>
>>> Hi All,
>>>
>>> I am going to use "coding <- predictCoding(vcf, txdb,
>>> seqSource=Athaliana)"
>>> to detect coding SNPs. The problem is that the chromosome names are not
>>> consistent among VCF, txdb and BSgenome. In vcf, the chromosome name is
>>> "Chr*", in txdb, the chr name is "Chr", but in BSgenome, the chr name is
>>> "chr*" .
>>>
>>> I know I can use renameSeqlevels() to adjust the seqlevels (chromosome
>>> names) of the VCF object to match that of the txdb annotation. But how can
>>> I adjust the chr name of BSgenome or TranscriptDB?
>>>
>>
>> In BioC 2.11 (released yesterday), you can rename the chromosomes of a
>> TranscriptDb object, so you could rename the chromosomes of your
>> VCF and TranscriptDb objects to match the names of the BSgenome object.
>>
>> E.g. for the TranscriptDb object:
>>
>>    seqlevels(txdb) <- sub("^c", "C", seqlevels(txdb))
>>
>> Note that renaming the chromosomes of a TranscriptDb object is a new
>> feature and is not fully implemented yet. For example, if you use
>> select() on the object you'll still get the original names (those
>> stored in the db), and if you try to specify a chromosome name thru
>> the 'vals' arg of the transcripts(), exons() and cds() extractors,
>> you still need to use the original names. This will be addressed soon.
>>
>> Our plan is to also support renaming of the chromosomes of BSgenome
>> and SNPlocs objects very soon.
>>
>> Also, an additional level of convenience will be provided via the
>> seqnameStyle() getter and setter, so you'll be able to quickly rename
>> with something like:
>>
>>    seqnameStyle(x) <- "UCSC"
>>
>> or
>>
>>    seqnameStyle(vcf) <- seqnameStyle(txdb) <- seqnameStyle(genome)
>>
>> This will work on almost any 'x' object that contains chromosome
>> names (GRanges, GRangesList, GappedAlignments, TranscriptDb, VCF,
>> BSgenome, SNPlocs, etc...)
>>
>> Cheers,
>> H.
>>
>>
>>
>>
>>> Thanks,
>>>
>>> Rebecca
>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>
> 	[[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319