[BioC] predictCoding() return empty Granges

Fri Oct 5 08:36:24 CEST 2012

So the good news is that TAIR10 is the same assembly as TAIR9.
See last line in:

ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/README_whole_chromosomes.txt

We don't need to add a TAIR10 BSgenome package to Bioconductor.
You can use TAIR9.

Cheers,
H.

On 10/04/2012 11:01 PM, Hervé Pagès wrote:
> Hi Rebecca,
>
> On 10/04/2012 02:44 PM, sun wrote:
>> After I successfully adjusted the chromosomes of VCF and TranscriptDb
>> objects to match the names of the BSgenome object, I ran predictCoding(),
>> it returned nothing except a warning message, see below,
>>
>>> coding <- predictCoding(vcf, txdb, seqSource=Athaliana)
>> *Warning message:
>> In .setActiveSubjectSeq(query, subject) :
>>    circular sequence(s) in query 'chrM' ignored*
>>> coding
>> GRanges with 0 ranges and 1 metadata column:
>>     seqnames    ranges strand |      GENEID
>>        <Rle> <IRanges>  <Rle> | <character>
>>    ---
>>    seqlengths:
>>
>> Here are the values of  VCF, TranscriptDb and BSgenome Object,
>>
>>> vcf
>> class: VCF
>> dim: 551201 2
>> genome: TAIR10
>> exptData(1): header
>> fixed(4): REF ALT QUAL FILTER
>> info(18): DP DP4 ... PR VDB
>> geno(6): GT GQ ... SP PL
>> rownames(551201): Chr1:711 Chr1:956 ... ChrM:366919 ChrM:366920
>> *Qustion:
>> After used renameSeqlevels() to change chr name of vcf (Chr to chr), the
>> rownames here are still old chr name. not sure if this will be the
>> problem
>> for predictCoding(). *
>> rowData values names(1): paramRangeID
>> colnames(2): sample1_aln.sorted.bam sample2_aln.sorted.bam
>> colData names(1): Samples
>>
>>> txdb
>> TranscriptDb object:
>> | Db type: TranscriptDb
>> | Supporting package: GenomicFeatures
>> | Data source: TAIR 10
>> | Genus and Species: Arabodpsis
>> | miRBase build ID: NA
>> | transcript_nrow: 35386
>> | exon_nrow: 207465
>> | cds_nrow: 197160
>> | Db created by: GenomicFeatures package from Bioconductor
>> | Creation time: 2012-09-28 16:43:28 -0700 (Fri, 28 Sep 2012)
>> | GenomicFeatures version at creation time: 1.9.37
>> | RSQLite version at creation time: 0.11.1
>> | DBSCHEMAVERSION: 1.0
>>
>>> Athaliana
>> Arabidopsis genome
>> |
>> | organism: Arabidopsis thaliana (Arabidopsis)
>> | provider: TAIR
>> | provider version: 04232008
>> | release date: NA
>> | release name: dumped from ADB: Mar/14/08
>> |
>> | sequences (see '?seqnames'):
>> |   chr1  chr2  chr3  chr4  chr5  chrC  chrM
>> |
>> | (use the '$' or '[[' operator to access a given sequence)
>
> We provide 2 BSgenome packages for Arabidopsis: a very old one
>
>    BSgenome.Athaliana.TAIR.04232008
>
> with chromosome names and lengths:
>
>    > seqlengths(Athaliana)
>        chr1     chr2     chr3     chr4     chr5     chrC     chrM
>    30432563 19705359 23470805 18585042 26992728   154478   366924
>
> and a somewhat less old one:
>
>    BSgenome.Athaliana.TAIR.TAIR9
>
> with chromosome names and lengths:
>
>    > seqlengths(Athaliana)
>        Chr1     Chr2     Chr3     Chr4     Chr5     ChrM     ChrC
>    30427671 19698289 23459830 18585056 26975502   366924   154478
>
> They correspond to 2 different assemblies of course.
>
> 2 things:
>
> (1) The code I showed you in the previous thread for renaming
>      the sequence names of your TranscriptDb object was assuming
>      that you were using the less old BSgenome package:
>
>        seqlevels(txdb) <- sub("^c", "C", seqlevels(txdb))
>
>      However it seems that you are using the very old one (where
>      chr names are "chr*"), and that your txdb chr names are
>      "Chr*", so you would actually need to do something like:
>
>        seqlevels(txdb) <- sub("^C", "c", seqlevels(txdb))
>
> (2) However, and this is much more important than (1): it seems that
>      your VCF and TranscriptDb objects are based on the TAIR10 assembly.
>      So it makes no sense to use anything else but a BSgenome
>      package that contains the exact same assembly. Otherwise it's
>      very likely that most of the predictions returned by
>      predictCoding() will be wrong!
>
> Unfortunately at the moment we don't provide a BSgenome package for
> TAIR10 (nobody requested one so far). I'll try to make one in the
> next couple of weeks. I'll also remove
> BSgenome.Athaliana.TAIR.04232008 from the devel branch (BioC 2.12).
>
> Cheers,
> H.
>
>
>>
>>> sessionInfo()
>> R version 2.15.1 (2012-06-22)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>>   [1] BSgenome.Athaliana.TAIR.04232008_1.3.18
>> BSgenome_1.26.0
>> VariantAnnotation_1.4.0
>>   [4] Rsamtools_1.10.1
>> Biostrings_2.26.0
>> GenomicFeatures_1.10.0
>>   [7] AnnotationDbi_1.20.0
>> Biobase_2.18.0
>> GenomicRanges_1.10.0
>> [10] IRanges_1.16.2
>> BiocGenerics_0.4.0
>> vimcom_0.9-2
>> [13] setwidth_1.0-0
>> colorout_0.9-9
>>
>> loaded via a namespace (and not attached):
>>   [1] DBI_0.2-5          RCurl_1.95-0.1.2   RSQLite_0.11.2
>> XML_3.95-0.1       biomaRt_2.14.0     bitops_1.0-4.1
>>   [7] parallel_2.15.1    rtracklayer_1.18.0 stats4_2.15.1
>> tools_2.15.1       zlibbioc_1.4.0
>>
>> any ideas? thank you.
>>
>> Rebecca
>>
>>
>> On Thu, Oct 4, 2012 at 1:18 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>
>>> Hi Rebecca,
>>>
>>>
>>> On 10/04/2012 12:10 PM, sun wrote:
>>>
>>>> Hi All,
>>>>
>>>> I am going to use "coding <- predictCoding(vcf, txdb,
>>>> seqSource=Athaliana)"
>>>> to detect coding SNPs. The problem is that the chromosome names are not
>>>> consistent among VCF, txdb and BSgenome. In vcf, the chromosome name is
>>>> "Chr*", in txdb, the chr name is "Chr", but in BSgenome, the chr
>>>> name is
>>>> "chr*" .
>>>>
>>>> I know I can use renameSeqlevels() to adjust the seqlevels (chromosome
>>>> names) of the VCF object to match that of the txdb annotation. But
>>>> how can
>>>> I adjust the chr name of BSgenome or TranscriptDB?
>>>>
>>>
>>> In BioC 2.11 (released yesterday), you can rename the chromosomes of a
>>> TranscriptDb object, so you could rename the chromosomes of your
>>> VCF and TranscriptDb objects to match the names of the BSgenome object.
>>>
>>> E.g. for the TranscriptDb object:
>>>
>>>    seqlevels(txdb) <- sub("^c", "C", seqlevels(txdb))
>>>
>>> Note that renaming the chromosomes of a TranscriptDb object is a new
>>> feature and is not fully implemented yet. For example, if you use
>>> select() on the object you'll still get the original names (those
>>> stored in the db), and if you try to specify a chromosome name thru
>>> the 'vals' arg of the transcripts(), exons() and cds() extractors,
>>> you still need to use the original names. This will be addressed soon.
>>>
>>> Our plan is to also support renaming of the chromosomes of BSgenome
>>> and SNPlocs objects very soon.
>>>
>>> Also, an additional level of convenience will be provided via the
>>> seqnameStyle() getter and setter, so you'll be able to quickly rename
>>> with something like:
>>>
>>>    seqnameStyle(x) <- "UCSC"
>>>
>>> or
>>>
>>>    seqnameStyle(vcf) <- seqnameStyle(txdb) <- seqnameStyle(genome)
>>>
>>> This will work on almost any 'x' object that contains chromosome
>>> names (GRanges, GRangesList, GappedAlignments, TranscriptDb, VCF,
>>> BSgenome, SNPlocs, etc...)
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>>
>>>
>>>> Thanks,
>>>>
>>>> Rebecca
>>>>
>>>>          [[alternative HTML version deleted]]
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>>
>>>> Search the archives: http://news.gmane.org/gmane.**
>>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>>>
>>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>>
>>     [[alternative HTML version deleted]]
>>
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319