[BioC] What populates makeTranscriptDbFromBiomart?

Tue Apr 17 01:23:45 CEST 2012

One more thing that I neglected to mention is that you can learn details
about your TranscriptDb object just by looking at the output of its
show method.  So, for example, if I have loaded:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
## then I can just look at the object like:
TxDb.Hsapiens.UCSC.hg19.knownGene

## And I see an output like this:
TranscriptDb object:
| Db type: TranscriptDb
| Supporting package: GenomicFeatures
| Data source: UCSC
| Genome: hg19
| Genus and Species: Homo sapiens
| UCSC Table: knownGene
| Resource URL: http://genome.ucsc.edu/
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| miRBase build ID: GRCh37
| transcript_nrow: 80922
| exon_nrow: 286852
| cds_nrow: 235842
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2012-03-12 21:45:23 -0700 (Mon, 12 Mar 2012)
| GenomicFeatures version at creation time: 1.7.30
| RSQLite version at creation time: 0.11.1
| DBSCHEMAVERSION: 1.0

This tells me a great deal about where the data came from, which build
of the genome it was based on, and so on.  A package based on biomaRt
looks similar.  For example:

library("TxDb.Athaliana.BioMart.plantsmart12")
TxDb.Athaliana.BioMart.plantsmart12

## Shows me this:
TranscriptDb object:
| Db type: TranscriptDb
| Supporting package: GenomicFeatures
| Data source: BioMart
| Genus and Species: Arabidopsis thaliana
| Resource URL: www.biomart.org:80
| BioMart database: plants_mart_12
| BioMart database version: ENSEMBL PLANTS 12 (EBI UK)
| BioMart dataset: athaliana_eg_gene
| BioMart dataset description: Arabidopsis thaliana genes (TAIR10)
| BioMart dataset version: TAIR10
| Full dataset: yes
| miRBase build ID: NA
| transcript_nrow: 41671
| exon_nrow: 171013
| cds_nrow: 0
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2012-03-13 09:54:23 -0700 (Tue, 13 Mar 2012)
| GenomicFeatures version at creation time: 1.7.30
| RSQLite version at creation time: 0.11.1
| DBSCHEMAVERSION: 1.0

This tells me exactly which BioMart data sources were used to
construct the database.
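
If you would rather pull these fields out in code than read them off the
screen, something along these lines may work.  This is only a sketch: it
assumes that metadata() on the TranscriptDb object returns a data.frame
with 'name' and 'value' columns, and get_meta_value() is a hypothetical
helper, not part of GenomicFeatures.  The table is a hand-built stand-in
copied from the hg19 output above, so the snippet runs without the
annotation package installed.

```r
## Hypothetical helper: return one field from a metadata table that has
## 'name' and 'value' columns (the layout assumed for metadata(txdb)).
get_meta_value <- function(md, key) {
    hit <- md$value[md$name == key]
    if (length(hit) == 0L) NA_character_ else as.character(hit[1L])
}

## Stand-in table, typed in by hand from the hg19 output shown above.
md <- data.frame(
    name  = c("Data source", "Genome", "UCSC Table", "transcript_nrow"),
    value = c("UCSC", "hg19", "knownGene", "80922"),
    stringsAsFactors = FALSE
)

## With the package installed, the real table would be used instead
## (assumption: metadata() is available for the object):
## library(TxDb.Hsapiens.UCSC.hg19.knownGene)
## md <- metadata(TxDb.Hsapiens.UCSC.hg19.knownGene)

genome <- get_meta_value(md, "Genome")
n_tx   <- as.integer(get_meta_value(md, "transcript_nrow"))
cat("Genome build:", genome, "\n")
cat("Transcripts: ", n_tx, "\n")
```

The same lookup on a BioMart-based object could be used to check fields
such as "BioMart dataset version" before comparing against other files.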

   Marc

On 04/16/2012 10:50 AM, Marc Carlson wrote:
> Hi Ravi,
>
> I think part of your question is about whether or not we can trust
> Ensembl to be internally consistent between what they put into their
> GTF files and what they expose via BioMart.  That's not really a
> Bioconductor question, since we only present what is available
> at the resource in question, but we can still use Bioconductor to ask
> questions about it.  You can, for example, use the import() method from
> rtracklayer to bring in the information from the GTF file and compare
> that to the information that makeTranscriptDbFromBiomart() assembles
> from BioMart.  I would encourage you to make comparisons if you feel
> motivated (but bear in mind that some kinds of data may not be
> present in the GTF file).  And if you should find any legitimate
> discrepancies, the people at Ensembl are usually quite responsive at
> explaining or correcting them (depending on what is appropriate).  But
> usually, there are no real problems with this resource.  The folks at
> Ensembl are highly reliable.
>
> But as Steve was pointing out, even if everything is the same you will 
> have reads that are just not part of the known transcriptome.  So some 
> proportion of your reads are not going to match up to anything that is 
> well characterized.  Of the reads that don't match up, some of them 
> are likely to be from unknown transcripts, and some will be noise, but 
> both are to be expected.
>
>
>   Marc
>
> On 04/16/2012 06:41 AM, Steve Lianoglou wrote:
>> Hi,
>>
>> On Sat, Apr 14, 2012 at 4:40 PM, Ravi Karra <ravi.karra at gmail.com> wrote:
>>> Hi,
>>>
>>> Just starting to learn how to look at RNA-Seq data, so apologies in
>>> advance.  I ran my RNA-Seq experiment on a GAII and aligned to the
>>> zebrafish genome using Bowtie2/TopHat2.  I downloaded the current
>>> zebrafish genome (Zv9) and transcript GTF file from Ensembl for the
>>> reference indices.  I am trying to use edgeR to look at
>>> differential expression, but am a little hung up on getting the
>>> count data.
>>>
>>> As you can see from the code below, I input 8835090 mapped reads, 
>>> but only 5380643 are overlapped with known transcripts.  It seems 
>>> that I am losing reads in summarizing the count data and I can't 
>>> really figure out why.   Is the transcript information that results 
>>> from makeTranscriptDbFromBiomart identical to the transcript 
>>> information in the gtf files that can be downloaded via Ensembl?
>> Assume for the moment that it is identical -- you will (for sure)
>> still have reads to regions where no transcripts are annotated. This
>> still happens in organisms with "better" annotations than zebrafish,
>> such as fruit fly, mouse, and human.
>>
>> The limits of our knowledge about what, where, and why regions of the
>> genome are transcribed can be as exciting as they are frustrating,
>> depending on which side of the fence you happen to be standing on a
>> particular day.
>>
>> -steve
>>