[BioC] What populates makeTranscriptDbFromBiomart?
Marc Carlson
mcarlson at fhcrc.org
Tue Apr 17 01:23:45 CEST 2012
One more thing that I neglected to mention is that you can learn details
about your transcriptDb object by just looking at the output of it's
show method. So for example if I have loaded:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
## then I can just look at the object like:
TxDb.Hsapiens.UCSC.hg19.knownGene
## And I see an output like this:
TranscriptDb object:
| Db type: TranscriptDb
| Supporting package: GenomicFeatures
| Data source: UCSC
| Genome: hg19
| Genus and Species: Homo sapiens
| UCSC Table: knownGene
| Resource URL: http://genome.ucsc.edu/
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| miRBase build ID: GRCh37
| transcript_nrow: 80922
| exon_nrow: 286852
| cds_nrow: 235842
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2012-03-12 21:45:23 -0700 (Mon, 12 Mar 2012)
| GenomicFeatures version at creation time: 1.7.30
| RSQLite version at creation time: 0.11.1
| DBSCHEMAVERSION: 1.0
Which tells me a whole lot about where this data came from, what build
of the genome it was based upon etc. Now if I had looked at a package
based on biomaRt it would look similar. For example:
library("TxDb.Athaliana.BioMart.plantsmart12")
TxDb.Athaliana.BioMart.plantsmart12
## Shows me this:
TranscriptDb object:
| Db type: TranscriptDb
| Supporting package: GenomicFeatures
| Data source: BioMart
| Genus and Species: Arabidopsis thaliana
| Resource URL: www.biomart.org:80
| BioMart database: plants_mart_12
| BioMart database version: ENSEMBL PLANTS 12 (EBI UK)
| BioMart dataset: athaliana_eg_gene
| BioMart dataset description: Arabidopsis thaliana genes (TAIR10)
| BioMart dataset version: TAIR10
| Full dataset: yes
| miRBase build ID: NA
| transcript_nrow: 41671
| exon_nrow: 171013
| cds_nrow: 0
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2012-03-13 09:54:23 -0700 (Tue, 13 Mar 2012)
| GenomicFeatures version at creation time: 1.7.30
| RSQLite version at creation time: 0.11.1
| DBSCHEMAVERSION: 1.0
Which tells me exactly which biomaRt data sources were used for
constructing the database.
Marc
On 04/16/2012 10:50 AM, Marc Carlson wrote:
> Hi Ravi,
>
> I think part of your question is about whether or not we can trust
> ensembl to be internally consistent between what they put into their
> gtf files and what they expose via biomaRt. That's not really a
> bioconductor question since we really only present what is available
> at the resource in question, but we can still use bioconductor to ask
> questions about it. You can for example use the import() method from
> rtracklayer to bring the information in from the gtf file and compare
> that to the information that makeTranscriptDbFromBiomart() assembles
> from biomaRt. I would encourage you to make comparisons if you feel
> motivated, (but bear in mind that some kinds of data may not be
> present in the GTF file). And if you should find any legitimate
> discrepancies, the people at ensembl are usually quite responsive at
> explaining or correcting them (depending on what is appropriate). But
> usually, there are no real problems with this resource. The folks at
> ensembl are highly reliable.
>
> But as Steve was pointing out, even if everything is the same you will
> have reads that are just not part of the known transcriptome. So some
> proportion of your reads are not going to match up to anything that is
> well characterized. Of the reads that don't match up, some of them
> are likely to be from unknown transcripts, and some will be noise, but
> both are to be expected.
>
>
> Marc
>
>
>
>
> On 04/16/2012 06:41 AM, Steve Lianoglou wrote:
>> Hi,
>>
>> On Sat, Apr 14, 2012 at 4:40 PM, Ravi Karra<ravi.karra at gmail.com>
>> wrote:
>>> Hi,
>>>
>>> Just starting to learn how to look at RNA Seq data, so apologies in
>>> advance. I ran my RNA-Seq experiment on a GAII and aligned to the
>>> zebrafish genome using Bowtie2/Tophat2. I downloaded the current
>>> zebrafish genome (Zv9) and transcript gtf file from Ensembl for the
>>> reference indices. I am trying to use edgeR to look at
>>> differential expression, but am a little hung up on getting the
>>> count data.
>>>
>>> As you can see from the code below, I input 8835090 mapped reads,
>>> but only 5380643 are overlapped with known transcripts. It seems
>>> that I am losing reads in summarizing the count data and I can't
>>> really figure out why. Is the transcript information that results
>>> from makeTranscriptDbFromBiomart identical to the transcript
>>> information in the gtf files that can be downloaded via Ensembl?
>> Assume for the moment that it is identical -- you will (for sure)
>> still have reads to regions where no transcripts are annotated. This
>> still happens in organisms with "better" annotations than zebrafish,
>> such as fruit fly, mouse, and human.
>>
>> The limit of our knowledge about what, where, and why regions of the
>> genome are transcribed can be equally exciting as it is frustrating
>> depending on which side of the fence you happen to be standing on a
>> particular day.
>>
>> -steve
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list