[Bioc-sig-seq] ChIPpeakAnno, BioMart, getAnnotation 'Exon' error message

Steffen Durinck sdurinck at lbl.gov
Thu Mar 18 06:49:41 CET 2010


Hi Julie,

This is a decision by the people who create and develop the database,
and your question about why they implemented things a certain way
(e.g. two rows for an exon id, each with a different description) can
probably only be answered by them.
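
Purely as a client-side illustration, here is a minimal sketch in base R
of how the duplicated exon ids could be listed and, if one row per exon
is needed for row names, collapsed into a single row. The toy data frame,
the abbreviated descriptions, the "EXAMPLE-E1" row and the " | " separator
are assumptions made for the example; this is not how getAnnotation or the
database handles these rows.

```r
# Toy table shaped like the exon query result in this thread.
# The first two rows reuse the ID and coordinates quoted below, with the
# descriptions abbreviated; the "EXAMPLE-E1" row is entirely hypothetical.
exons <- data.frame(
  ensembl_exon_id  = c("AT1G68552.1-E12203", "AT1G68552.1-E12203", "EXAMPLE-E1"),
  chromosome_name  = c("1", "1", "1"),
  exon_chrom_start = c(25727627, 25727627, 100),
  exon_chrom_end   = c(25727701, 25727701, 200),
  strand           = c(-1, -1, 1),
  description      = c("CPuORF53 (Conserved peptide upstream open reading frame 53)",
                       "AP2 domain-containing transcription factor, putative",
                       "hypothetical description"),
  stringsAsFactors = FALSE
)

# Exon IDs that occur on more than one row.
dup_ids <- unique(exons$ensembl_exon_id[duplicated(exons$ensembl_exon_id)])

# One row per exon: group on the ID and coordinates, and join the distinct
# descriptions. This assumes the coordinates agree for a given ID, as they
# do for the rows shown in this thread.
collapsed <- aggregate(
  description ~ ensembl_exon_id + chromosome_name +
    exon_chrom_start + exon_chrom_end + strand,
  data = exons,
  FUN  = function(d) paste(unique(d), collapse = " | ")
)

# The IDs are now unique, so they can serve as row names.
rownames(collapsed) <- collapsed$ensembl_exon_id
print(dup_ids)
print(collapsed)
```

Whether joining the descriptions is appropriate, rather than keeping the
rows separate, depends on what the database maintainers intended each row
to mean.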

biomaRt is a package that can speak to different BioMart databases, but
it doesn't control the data content or how the data are related to one
another in each database.
Each BioMart database is created by a different set of people;
plant_mart_4, for example, is maintained by the Ensembl team (see the
version column when you do listMarts() ).  The best place to find an
answer to your question is thus: helpdesk at ensembl.org

Cheers,
Steffen



On 3/17/10, Zhu, Julie <Julie.Zhu at umassmed.edu> wrote:
> Hi Wolfgang,
>
> Thank you very much for looking into this! For a single exon ID, there is
> more than one row, each with a different description. I thought this was an
> entry error. Now that you have pointed out that multiple rows for such an
> exon are actually intentional, I understand that it reflects our incomplete
> understanding of the exon rather than a database issue.  I am curious
> about the reason for having separate rows for such an exon instead of one
> row with all possible descriptions. Perhaps it is because different
> evidence is associated with each row. Is there a flag for each such row that
> indicates its rank or plausibility? If there is, I will incorporate the flag
> into the getAnnotation function in ChIPpeakAnno.  I deeply appreciate your
> thoughts on this.
>
> Thanks again for helping me to have a deeper understanding of the system!
>
> Best regards,
>
> Julie
>
>
> *******************************************
> Lihua Julie Zhu, Ph.D
> Research Associate Professor
> Program Gene Function and Expression
> University of Massachusetts Medical School
> 364 Plantation Street, Room 613
> Worcester, MA 01605
> 508-856-5256
> http://www.umassmed.edu/pgfe/faculty/zhu.cfm
> *******************************************
>
>
>
> On 3/17/10 9:48 AM, "Wolfgang Huber" <whuber at embl.de> wrote:
>
> Julie
>
> why do you say that "the database contains errors"? I had a look at
> http://gbrowse.arabidopsis.org/cgi-bin/gbrowse/arabidopsis/?name=AT1G68552.1
> and while this is perhaps a complex locus whose expression we have not
> yet fully understood, or not yet properly formalised into the database's
> ontology of genomic features and gene products, I am not sure "error" is
> the right term for that.
>
> Arabidopsis people might have more insight on that.
>
>         Wolfgang
>
>
>
>
> Zhu, Julie scripsit 16/03/10 22:56:
>> Hi,
>>
>> I obtained the exon sequences and here are the duplicate exon IDs with
>> different descriptions.
>>
>> TSS[duplicated(TSS[,1]), 1]
>>  [1] "AT1G68552.1-E12203"  "AT1G64140.1-E14755"  "AT1G64140.1-E14756"
>> "AT1G70780.1-E4116"
>>  [5] "AT1G75390.1-E22428"  "AT1G06149.1-E1988"   "AT1G36730.1-E35050"
>> "AT1G36730.1-E35051"
>>  [9] "AT1G29952.1-E5728"   "AT1G29952.1-E5730"   "AT1G29952.1-E5732"
>> "AT1G29970.2-E8863"
>> [13] "AT1G29970.2-E8864"   "AT1G64628.1-E10574"  "AT1G25470.1-E20679"
>> "AT1G58120.1-E18468"
>> [17] "AT1G29041.1-E15117"  "AT1G23149.1-E13728"  "AT1G29952.1-E5728"
>> "AT1G29952.1-E5732"
>> [21] "AT2G18162.1-E49029"  "AT3G51632.1-E98183"  "AT3G22970.1-E89708"
>> "AT3G45240.2-E86808"
>> [25] "AT3G18000.1-E98438"  "AT3G59052.1-E77046"  "AT3G62422.1-E76351"
>> "AT3G25570.1-E88575"
>> [29] "AT3G25570.1-E88576"  "AT3G10910.1-E77164"  "AT3G02468.1-E88931"
>> "AT3G12010.1-E78704"
>> [33] "AT3G01470.1-E92685"  "AT3G53402.1-E93478"  "AT3G26430.1-E85151"
>> "AT3G26430.1-E85154"
>> [37] "AT4G19110.1-E121565" "AT4G22592.1-E113550" "AT4G22592.1-E113551"
>> "AT4G22592.1-E113552"
>> [41] "AT4G12430.1-E113931" "AT4G12430.1-E113932" "AT4G12430.1-E113933"
>> "AT4G25670.1-E111076"
>> [45] "AT4G25670.1-E111077" "AT4G36990.1-E122859" "AT4G14620.1-E120308"
>> "AT4G34590.1-E116802"
>> [49] "AT5G09460.1-E136355" "AT5G09460.1-E136357" "AT5G50010.1-E151574"
>> "AT5G50010.1-E151576"
>> [53] "AT5G50010.1-E151574" "AT5G50011.1-E153108" "AT5G50011.1-E153110"
>> "AT5G09460.1-E136355"
>> [57] "AT5G09463.1-E151757" "AT5G09463.1-E151758" "AT5G52552.1-E136887"
>> "AT5G52552.1-E136888"
>> [61] "AT5G41992.1-E154552" "AT5G64341.1-E144370" "AT5G64341.1-E144371"
>> "AT5G64341.1-E144373"
>> [65] "AT5G64341.1-E144370" "AT5G64341.1-E144371" "AT5G64343.1-E148873"
>> "AT5G64341.1-E144373"
>> [69] "AT5G09460.1-E136355" "AT5G09463.1-E151757" "AT5G09460.1-E136357"
>> "AT5G09463.1-E151758"
>> [73] "AT5G49448.1-E171824" "AT5G05282.1-E152619" "AT5G53588.1-E159453"
>> "AT5G09670.2-E157563"
>> [77] "AT5G01710.1-E140929" "AT5G64341.1-E144370" "AT5G64343.1-E148873"
>> "AT5G61230.1-E153842"
>> [81] "AT5G61230.1-E153843" "AT5G60550.1-E140873" "AT5G64552.1-E148753"
>> "AT5G64552.1-E148754"
>> [85] "AT5G45430.1-E151338"
>>
>> For example,
>>
>> TSS[TSS[,1]=="AT1G68552.1-E12203",]
>>          ensembl_exon_id chromosome_name exon_chrom_start exon_chrom_end strand
>> 3125  AT1G68552.1-E12203               1         25727627       25727701     -1
>> 15537 AT1G68552.1-E12203               1         25727627       25727701     -1
>>
>> description (row 3125):
>>   CPuORF53 (Conserved peptide upstream open reading frame 53);
>>   Upstream open reading frames (uORFs) are small open reading frames found
>>   in the 5' UTR of a mature mRNA, and can potentially mediate translational
>>   regulation of the largest, or major, ORF (mORF). CPuORF53 represents a
>>   conserved upstream opening reading frame relative to major ORF AT1G68550.1
>>
>> description (row 15537):
>>   AP2 domain-containing transcription factor,
>>   putative; encodes a member of the ERF (ethylene response factor) subfamily
>>   B-6 of ERF/AP2 transcription factor family. The protein contains one AP2
>>   domain. There are 12 members in this subfamily including RAP2.11.
>>
>> So I think the database contains errors. In this case, it will require
>> manual curation to determine which row to choose. Did you contact Ensembl
>> about this? Thanks!
>>
>> Best regards,
>>
>> Julie
>>
>>
>>
>> On 3/5/10 6:46 PM, "pterry at huskers.unl.edu" <pterry at huskers.unl.edu>
>> wrote:
>>
>>
>>
>>  Dear bioc-sig-sequencing,
>>
>> I would like to annotate ChIP-seq peaks for the Arabidopsis genome.  "TSS"
>> and "Exon" are two of the values accepted by the featureType argument of the
>> 'getAnnotation' function.  "TSS" succeeded, but "Exon" failed.
>>
>> ...
>>> arabdset<-useMart(biomart="plant_mart_4", dataset = "athaliana_eg_gene")
>> Checking attributes ... ok
>> Checking filters ... ok
>>> ExonArabAnno<-getAnnotation(arabdset, featureType="Exon")
>> Error in `rownames<-`(`*tmp*`, value = c("ATCG00010.1-E176369",
>> "ATMG00010.1-E176520",  :
>>   duplicate rownames not allowed
>>
>>> sessionInfo()
>> R version 2.11.0 Under development (unstable) (2010-02-28 r51186)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>>  [1] ChIPpeakAnno_1.3.4                  org.Hs.eg.db_2.3.6
>>  [3] GO.db_2.3.5                         RSQLite_0.8-3
>>  [5] DBI_0.2-5                           AnnotationDbi_1.9.4
>>  [7] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.15.11
>>  [9] Biostrings_2.15.22                  IRanges_1.5.51
>> [11] multtest_2.3.0                      Biobase_2.7.4
>> [13] biomaRt_2.3.4
>>
>> loaded via a namespace (and not attached):
>> [1] MASS_7.3-5      RCurl_1.3-1     splines_2.11.0  survival_2.35-8
>> [5] tools_2.11.0    XML_2.6-0
>>
>> Can someone comment?
>>
>>
>> Thanks,
>> P. Terry
>> pterry at huskers.unl.edu
>>
>>         [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>
>
> --
>
> Best wishes
>       Wolfgang
>
>
> --
> Wolfgang Huber
> EMBL
> http://www.embl.de/research/units/genome_biology/huber/contact
>


