[BioC] problem creating own org.Ss.eg.db

Marc Carlson mcarlson at fhcrc.org
Thu Mar 1 01:17:43 CET 2012


Hi Guido,

I am not surprised by any of that.  Annotations change more than most 
people expect.  The pig org package is built specially here, along with 
all the other major organism packages that we support.  It is not 
generated by the method you used, because that method has to work for 
ALL organisms that are at NCBI.  That means that when you use that 
method, you don't get any of the extras that we add for you here.  The 
auto-generated version is built using only information from NCBI, with 
the help of some external GO mappings.  It is really not meant as a way 
to get a newer package; it is meant to let people who work with 
non-model organisms get annotations.  So it is expected that some things 
will be missing unless you want to do the extra work of finding those 
things and adding them back in manually.
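
To see concretely which mappings went missing, one can compare the 
mapping names that each package reports.  The sketch below does this in 
base R, using only the names printed in the quoted QC output further 
down (with the package prefix stripped).  Note that the listing for the 
auto-generated package stopped at an error, so its list may be 
incomplete; the variable names are just for illustration.

```r
# Mapping names taken from the QC listings quoted below (prefix stripped).
# The custom listing stopped at an error, so it may be incomplete.
official <- c("ACCNUM", "ACCNUM2EG", "ALIAS2EG", "CHR", "ENZYME",
              "ENZYME2EG", "GENENAME", "GO", "GO2ALLEGS", "GO2EG",
              "PATH", "PATH2EG", "PMID", "PMID2EG", "REFSEQ",
              "REFSEQ2EG", "SYMBOL", "SYMBOL2EG", "UNIGENE",
              "UNIGENE2EG", "UNIPROT")
custom <- c("ALIAS2EG", "CHR", "GENENAME", "GO", "GO2ALLEGS",
            "GO2EG", "REFSEQ")

# Mappings in the official package that the custom listing did not show
missing_in_custom <- setdiff(official, custom)
print(missing_in_custom)
```

With both packages attached, the same comparison could presumably be 
made on the output of ls("package:org.Ss.eg.db") and 
ls("package:org.SscrofaGH.eg.db") after stripping the differing 
prefixes.  The missing set includes PATH and ENZYME, which would be 
consistent with the KEGG info being absent.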

And it is really not possible to keep the "too new" GO terms when you 
generate the package either, because that would break a lot of software 
that depends on GO.db being in sync with the organism packages.  The 
only way around that would be to also generate a new GO.db package from 
scratch.  You could do that, and if you did (and installed it) the 
method would stop dropping those "too new" GO terms, but generating that 
package from scratch would be a lot of work.  And even if you used it, 
you would lose some of the benefits of using versioned annotation 
packages.  Personally, I would never recommend that strategy; I only 
mention it so that you can understand what is happening here (and why).
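
As an illustration of what "dropping GO IDs that are too new" amounts 
to, here is a minimal sketch in base R.  It is not the actual code 
inside makeOrgPackageFromNCBI; the data frame, the gene IDs, the 
placeholder ID "GO:9999999", and the helper name drop_too_new are all 
made up for the example.  The idea is simply to keep only those rows 
whose GO ID is known to the installed GO.db.

```r
# Hypothetical helper: keep only rows whose GO ID is in a known set.
drop_too_new <- function(gene2go, known_go_ids) {
  keep <- gene2go$go_id %in% known_go_ids
  message("Dropping ", sum(!keep), " row(s) with GO IDs not in the known set")
  gene2go[keep, , drop = FALSE]
}

# Toy data (made-up gene IDs; "GO:9999999" stands in for a too-new term)
gene2go <- data.frame(
  gene_id = c("1001", "1001", "1002"),
  go_id   = c("GO:0008150", "GO:9999999", "GO:0003674"),
  stringsAsFactors = FALSE
)

# With GO.db installed, the known set could instead come from something
# like AnnotationDbi::keys(GO.db::GOTERM); a fixed vector is used here
# only to keep the sketch self-contained.
known_go_ids <- c("GO:0008150", "GO:0003674", "GO:0005575")

filtered <- drop_too_new(gene2go, known_go_ids)
print(filtered)
```

Because the filter is defined relative to whatever GO.db is installed, 
the organism package and GO.db cannot disagree about which terms exist, 
which is the point of keeping them in sync.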

A new release of Bioconductor should drop in about a month, and with it 
will come an update to org.Ss.eg.db.  If you are feeling impatient, 
there should be a new package in devel even sooner than that.


   Marc


On 02/28/2012 02:45 PM, Hooiveld, Guido wrote:
> Hi,
> Triggered by a recent comment from Herve on this list [stating that it would be relatively easy to create your own org.xx.eg.db annotation package using the function 'makeOrgPackageFromNCBI'], I decided to create my own instance of the annotation library org.Ss.eg.db. The reason is that since the latest BioC release in October 2011, NCBI has made available a major update of the annotation info for pig, which I would already like to use (to be precise, scrofa10.2 was released earlier this year: http://www.ncbi.nlm.nih.gov/mapview/stats/BuildStats.cgi?taxid=9823&build=4&ver=1).
>
> However, by doing so some issues arose:
> - my instance of the org.Ss.eg database is apparently incomplete; some fields are dropped when creating the db, and I also noticed this when comparing the content of the 'official' BioC-provided org.db with that of mine (KEGG info seems to be lacking). Also, an error is reported when listing the content of my org.db (the RefSeq-to-EG mappings are not included). However, with respect to e.g. Gene Ontology mappings my instance of the org.db seems to be more complete, since more genes now have a GO mapping (33506 out of 33506 vs 5730 out of 34084). Still, I don't fully trust this because of the aforementioned dropping of fields. Full output below.
> - during the creation of the db, some GO terms are apparently too new. Would it somehow be possible to also include these 'too new' terms in the org.db?
>
> Any feedback would be appreciated.
>
> Thanks,
> Guido
>
>
>> library(AnnotationDbi)
> Loading required package: Biobase
>
> Welcome to Bioconductor
>
>    Vignettes contain introductory material. To view, type
>    'browseVignettes()'. To cite Bioconductor, see
>    'citation("Biobase")' and for packages 'citation("pkgname")'.
>
>> makeOrgPackageFromNCBI(version = "0.1",
> +                        author = "Guido Hooiveld<guido.hooiveld at wur.nl>",
> +                        maintainer = "Guido Hooiveld<guido.hooiveld at wur.nl>",
> +                        outputDir = ".",
> +                        tax_id = "9823",
> +                        genus = "Sus",
> +                        species = "scrofaGH")
> Loading required package: RSQLite
> Loading required package: DBI
> Loading required package: GO.db
>
> Getting data for gene2pubmed.gz
> Loading required package: RCurl
> Loading required package: bitops
> Populating gene2pubmed table:
> table gene2pubmed filled
> Getting data for gene2accession.gz
> Populating gene2accession table:
> table gene2accession filled
> Getting data for gene2refseq.gz
> Populating gene2refseq table:
> table gene2refseq filled
> Getting data for gene2unigene
> Populating gene2unigene table:
> table gene2unigene filled
> Getting data for gene_info.gz
> Populating gene_info table:
> table gene_info filled
> Getting data for gene2go.gz
> Populating gene2go table:
> Getting blast2GO data as a substitute for gene2go
> table metadata filled
> table map_metadata filled
> table gene2go filled
> table metadata filled
> table map_metadata filled
> Populating genes table:
> genes table filled
> Populating gene_info_temp table:
> gene_info_temp table filled
> Populating alias table:
> alias table filled
> Populating chromosomes table:
> chromosomes table filled
> Populating pubmed table:
> pubmed table filled
> Populating refseq table:
> refseq table filled
> Populating accessions table:
> accessions table filled
> Populating unigene table:
> unigene table filled
> Dropping GO IDs that are too new for the current GO.db
> Dropping GO IDs that are too new for the current GO.db
> Dropping GO IDs that are too new for the current GO.db
> Populating go_bp table:
> go_bp table filled
> Populating go_mf table:
> go_mf table filled
> Populating go_cc table:
> go_cc table filled
> Populating go_bp_all table:
> go_bp_all table filled
> Populating go_mf_all table:
> go_mf_all table filled
> Populating go_cc_all table:
> go_cc_all table filled
> dropping table gene2pubmeddropping table gene2accessiondropping table gene2refseqdropping table gene2unigenedropping table gene_infodropping table gene2go
> SELECT count(DISTINCT g.gene_id) FROM gene_info AS t, genes as g WHERE t._id=g._id AND t.gene_name NOT NULL
> SELECT count(DISTINCT g.gene_id) FROM gene_info AS t, genes as g WHERE t._id=g._id AND t.symbol NOT NULL
> SELECT count(DISTINCT t.symbol) FROM gene_info AS t, genes as g WHERE t._id=g._id AND t.symbol NOT NULL
> SELECT count(DISTINCT g.gene_id) FROM chromosomes AS t, genes as g WHERE t._id=g._id AND t.chromosome NOT NULL
> SELECT count(DISTINCT g.gene_id) FROM refseq AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
> SELECT count(DISTINCT t.accession) FROM refseq AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
> SELECT count(DISTINCT g.gene_id) FROM unigene AS t, genes as g WHERE t._id=g._id AND t.unigene_id NOT NULL
> SELECT count(DISTINCT t.unigene_id) FROM unigene AS t, genes as g WHERE t._id=g._id AND t.unigene_id NOT NULL
> SELECT count(DISTINCT g.gene_id) FROM accessions AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
> SELECT count(DISTINCT t.accession) FROM accessions AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
> SELECT count(DISTINCT g.gene_id) FROM alias AS t, genes as g WHERE t._id=g._id AND t.alias_symbol NOT NULL
> table map_counts filled
> Creating package in ./org.SscrofaGH.eg.db
> [1] TRUE
>
>
>
> <<content of my instance of org.Ss.eg.db>>
>> library(org.SscrofaGH.eg.db)
> Loading required package: AnnotationDbi
> Loading required package: Biobase
>
> Welcome to Bioconductor
>
>    Vignettes contain introductory material. To view, type
>    'browseVignettes()'. To cite Bioconductor, see
>    'citation("Biobase")' and for packages 'citation("pkgname")'.
>
> Loading required package: DBI
>> org.SscrofaGH.eg.db
> OrgDb object:
> | BL2GOSOURCEDATE: Tue Feb 28 12:50:25 2012
> | BL2GOSOURCENAME: blast2GO
> | BL2GOSOURCEURL: http://www.blast2go.de/
> | DBSCHEMAVERSION: 2.1
> | DBSCHEMA: ORGANISM_DB
> | ORGANISM: Sus scrofaGH
> | SPECIES: Sus ScrofaGH
> | CENTRALID: EG
> | TAXID: 9823
> | EGSOURCEDATE: Tue Feb 28 12:50:27 2012
> | EGSOURCENAME: Entrez Gene
> | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
> | GOSOURCEDATE: 20110910
> | GOSOURCENAME: Gene Ontology
> | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godata
> | GOEGSOURCEDATE: Tue Feb 28 12:50:27 2012
> | GOEGSOURCENAME: Entrez Gene
> | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
> | Db type: OrgDb
> | package: AnnotationDbi
>> org.SscrofaGH.eg()
> Quality control information for org.SscrofaGH.eg:
>
> This package has the following mappings:
>
> org.SscrofaGH.egALIAS2EG has 33506 mapped keys (of 33506 keys)
> org.SscrofaGH.egCHR has 33506 mapped keys (of 33506 keys)
> org.SscrofaGH.egGENENAME has 33506 mapped keys (of 33506 keys)
> org.SscrofaGH.egGO has 33506 mapped keys (of 33506 keys)
> org.SscrofaGH.egGO2ALLEGS has 33506 mapped keys (of 10755 keys)
> org.SscrofaGH.egGO2EG has 33506 mapped keys (of 7256 keys)
> org.SscrofaGH.egREFSEQ has 33506 mapped keys (of 33506 keys)
> Error in get(mapname) : object 'org.SscrofaGH.egREFSEQ2EG' not found
>
>
> <<content of original, BioC-provided org.Ss.eg.db>>
>> library(org.Ss.eg.db)
>> org.Ss.eg.db
> OrgDb object:
> | DBSCHEMAVERSION: 2.1
> | Db type: OrgDb
> | package: AnnotationDbi
> | DBSCHEMA: PIG_DB
> | ORGANISM: Sus scrofa
> | SPECIES: Pig
> | EGSOURCEDATE: 2011-Sep14
> | EGSOURCENAME: Entrez Gene
> | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
> | CENTRALID: EG
> | TAXID: 9823
> | GOSOURCENAME: Gene Ontology
> | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
> | GOSOURCEDATE: 20110910
> | GOEGSOURCEDATE: 2011-Sep14
> | GOEGSOURCENAME: Entrez Gene
> | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
> | KEGGSOURCENAME: KEGG GENOME
> | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
> | KEGGSOURCEDATE: 2011-Mar15
> | BL2GOSOURCENAME: blast2GO
> | BL2GOSOURCEURL: http://www.blast2go.de/
> | BL2GOSOURCEDATE: 2011-Mar2
>> org.Ss.eg
> Quality control information for org.Ss.eg:
>
> This package has the following mappings:
>
> org.Ss.egACCNUM has 24639 mapped keys (of 34084 keys)
> org.Ss.egACCNUM2EG has 74012 mapped keys (of 74012 keys)
> org.Ss.egALIAS2EG has 29916 mapped keys (of 29916 keys)
> org.Ss.egCHR has 33656 mapped keys (of 34084 keys)
> org.Ss.egENZYME has 1657 mapped keys (of 34084 keys)
> org.Ss.egENZYME2EG has 818 mapped keys (of 818 keys)
> org.Ss.egGENENAME has 34084 mapped keys (of 34084 keys)
> org.Ss.egGO has 5730 mapped keys (of 34084 keys)
> org.Ss.egGO2ALLEGS has 11689 mapped keys (of 11689 keys)
> org.Ss.egGO2EG has 8215 mapped keys (of 8215 keys)
> org.Ss.egPATH has 4458 mapped keys (of 34084 keys)
> org.Ss.egPATH2EG has 225 mapped keys (of 225 keys)
> org.Ss.egPMID has 10966 mapped keys (of 34084 keys)
> org.Ss.egPMID2EG has 3938 mapped keys (of 3938 keys)
> org.Ss.egREFSEQ has 24384 mapped keys (of 34084 keys)
> org.Ss.egREFSEQ2EG has 53138 mapped keys (of 53138 keys)
> org.Ss.egSYMBOL has 34084 mapped keys (of 34084 keys)
> org.Ss.egSYMBOL2EG has 28138 mapped keys (of 28138 keys)
> org.Ss.egUNIGENE has 8798 mapped keys (of 34084 keys)
> org.Ss.egUNIGENE2EG has 8912 mapped keys (of 8912 keys)
> org.Ss.egUNIPROT has 6660 mapped keys (of 34084 keys)
>
>
> Additional Information about this package:
>
> DB schema: PIG_DB
> DB schema version: 2.1
> Organism: Sus scrofa
> Date for NCBI data: 2011-Sep14
> Date for GO data: 20110910
> Date for KEGG data: 2011-Mar15
>
>
>> sessionInfo()<<session when creating org.db>>
> R version 2.14.0 (2011-10-31)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] RCurl_1.9-5           bitops_1.0-4.1        GO.db_2.6.1
> [4] RSQLite_0.11.1        DBI_0.2-5             AnnotationDbi_1.16.11
> [7] Biobase_2.14.0
>
> loaded via a namespace (and not attached):
> [1] IRanges_1.12.5 tools_2.14.0
>> sessionInfo()<<session when comparing the 2 org.dbs>>
> R version 2.14.0 (2011-10-31)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] org.Ss.eg.db_2.6.4    org.SscrofaGH.eg.db_1.0    RSQLite_0.11.1
> [4] DBI_0.2-5             AnnotationDbi_1.16.11 Biobase_2.14.0
>
> loaded via a namespace (and not attached):
> [1] IR
>
>
> Gr, Guido
>
> ---------------------------------------------------------
> Guido Hooiveld, PhD
> Nutrition, Metabolism & Genomics Group
> Division of Human Nutrition
> Wageningen University
> Biotechnion, Bomenweg 2
> NL-6703 HD Wageningen
> the Netherlands
> tel: (+)31 317 485788
> fax: (+)31 317 483342
> email:      guido.hooiveld at wur.nl
> internet:   http://nutrigene.4t.com
> http://scholar.google.com/citations?user=qFHaMnoAAAAJ
> http://www.researcherid.com/rid/F-4912-2010
>


