[BioC] Map GO terms to Uniprot from org.Hs.eg
Martin Morgan
mtmorgan at fhcrc.org
Wed Sep 14 15:33:31 CEST 2011
On 09/14/2011 03:53 AM, Sandeep Amberkar wrote:
> Dear All,
>
>
> I have loaded the dataset "org.Hs.eg" into my R-session. Being using it for
> the first time, I am not familiar with its data structure. Can anyone please
> help me in building a table that contains ontology wise mapping to Uniprot
> identifiers? I want the final output table to look something like this --
Hi Sandeep --
Load the 'org' package for your organism of interest
  library(org.Hs.eg.db)
and define your identifiers of interest; I guess you have UniProt ids, but
maybe you are starting from somewhere else, or just want a big table?
  uniprot <- c("A0A183", "A0A5E8", "A0A962", "A0AUX0", "A0AUZ9",
               "A0AV02")
The org packages are arranged as 'bi-maps', from a left key to a right
key. In org.Hs.eg.db the left key is always an Entrez gene id. There is a
map from Entrez gene id to UniProt id. You need to reverse the map, and
then create a subset that has just your own identifiers.
  egmap <- revmap(org.Hs.egUNIPROT)[uniprot]
You can explore your map, e.g., by casting it to a data.frame
> toTable(egmap)
or looking at the left keys (i.e., Entrez gene ids) that are mapped
> mappedLkeys(egmap)
[1] "10634" "151050" "272" "448835" "55072" "84561"
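As a rough illustration of what the reverse-and-subset step does, here is a base R sketch on a made-up two-column table. The ids and the names `toy_map`, `wanted`, `toy_sub` and `toy_lkeys` are hypothetical and are not taken from org.Hs.eg.db; a real Bimap object is not a data.frame, so this only mimics the idea.

```r
# Toy stand-in for an Entrez -> UniProt map (all ids are invented).
toy_map <- data.frame(
  gene_id    = c("1", "1", "2", "3"),
  uniprot_id = c("P00001", "P00002", "P00003", "P00004"),
  stringsAsFactors = FALSE
)

# "Reversing" the map means looking rows up by the right key (UniProt id)
# instead of the left key (Entrez gene id).
wanted  <- c("P00002", "P00004")
toy_sub <- toy_map[toy_map$uniprot_id %in% wanted, ]

# The left keys that remain after subsetting, analogous to mappedLkeys().
toy_lkeys <- unique(toy_sub$gene_id)
print(toy_sub)
print(toy_lkeys)
```

Note that one gene id can carry several UniProt ids, so the subset keeps only the rows whose right key was asked for.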
Having got to the Entrez gene ids, the next step is to create a map that
goes to GO terms -- same as before, but no need to reverse the map
  gomap <- org.Hs.egGO[mappedLkeys(egmap)]
It's a bigger table and worth exploring; here's the top six rows of the
data.frame
> head(toTable(gomap))
  gene_id      go_id Evidence Ontology
1   10634 GO:0007050      IEA       BP
2     272 GO:0006144      TAS       BP
3     272 GO:0006196      TAS       BP
4     272 GO:0009117      IEA       BP
5     272 GO:0009168      IEA       BP
6     272 GO:0043101      TAS       BP
Two things stand out. First, the mapping between gene id and GO term is
not 1:1. Second, there are different types of evidence codes supporting
each mapping. You need to decide how much of this table you'd like to
keep; my choice would be to keep it all, but to adhere to your request I
drop the 'Evidence' column (column 3). This might leave some duplicate
rows, which I remove with unique()
  unique(toTable(gomap)[,-3])
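To see why dropping a column can create duplicate rows, here is a small base R sketch on invented data. The table `toy_go` and its ids are hypothetical stand-ins for toTable(gomap), not real annotations, and selecting the column by name rather than by position is just one possible choice.

```r
# Toy stand-in for toTable(gomap); ids and rows are invented.
toy_go <- data.frame(
  gene_id  = c("1", "1", "1", "3"),
  go_id    = c("GO:0000001", "GO:0000001", "GO:0000002", "GO:0000003"),
  Evidence = c("IEA", "TAS", "IEA", "TAS"),
  Ontology = c("BP", "BP", "CC", "MF"),
  stringsAsFactors = FALSE
)

# Rows 1 and 2 differ only in Evidence, so they become duplicates
# once that column is dropped; unique() keeps a single copy.
toy_go_noev <- unique(toy_go[, names(toy_go) != "Evidence"])
print(toy_go_noev)
```

Dropping by name instead of by index (-3) keeps working if the column order ever differs from what is shown here.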
This and toTable(egmap) contain the information you want, and we'd like
to merge the data
  merge(toTable(egmap), unique(toTable(gomap)[,-3]))
Here's a bit of what we get
  gene_id uniprot_id      go_id Ontology
1   10634     A0A5E8 GO:0007050       BP
2   10634     A0A5E8 GO:0005856       CC
3   10634     A0A5E8 GO:0005737       CC
4     272     A0AUX0 GO:0009168       BP
5     272     A0AUX0 GO:0006144       BP
6     272     A0AUX0 GO:0006196       BP
which is not exactly what you wanted, but reflects the reality that
the mapping between gene id and ontology is not 1:1, so
>
> Uniprot GO_BP GO_CC GO_MF
> ABC123 GO:121 GO:122 GO:123
>
> Thanks in advance for your help.
is not a sensible representation. The short version of this is just 3 lines
> egmap <- revmap(org.Hs.egUNIPROT)[uniprot]
> gomap <- org.Hs.egGO[mappedLkeys(egmap)]
> merge(toTable(egmap), unique(toTable(gomap)[,-3]))
so not as bad as the long-winded version might make it seem.
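If a one-row-per-UniProt layout like the requested one is still wanted, one possible workaround is to collapse all GO ids that share a UniProt id and ontology into a single delimited string. The sketch below uses base R on invented data; `toy_merged`, `toy_wide`, the ids and the ";" separator are assumptions for illustration, not part of the org package.

```r
# Toy stand-in for the merged table (all ids are invented).
toy_merged <- data.frame(
  uniprot_id = c("P00002", "P00002", "P00002", "P00004"),
  go_id      = c("GO:0000001", "GO:0000004", "GO:0000002", "GO:0000003"),
  Ontology   = c("BP", "BP", "CC", "MF"),
  stringsAsFactors = FALSE
)

# tapply() over (uniprot_id, Ontology) gives a matrix with one row per
# UniProt id and one column per ontology; GO ids that fall in the same
# cell are joined with ";", and empty cells are NA.
toy_wide <- with(
  toy_merged,
  tapply(go_id, list(uniprot_id, Ontology), paste, collapse = ";")
)
print(toy_wide)
```

The cost of this layout is that each cell may hold several GO ids, so it is convenient for reading but less convenient for further merging or filtering than the long table.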
Hope that helps,
Martin
>
> Warm Regards,
> Sandeep Amberkar
> BioQuant,BQ26,
> Im Neuenheimer Feld 267,
> D-69120,Heidelberg
> Tel: +49-6221-5451354
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793