[BioC] Map GO terms to Uniprot from org.Hs.eg

Martin Morgan mtmorgan at fhcrc.org
Wed Sep 14 15:33:31 CEST 2011


On 09/14/2011 03:53 AM, Sandeep Amberkar wrote:
> Dear All,
>
>
> I have loaded the dataset "org.Hs.eg" into my R-session. Being using it for
> the first time, I am not familiar with its data structure. Can anyone please
> help me in building a table that contains ontology wise mapping to Uniprot
> identifiers? I want the final output table to look something like this --

Hi Sandeep --

Load an 'org' package for your organism of interest

   library(org.Hs.eg.db)

Define your identifiers of interest. I guess you have UniProt ids, but 
maybe you are starting from somewhere else, or just want a big table?

   uniprot <- c("A0A183", "A0A5E8", "A0A962", "A0AUX0", "A0AUZ9",
                "A0AV02")

The org packages are arranged as 'bi-maps', from a left key to a right 
key. In the 'eg' org packages such as org.Hs.eg.db, the left key is 
always an Entrez gene id. There is a map from Entrez gene id to Uniprot 
id. You need to reverse the map, and then create a subset that has just 
your own identifiers.

   egmap <- revmap(org.Hs.egUNIPROT)[uniprot]
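To make "reverse the map, then subset" concrete, here is a minimal base-R sketch on a toy named list. It only mimics the idea and does not use the Bimap classes; the names eg2uniprot, pairs and uniprot2eg are made up for illustration, and the two id pairs are the ones that appear in the merged output further down.

```r
# Toy stand-in for the Entrez -> UniProt direction (not a real Bimap).
eg2uniprot <- list("10634" = "A0A5E8", "272" = "A0AUX0")

# stack() flattens the list into a data.frame with columns 'values'
# (the UniProt ids) and 'ind' (the Entrez gene ids, taken from the names).
pairs <- stack(eg2uniprot)

# Reverse the direction: UniProt id -> Entrez gene id(s).
uniprot2eg <- split(as.character(pairs$ind), pairs$values)

# Keep only the identifiers of interest, as '[uniprot]' does for the bimap.
uniprot2eg["A0AUX0"]
```

With the real annotation object, revmap() does the reversal without materialising the whole table, so the sketch is only meant to show what the result represents.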

You can explore your map, e.g., by casting it to a data.frame

 > toTable(egmap)

or looking at the left keys (i.e., Entrez gene ids) that are mapped

 > mappedLkeys(egmap)
[1] "10634"  "151050" "272"    "448835" "55072"  "84561"

Having got to the Entrez gene ids, the next step is to create a map that 
goes to GO terms -- same as before, but no need to reverse the map

   gomap <- org.Hs.egGO[mappedLkeys(egmap)]

It's a bigger table and worth exploring; here are the top six rows of 
the data.frame

 > head(toTable(gomap))
   gene_id      go_id Evidence Ontology
1   10634 GO:0007050      IEA       BP
2     272 GO:0006144      TAS       BP
3     272 GO:0006196      TAS       BP
4     272 GO:0009117      IEA       BP
5     272 GO:0009168      IEA       BP
6     272 GO:0043101      TAS       BP

Two things to notice. First, the mapping between gene id and GO term is 
not 1:1. Second, each mapping is supported by an evidence code, and the 
codes differ. You need to decide how much of this table you'd like to 
keep; my choice would be to keep it all, but to adhere to your request I 
drop the 'Evidence' column. This might leave some duplicate rows, so I 
remove them

   unique(toTable(gomap)[,-3])
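Why duplicates can appear is easier to see on a small table. The sketch below uses a toy data.frame with the same column names as toTable(gomap); three rows are copied from the output above and one row (the IEA row for GO:0006144) is invented so that a duplicate exists. It drops the column by name rather than by position, which is a stylistic variation on [,-3] and not what the original call does.

```r
# Toy table shaped like toTable(gomap); the third row is invented.
go <- data.frame(
  gene_id  = c("10634", "272", "272", "272"),
  go_id    = c("GO:0007050", "GO:0006144", "GO:0006144", "GO:0006196"),
  Evidence = c("IEA", "TAS", "IEA", "TAS"),
  Ontology = c("BP", "BP", "BP", "BP"),
  stringsAsFactors = FALSE
)

# Drop 'Evidence' by name, then remove the rows that became identical.
slim <- unique(go[, names(go) != "Evidence"])

nrow(go)             # rows before
nrow(slim)           # rows after de-duplication
table(slim$gene_id)  # number of distinct GO terms per gene
```

The table() call also shows the first point above: a single gene id can carry several GO terms.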

This and toTable(egmap) contain the information you want, and we'd like 
to merge the data; merge() joins on the column the two tables share, 
gene_id

   merge(toTable(egmap), unique(toTable(gomap)[,-3]))

Here's a bit of what we get

    gene_id uniprot_id      go_id Ontology
1    10634     A0A5E8 GO:0007050       BP
2    10634     A0A5E8 GO:0005856       CC
3    10634     A0A5E8 GO:0005737       CC
4      272     A0AUX0 GO:0009168       BP
5      272     A0AUX0 GO:0006144       BP
6      272     A0AUX0 GO:0006196       BP

which is not exactly what you wanted, but reflects the reality that 
the mapping between gene id and ontology is not 1:1, so

 >
 > Uniprot           GO_BP         GO_CC        GO_MF
 > ABC123         GO:121         GO:122         GO:123
 >
 > Thanks in advance for your help.

is not a sensible representation. The short version of this is just 3 lines

 > egmap <- revmap(org.Hs.egUNIPROT)[uniprot]
 > gomap <- org.Hs.egGO[mappedLkeys(egmap)]
 > merge(toTable(egmap), unique(toTable(gomap)[,-3]))

so not as bad as the long-winded version might make it seem.
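If a one-row-per-Uniprot layout is still wanted, one possible workaround is to collapse all GO ids for each (Uniprot, ontology) pair into a single string. This is only a sketch under assumptions: the toy data.frame 'merged' holds the six rows shown above (a fragment of the real result, so the cells are incomplete), and the separator ";" and the name 'wide' are arbitrary choices.

```r
# The six rows of merged output shown above, typed in as a toy table.
merged <- data.frame(
  gene_id    = c("10634", "10634", "10634", "272", "272", "272"),
  uniprot_id = c("A0A5E8", "A0A5E8", "A0A5E8", "A0AUX0", "A0AUX0", "A0AUX0"),
  go_id      = c("GO:0007050", "GO:0005856", "GO:0005737",
                 "GO:0009168", "GO:0006144", "GO:0006196"),
  Ontology   = c("BP", "CC", "CC", "BP", "BP", "BP"),
  stringsAsFactors = FALSE
)

# One cell per (uniprot_id, Ontology) pair; several GO ids are pasted
# together, and pairs with no GO id come out as NA.
wide <- tapply(merged$go_id,
               merged[c("uniprot_id", "Ontology")],
               paste, collapse = ";")
wide
```

The cells are then strings rather than single GO ids, so anything downstream has to split them again; the long table from merge() is usually easier to work with.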

Hope that helps,

Martin
>
> Warm Regards,
> Sandeep Amberkar
> BioQuant,BQ26,
> Im Neuenheimer Feld 267,
> D-69120,Heidelberg
> Tel: +49-6221-5451354


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


