[BioC] fastest way to get a gene list having certain GO term
Duke
duke.lists at gmx.com
Wed Mar 7 18:19:25 CET 2012
On 3/6/12 5:24 PM, Duke wrote:
> On 3/6/12 3:47 PM, Duke wrote:
>> Hi folks,
>>
>> I need some statistics for a certain GO term (for example, "DNA
>> binding"), and I wonder what is the fastest way to archive the latest
>> list of genes having that specific GO term. There are now a number of
>> GO packages and I would like to hear/learn your experience regarding
>> various different packages.
>
> To archive the above task, I separated it as two processes:
>
> * Get all GO IDs having specific term ("DNA binding")
> * Then, get all the genes having the resulting GO IDs
>
> I think I got the numbers now:
>
> library("GO.db")
> library("org.Hs.eg.db")
>
> GOTerm2GOID = function(term){
> GTL = eapply(GOTERM, function(x){grep(term, x at Term, value=TRUE)})
> GID = sapply(GTL, length)
> names(GTL[GID > 0])
> }
>
> length(unlist(sapply(GOTerm2GOID("DNA binding"), function(x) mget(x,
> revmap(org.Hs.egGO), ifnotfound=NA))))
> 4265
>
> However, I am still stuck at how to get the gene symbols (Il22, Foxp3
> for example) as well as RefSeq ID of the resulting gene list.
Thanks David for your suggestion. I had some quick readings on
org.Hs.eg.db as you suggested and finally came up with:
library("GO.db")
library("org.Hs.eg.db")
GOTerm2GOID <- function(term){
GTL = eapply(GOTERM, function(x){grep(term, x at Term, value=TRUE)})
GID = sapply(GTL, length)
names(GTL[GID > 0])
}
ENTREZ2GENES <- function(entrezList=NULL, geneList=NULL){
geneVec = as.vector(unlist(sapply(entrezList,
function(x){geneList[names(geneList)==x]})))
geneVec = geneVec[order(geneVec)]
geneVec = geneVec[!duplicated(geneVec)]
geneVec
}
## get all gene entrez ids
d = as.data.frame(unlist(sapply(GOTerm2GOID("RNA binding"), function(x)
mget(x, revmap(org.Hs.egGO), ifnotfound=NA))))
geneIDs = as.vector(d[!is.na(d[[1]]),])
geneIDs = geneIDs[order(geneIDs)]
geneIDs = geneIDs[!duplicated(geneIDs)]
length(geneIDs)
## get all refseq IDs
refseqDat <- org.Hs.egREFSEQ
mapped_genes <- mappedkeys(refseqDat)
refseqDatList <- as.list(refseqDat[mapped_genes])
length(ENTREZ2GENES(entrezList=geneIDs, geneList=refseqDatList))
## get all gene symbol
geneSymDat <- org.Hs.egSYMBOL
mapped_genes <- mappedkeys(geneSymDat)
geneSymDatList <- as.list(geneSymDat[mapped_genes])
length(ENTREZ2GENES(entrezList=geneIDs, geneList=geneSymDatList))
I think the code works fine, and I get what I need, but I still believe
there should be a faster way to achieve the same thing. I would like to
hear more from experienced users.
Thanks,
D.
More information about the Bioconductor
mailing list