[BioC] problems generating a gene2GOlist in topGO

Fri Mar 23 11:23:15 CET 2012

Hi Antonio,

You are right: the main problem is with the "test.GO.BP" object. It
must be a list that maps genes to GO terms, not a data.frame. You can
obtain such a list from your data.frame object with (code not tested):

> gene.to.GO <- split(test.GO.BP$go_biological_process_id,  test.GO.BP$ensembl_gene_id)
> gene.to.GO <- lapply(gene.to.GO, unique)  # to remove duplicates

This will give you a named list, where the list names are the Ensembl
gene identifiers and each list entry holds the GO terms annotated
to the respective gene.
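
As a rough illustration only, here is the same conversion on a toy
data.frame built from the first rows of your head() output. Dropping
the rows with an empty GO identifier is an assumption on my side (such
rows carry no annotation), and the names "has.GO" and "annotated" are
made up for this sketch:

```r
# Toy stand-in for the data.frame returned by getBM()
# (values copied from the head() output in the original message)
test.GO.BP <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000054310", "ENSMUSG00000054728",
                      "ENSMUSG00000021368", "ENSMUSG00000021368",
                      "ENSMUSG00000051335", "ENSMUSG00000051335"),
  go_biological_process_id = c("GO:0006355", "", "GO:0032313",
                               "GO:0031398", "GO:0055114", "GO:0008152"),
  stringsAsFactors = FALSE
)

# Assumption: rows with a missing or empty GO ID can be dropped
has.GO <- !is.na(test.GO.BP$go_biological_process_id) &
  nzchar(test.GO.BP$go_biological_process_id)
annotated <- test.GO.BP[has.GO, ]

# One list entry per gene, holding its GO terms without duplicates
gene.to.GO <- split(annotated$go_biological_process_id,
                    annotated$ensembl_gene_id)
gene.to.GO <- lapply(gene.to.GO, unique)
str(gene.to.GO)
```

The resulting list is what would go into the "gene2GO" argument of
your new("topGOdata", ...) call in place of the data.frame.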

There is another problem with your data. The list of gene scores
"geneList" contains duplicated names as I can see from your output
(ENSMUSG00000025903 appears 4 times with different scores). This is
not allowed in topGO, and you should find a way to remove the
duplicates.
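
One possible way to do this, sketched on the five values from your
output: collapse the scores per gene with a summary function. Keeping
the largest score per gene is purely an assumption here; which summary
(maximum, mean, a selected probe) makes sense depends on your array.
The name "geneList.unique" is made up for this sketch:

```r
# Toy version of geneList, using the values shown in the original message
geneList <- c(0.11, 0.36, 0.32, 0.07, 0.08)
names(geneList) <- c(rep("ENSMUSG00000025903", 4), "ENSMUSG00000033813")

# Assumption: keep the maximum score for each gene identifier
geneList.unique <- vapply(split(geneList, names(geneList)), max, numeric(1))

# Every gene name should now appear exactly once
stopifnot(!anyDuplicated(names(geneList.unique)))
geneList.unique
```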

Hope this helps.

Regards,
Adrian Alexa

On Wed, Mar 21, 2012 at 3:43 PM, António Miguel de Jesus Domingues
<amjdomingues at gmail.com> wrote:
> Dear Bioconductor list,
>
> I have a list of genes from a mouse array (custom design) for which I want
> to perform an analysis with topGO. The package example is running fine and
> I have read the vignettes (though I've probably missed something) but when
> running my own data an error is generated that seems to be related to my
> custom Gene-to-GO map.
>
> The results are a table with several annotations and custom measure of
> significance. I've created a named vector (list) containing all the genes
> present in the array (ensembl IDs) with the corresponding measure of
> significance - geneList.
>
> geneList <- abs(data[ ,2])
> names(geneList) <- data[ ,1]
> geneList[1:5]
> ENSMUSG00000025903 ENSMUSG00000025903 ENSMUSG00000025903 ENSMUSG00000025903 ENSMUSG00000033813
>               0.11               0.36               0.32               0.07               0.08
>
> is(geneList)
> [1] "numeric"          "vector"           "atomic"
> "EnumerationValue" "numeric or NULL"  "vectorORfactor"
>
> summary(geneList)
> Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> 0.0100  0.0600  0.2000  0.4568  0.5600 18.1600
>
> # a function was then defined to select the significant genes, as in the vignette
>
> topDiffGenes <- function(allScore) {
>  return(allScore > 1)
>  }
>
>
> x <- topDiffGenes(geneList)
> sum(x)
>
> # so far so good
> # because this is a custom array the GO annotation was extracted from
> # Ensembl using biomaRt.
> # ensembl61 was used because of the gene format in my results
>
> ensembl61=useMart('ENSEMBL_MART_ENSEMBL',dataset='mmusculus_gene_ensembl',
>                  host='feb2011.archive.ensembl.org')
>
> test.GO.BP <- getBM(attributes = c("ensembl_gene_id", "go_biological_process_id"),
>                     filters = "ensembl_gene_id",
>                     values = All.genes.Ens,
>                     mart = ensembl61)
> head(test.GO.BP)
>
> ensembl_gene_id go_biological_process_id
> 1 ENSMUSG00000054310               GO:0006355
> 2 ENSMUSG00000054728
> 3 ENSMUSG00000021368               GO:0032313
> 4 ENSMUSG00000021368               GO:0031398
> 5 ENSMUSG00000051335               GO:0055114
> 6 ENSMUSG00000051335               GO:0008152
>
> # but when creating the topGO object a problem appears:
>
> GOdata <- new("topGOdata",
>              description = "GO analysis Test",
>              ontology = "BP",
>              allGenes = geneList,
>              geneSel = topDiffGenes,
>              annot = annFUN.gene2GO,
>              nodeSize = 5,
>              gene2GO = test.GO.BP)
>
> Building most specific GOs ..... ( 0 GO terms found. )
>
> Build GO DAG topology .......... ( 0 GO terms and 0 relations. )
> Error in if (is.na(index) || index < 0 || index > length(nd))
> stop(paste("selected vertex",  :
>  missing value where TRUE/FALSE needed
>
> From reading the vignette I think that the object test.GO.BP, a data.frame,
> needs to be converted to a list in which each gene corresponds to several GO
> terms:
>
> List of 6
> $ 068724: chr [1:5] "GO:0005488" "GO:0003774" "GO:0001539" "GO:0006935" ...
> $ 119608: chr [1:6] "GO:0005634" "GO:0030528" "GO:0006355" "GO:0045449" ...
> $ 049239: chr [1:13] "GO:0016787" "GO:0017057" "GO:0005975" "GO:0005783" ...
> $ 067829: chr [1:16] "GO:0045926" "GO:0016616" "GO:0000287" "GO:0030145" ...
> $ 106331: chr [1:10] "GO:0043565" "GO:0000122" "GO:0003700" "GO:0005634" ...
> $ 214717: chr [1:7] "GO:0004803" "GO:0005634" "GO:0008270" "GO:0003677" ...
>
> Is this what I need to do next? If so, how do I do it? Or is it something else?
>
> Any help will be appreciated.
>
> Session info:
>> sessionInfo()
> R version 2.14.2 (2012-02-29)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] C/en_US.UTF-8/C/C/C/C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>  [1] plyr_1.7.1                           genefilter_1.36.0
>  [3] hgu95av2_2.2.0                       hgu95av2.db_2.6.3
>  [5] org.Hs.eg.db_2.6.4                   affyio_1.22.0
>  [7] affydata_1.11.15                     affy_1.32.1
>  [9] multtest_2.10.0                      ALL_1.4.11
> [11] topGO_2.6.0                          SparseM_0.91
> [13] GO.db_2.6.1                          graph_1.32.0
> [15] mogene10sttranscriptcluster.db_8.0.1 org.Mm.eg.db_2.6.4
> [17] RSQLite_0.11.1                       DBI_0.2-5
> [19] AnnotationDbi_1.16.19                Biobase_2.14.0
> [21] BiocInstaller_1.2.1                  biomaRt_2.10.0
> [23] Biostrings_2.22.0                    GenomicRanges_1.6.7
> [25] IRanges_1.12.6
>
> loaded via a namespace (and not attached):
>  [1] MASS_7.3-17           RColorBrewer_1.0-5
>  [3] RCurl_1.91-1          XML_3.9-4
>  [5] annotate_1.32.3       colorspace_1.1-1
>  [7] dichromat_1.2-4       digest_0.5.2
>  [9] ggplot2_0.9.0         grid_2.14.2
> [11] lattice_0.20-6        memoise_0.1
> [13] munsell_0.3           preprocessCore_1.16.0
> [15] proto_0.3-9.2         reshape2_1.2.1
> [17] scales_0.2.0          splines_2.14.2
> [19] stringr_0.6           survival_2.36-12
> [21] tools_2.14.2          xtable_1.7-0
> [23] zlibbioc_1.0.1
>
> --
> António Miguel de Jesus Domingues, PhD
> Neugebauer group
> Max Planck Institute of Molecular Cell Biology and Genetics, Dresden
> Pfotenhauerstrasse 108
> 01307 Dresden
> Germany
>
> e-mail: domingue at mpi-cbg.de
> tel. +49 351 210 2481
> The Unbearable Lightness of Molecular Biology