[BioC] Annotation using the affy package

Wed Oct 24 17:11:00 CEST 2012

Hi Jens,

On 10/24/2012 9:52 AM, Jens Lichtenberg wrote:
> Hi James,
>
> Thank you so much for your help. I am successfully building ens (had 
> to update to 2.15 to use select) but for some reason I am having 
> problems merging/binding the data into the same frame
>
> > data.frame(ens, exprs(eset_rma), assayDataElement(eset_pma, "se.exprs"))
> Error in data.frame(ens, exprs(eset_rma), assayDataElement(eset_pma, 
> "se.exprs")) :
>   arguments imply differing number of rows: 46603, 45101

Indeed. When you first generated your ens data.frame, you got this message:

 > ens <- select(mouse4302.db, Lkeys(mouse4302ENSEMBL), "ENSEMBL")
Warning message:
In .generateExtraRows(tab, keys, jointype) :
   'select' resulted in 1:many mapping between keys and return rows

This means that some probesets map to more than one ENSEMBL ID, so you 
now have to decide what you want to do with these multiply mapped 
probesets. You could decide that a single unique mapping is sufficient, 
and do this:

 > ens2 <- ens[!duplicated(ens$PROBEID),]
 > nrow(ens2)
[1] 45101

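You can then test whether ens2 is in the same order as eset_pma, and can 
therefore be cbind()ed to its data:

all.equal(ens2$PROBEID, featureNames(eset_pma), check.attributes = FALSE)

If that returns TRUE, cbind() away. (If not, all.equal() returns a 
description of the differences rather than FALSE.)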
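
As a minimal sketch of that first option, here it is on toy data. The 
objects 'ens' and 'expr' below are small made-up stand-ins for the output 
of select() and exprs(eset_rma), not your real data, and the isTRUE() 
wrapper is my addition, there because all.equal() does not return FALSE 
on a mismatch:

```r
# Toy stand-in for the select() output: probeset "b" maps to two IDs.
ens <- data.frame(PROBEID = c("a", "b", "b", "c"),
                  ENSEMBL = c("E1", "E2", "E3", "E4"),
                  stringsAsFactors = FALSE)

# Toy stand-in for exprs(eset_rma): one row per probeset.
expr <- matrix(c(7.1, 8.2, 9.3, 7.4, 8.5, 9.6), nrow = 3,
               dimnames = list(c("a", "b", "c"), c("s1", "s2")))

# Keep only the first mapping for each probeset.
ens2 <- ens[!duplicated(ens$PROBEID), ]

# Only bind the columns together if the probeset IDs line up row by row.
if (isTRUE(all.equal(ens2$PROBEID, rownames(expr),
                     check.attributes = FALSE))) {
  out <- cbind(ens2, expr)
} else {
  stop("probeset IDs are not in the same order")
}
```

Note that this keeps whichever mapping happens to come first for each 
probeset, which is an arbitrary choice.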

Or if you want all of the ENSEMBL IDs, you can collapse them to one 
comma-separated string per probeset and then incorporate that:

ens3 <- tapply(ens$ENSEMBL, ens[,1], paste, collapse = ",")

data.frame(ens3[featureNames(eset_pma)], <other args go here>)
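
A small sketch of the collapsing approach on toy data, in case it helps. 
'ens' and 'ids' are made-up stand-ins for the select() output and for 
featureNames() of your ExpressionSet; the point is that indexing the 
tapply() result by name puts it in the same order as the expression data:

```r
# Toy stand-in for the select() output: probeset "b" maps to two IDs.
ens <- data.frame(PROBEID = c("a", "b", "b", "c"),
                  ENSEMBL = c("E1", "E2", "E3", "E4"),
                  stringsAsFactors = FALSE)

# Stand-in for the featureNames() of the ExpressionSet, in a
# deliberately different order from 'ens'.
ids <- c("c", "a", "b")

# One comma-separated string of Ensembl IDs per probeset.
ens3 <- tapply(ens$ENSEMBL, ens$PROBEID, paste, collapse = ",")

# Index by name so the annotation follows the order of 'ids'.
annot <- data.frame(PROBEID = ids,
                    ENSEMBL = as.vector(ens3[ids]),
                    stringsAsFactors = FALSE)
```

One thing to watch for: a probeset with no mapping has NA for its ENSEMBL 
value, and paste() turns that into the string "NA".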

>
> > merge(ens,exprs(eset_rma))
> Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
>   cannot allocate vector of length 2101841903
>
> Any idea how I could resolve this issue?

Note that you need to read the help page for the function you are using. 
What do you think happens with merge() if you don't specify the columns 
upon which you intend to merge?

You are trying to merge two things, each of which has fewer than 47K 
rows. But the error says something about a vector of over 2.1 
billion items. That should make you say 'Wait, WHAT? What did I do?' and 
then investigate. See ?merge.
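
For what it is worth, 46603 * 45101 = 2101841903, which is exactly the 
length in the error. Here is a toy sketch of what I think is going on, 
with made-up 'ens' and 'expr' objects standing in for yours. The 
by.x/by.y call at the end is one way to name the key, assuming the 
probeset IDs are the row names of the expression matrix:

```r
# Toy stand-ins for the select() output and for exprs(eset_rma).
ens <- data.frame(PROBEID = c("a", "b", "b", "c"),
                  ENSEMBL = c("E1", "E2", "E3", "E4"),
                  stringsAsFactors = FALSE)
expr <- matrix(c(7.1, 8.2, 9.3, 7.4, 8.5, 9.6), nrow = 3,
               dimnames = list(c("a", "b", "c"), c("s1", "s2")))

# By default merge() joins on the column names the two inputs share.
# Here there are none, so every row of 'ens' is paired with every
# row of 'expr' (a Cartesian product).
n_default <- nrow(merge(ens, expr))

# Naming the key explicitly gives one row per probeset/Ensembl pair.
merged <- merge(ens, expr, by.x = "PROBEID", by.y = "row.names")
n_keyed <- nrow(merged)
```

In the toy case n_default is nrow(ens) * nrow(expr), and n_keyed is just 
nrow(ens).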

Best,

Jim

>
> Jens
>
> On Tue, Oct 23, 2012 at 5:09 PM, James W. MacDonald <jmacdon at uw.edu> 
> wrote:
>
>     Hi Jens,
>
>
>     On 10/23/2012 3:48 PM, Jens Lichtenberg [guest] wrote:
>
>         I am using the affy package to analyze a set of GSM files
>         downloaded from GEO. In addition to providing a table with
>         probe ids, expression levels and p values, I would like to
>         have the ensembl ids associated with the probe ids.
>
>         I loaded in the corresponding platform data (in my case
>         mouse4302) but I am not quite sure how to go about the
>         connection of the data.
>
>         Here is the way I am building the analysis table:
>
>           -- output of sessionInfo():
>
>         source("http://bioconductor.org/biocLite.R")
>         library(affy)
>
>         filenames<- c("1.CEL","2.CEL")
>
>         affy.data<- ReadAffy(filenames = as.character(filenames))
>         platform<- paste0(annotation(affy.data), ".db")
>         biocLite(platform)
>         library(platform, character.only = TRUE)
>
>         eset_rma<- rma(affy.data)
>         eset_pma<- mas5calls(affy.data)
>         my_frame<- data.frame(exprs(eset_rma),
>         assayDataElement(eset_pma, "se.exprs"))
>         my_frame<- my_frame[, sort(names(my_frame))]
>         write.table(my_frame, file="export.tsv", sep="\t", col.names = NA)
>
>
>     ens <- select(mouse4302.db, featureNames(eset_pma), "ENSEMBL")
>
>     If all the probeset IDs in 'ens' and 'my_frame' match up, you can
>     simply cbind() to my_frame. I assume they will, but I would check
>     to be sure. Otherwise you can just merge().
>
>     Best,
>
>     Jim
>
>
>
>
>         --
>         Sent via the guest posting facility at bioconductor.org.
>
>
>     -- 
>     James W. MacDonald, M.S.
>     Biostatistician
>     University of Washington
>     Environmental and Occupational Health Sciences
>     4225 Roosevelt Way NE, # 100
>     Seattle WA 98105-6099
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099