[BioC] Agi4x44PreProcess probe summarization bug?
Tobias Straub
tstraub at med.uni-muenchen.de
Fri Oct 23 17:52:33 CEST 2009
Hi Massimo,
On Oct 23, 2009, at 4:30 PM, Massimo Pinto wrote:
> Hey Tobias
> I have accordingly defined a function called summarize.probe.tobias(),
> saved into a local file, and sourced it.
> Upon calling it from my R script, there are no R crashes. The result
> of the operation has the same size of the RGlist that was created
> using the putatively wrong summarize.probe()
>
>> class(ddPROC.bis)
> [1] "RGList"
> attr(,"package")
> [1] "limma"
>> dim(ddPROC.bis)
> [1] 25574 24
>> dim(ddPROC)
> [1] 25574 24
>
> I thought you were aiming at a summarization here, so I would have
> expected ddPROC.bis to be smaller than ddPROC. I will make some
> comparisons between the readings of duplicated probes in the two
> RGlists ASAP.
the size is expected to be the same. the problem with the original
function is, that the medians are computed but not written into the
new RGList. The signal level of the first probe in the set of probes
mapping to one gene are reported.
>
> If troubleshooting is needed, I may need to compare the source code to
> the original version that is made available through the Agi4x44
> package 1.4.0. That ought to be somewhere...but where is the R code
> for the Agi4x44 package? There are a few source files on my HD but
> none looks pertinent - none contains the source of summarize.probe().
> I have tried to download and install the package from source code,
> again, but that did not leave any source code on my hard disk,
> apparently.
if you fetch http://bioconductor.org/packages/release/bioc/src/contrib/Agi4x44PreProcess_1.4.0.tar.gz
and unpack it do your desktop e.g. you will find in the R folder
contained a file named summarize.probe.R which is the original function.
anyway this code is very very greedy. one could easily do better and
faster code...
Anyway I followed Francis' suggestion and found that summarization
based on the probe closest to the 3' end is the most accurate and
precise way to do on (my) agilent arrays (drosophila)
However, it is not trivial to get the 'good' most 3' prime probe as
there are many probes mapping to more than one gene that I would
recommend to skip.
Alternatively summarization by keeping just the probe with the highest
variance is also an option.
> Can you point me to a document on the web where these basics steps are
> explained?
which steps are you refrerring to?
by the way: which array/organism are you using. and are you doing one
or dual color experiments?
best
T.
> Thank you in advance
> Massimo
>
>
>
>
> On Wed, Oct 21, 2009 at 8:52 AM, Tobias Straub
> <tstraub at med.uni-muenchen.de> wrote:
>> hi Massimo,
>>
>> you can simply copy/paste it into your R console. Otherwise paste
>> it into a
>> text file and source it from within R. >source("path/to/textfile")
>> then you
>> can call summarize.probe as before.
>>
>> best
>> T.
>>
>> On Oct 20, 2009, at 12:45 PM, Massimo Pinto wrote:
>>
>>> Hi Tobias,
>>>
>>> I have stumbled again into problems with Agi4xPreProcess and came
>>> back
>>> to this conversation that was started on the BioC list. I have never
>>> tried re-defining summarize.probe(), but as I try it, should I put
>>> it
>>> into a file which then I just 'load' from my R session? Sorry for
>>> what
>>> may seem a silly question, but I have not playing with defining or
>>> re-defining functions. I should probably read a piece of basic
>>> relevant documentation first.
>>>
>>> I have started yet another thread on Agi4x44, on a not-so-separate
>>> topic, on the list:
>>> https://stat.ethz.ch/pipermail/bioconductor/2009-October/030100.html
>>>
>>> this is because I ran into a duplicated gene problems much later,
>>> after my gene ontology analysis:
>>> https://stat.ethz.ch/pipermail/bioconductor/2009-October/030076.html
>>>
>>> Yours Truly,
>>> Massimo
>>>
>>> Massimo Pinto
>>> Post Doctoral Research Fellow
>>> Enrico Fermi Centre and Italian Public Health Research Institute
>>> (ISS),
>>> Rome
>>> http://claimid.com/massimopinto
>>>
>>>
>>>
>>> On Tue, Aug 4, 2009 at 11:15 PM, Massimo Pinto
>>> <pintarello at gmail.com>
>>> wrote:
>>>>
>>>> Thank you Tobias,
>>>> I may give it a try as soon as I have sorted out a factorial design
>>>> problem that I am having. Strictly speaking, this should come after
>>>> pre-processing, but it's been bugging my mind a lot lately!
>>>> Yours,
>>>> Massimo
>>>>
>>>> Massimo Pinto
>>>> Post Doctoral Research Fellow
>>>> Enrico Fermi Centre and Italian Public Health Research Institute
>>>> (ISS),
>>>> Rome
>>>> http://claimid.com/massimopinto
>>>>
>>>>
>>>>
>>>> On Tue, Aug 4, 2009 at 8:24 PM, Tobias
>>>> Straub<tstraub at med.uni-muenchen.de> wrote:
>>>>>
>>>>> Hi Massimo,
>>>>>
>>>>> in principle the effect should not be dramatic if the replicate
>>>>> signals
>>>>> are
>>>>> not too different. it's rather easy to correct the function by
>>>>> overriding it
>>>>> with the correct code. this is my quick fix, not tested at all.
>>>>> have a look and try it out.
>>>>>
>>>>> best
>>>>> Tobias
>>>>>
>>>>> ######
>>>>>
>>>>> `summarize.probe` <-
>>>>> function(ddFILT,makePLOT,targets) {
>>>>>
>>>>> cat("SUMMARIZATION OF non-CTRL PROBES ","\n")
>>>>> cat("\n")
>>>>> if (!is(ddFILT, "RGList")){
>>>>> stop("'input' must be a RGList")
>>>>> if (is.null(dim(ddFILT)[1])) {
>>>>> stop("'input' is empty")
>>>>> }
>>>>> }
>>>>>
>>>>> if(missing(targets)){
>>>>> stop("'targets' is missing ")
>>>>> }
>>>>>
>>>>> if("GErep" %in% colnames(targets)){
>>>>> GErep=targets$GErep
>>>>> nGE=sum(table(table(GErep)))
>>>>> }else{
>>>>> stop("'targets' needs 'GErep' field")
>>>>> }
>>>>>
>>>>> ddDUP=ddFILT
>>>>> dup=which(duplicated(ddDUP$genes$ProbeName)==TRUE)
>>>>>
>>>>> if(length(dup) == 0){
>>>>> ddPROC=ddFILT
>>>>> stop("NOT DUPLICATED ProbeName in chip")
>>>>> }
>>>>>
>>>>>
>>>>> Ldup=length(dup)
>>>>> reps=table(ddDUP$genes$ProbeName[dup])
>>>>> t=table(reps)
>>>>> rN=names(reps)
>>>>> Lreps=length(rN)
>>>>>
>>>>> nARR=dim(ddDUP)[2]
>>>>>
>>>>> if(Lreps !=0){
>>>>> for(i in 1:Lreps){
>>>>> index=which(ddDUP$genes$ProbeName==rN[i])
>>>>> MED=apply(ddDUP$G[index,],2,median)
>>>>> for (k in 1:length(index)) {
>>>>> ddDUP$G[index,][k,] <- MED
>>>>> }
>>>>> }
>>>>> ddPROC=ddDUP[-dup,]
>>>>> } else{
>>>>> ddPROC=ddDUP
>>>>> }
>>>>>
>>>>> # cat("REPLICATED NonCtrl",Lreps,"\n")
>>>>> # cat("DISTRIBUTION OF REPLICATED NonControl
>>>>> Probes","\n")
>>>>> # print(t)
>>>>> # cat("# REPLICATED (redundant)
>>>>> probeNames",Ldup,"\n")
>>>>> # cat("\n")
>>>>> cat("SUMMARIZED DATA: ",dim(ddPROC),"\n")
>>>>>
>>>>>
>>>>> cat("------------------------------------------------------","\n")
>>>>>
>>>>>
>>>>> if(!missing(makePLOT)) {
>>>>> if(makePLOT){
>>>>>
>>>>> colorfill="green"
>>>>> maintitle="NORMALIZED & SUMMARIZED SIGNAL"
>>>>>
>>>>> X11()
>>>>> par(mfrow=c(1,1))
>>>>> plotDensity(ddPROC$G,maintitle)
>>>>>
>>>>> X11()
>>>>> par(mfrow=c(1,1))
>>>>> BoxPlot(ddPROC$G,maintitle,colorfill)
>>>>>
>>>>> X11()
>>>>> maintitle="NORMALIZED & SUMMARIZED DATA - RLE "
>>>>> RLE(ddPROC$G,maintitle,colorfill)
>>>>>
>>>>> X11()
>>>>> par(mfrow=c(2,2),ask=T)
>>>>> maintitle="summ data"
>>>>> MVAplotMED(ddPROC$G,colorfill,maintitle)
>>>>>
>>>>> X11()
>>>>> par(mfrow=c(1,1))
>>>>> hierclus(ddPROC$G,GErep,methdis="euclidean",
>>>>> methclu="complete",sel=FALSE,100)
>>>>>
>>>>> }
>>>>> }
>>>>>
>>>>> return(ddPROC)
>>>>> } # end function
>>>>>
>>>>> #####
>>>>>
>>>>> On Aug 4, 2009, at 2:18 PM, Massimo Pinto wrote:
>>>>>
>>>>>> Hi Tobias,
>>>>>> I am trying to figure out what would be the effect of this
>>>>>> potential
>>>>>> bug.
>>>>>> Have you tried to correct the function to see what happens when
>>>>>> MED is
>>>>>> returned?
>>>>>> Massimo
>>>>>>
>>>>>> --
>>>>>> Massimo Pinto
>>>>>> Post Doctoral Research Fellow
>>>>>> Enrico Fermi Centre and Italian Public Health Research
>>>>>> Institute (ISS),
>>>>>> Rome
>>>>>> http://claimid.com/massimopinto
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =
>>>>>> =================================================================
>>>>>> L'Istituto Superiore di Sanità (ISS) è tra i beneficiari dei
>>>>>> proventi
>>>>>> del
>>>>>> 5 per mille dell'IRPEF.
>>>>>> Nella scheda allegata alla dichiarazione dei redditi è
>>>>>> sufficiente
>>>>>> apporre
>>>>>> la propria firma nel riquadro "Finanziamento della Ricerca
>>>>>> Sanitaria" e
>>>>>> indicare il Codice Fiscale dell'ISS, che è 80211730587, per
>>>>>> destinare
>>>>>> tali
>>>>>> fondi a sostegno dell'impegno scientifico dell'ISS a difesa della
>>>>>> salute di
>>>>>> tutti.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>> ----------------------------------------------------------------------
>>>>> Tobias Straub ++4989218075439 Adolf-Butenandt-Institute,
>>>>> München D
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>
>> ----------------------------------------------------------------------
>> Tobias Straub ++4989218075439 Adolf-Butenandt-Institute,
>> München D
>>
>>
>>
>>
>>
----------------------------------------------------------------------
Tobias Straub ++4989218075439 Adolf-Butenandt-Institute, München D
More information about the Bioconductor
mailing list