[BioC] Newbie methylation and stats question
Gustavo Fernández Bayón
gbayon at gmail.com
Wed Jun 20 09:40:57 CEST 2012
Well, to sum up, I wanted to thank you all for your kind and constructive answers.
Now I am getting to work through the references you provided. There is a lot to learn in this field and I am still at the beginning. If I run into problems again, rest assured I'll be back on the list to ask.
Regards,
Gus
---------------------------
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
On Tuesday, June 19, 2012 at 20:19, Tim Triche, Jr. wrote:
> Oh, I don't disagree that improper normalization is a bad idea. However, quantile normalization on the overall raw intensities (for example), assuming there are not gross differences in copy number, seems to work OK in many cases. I have seen people quantile normalizing on the summary statistics, which strikes me as perverse, but it's their data and their papers, not mine.
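> A minimal sketch of what quantile normalization on raw intensities does (illustrative Python with made-up numbers, not any package's actual implementation):

```python
def quantile_normalize(cols):
    """Quantile-normalize a list of samples (each a list of intensities).

    Every sample is forced onto the same empirical distribution: the
    k-th smallest value in each sample is replaced by the mean of the
    k-th smallest values across all samples.
    """
    n = len(cols[0])
    ref = [sum(sorted(c)[k] for c in cols) / len(cols) for k in range(n)]
    out = []
    for c in cols:
        order = sorted(range(n), key=lambda i: c[i])  # indices, smallest to largest
        ranks = [0] * n
        for r, i in enumerate(order):
            ranks[i] = r
        out.append([ref[ranks[i]] for i in range(n)])
    return out

# Two toy "samples" of three probes with a gross scale difference;
# after normalization both share the same distribution {3, 6, 9}.
raw = [[2.0, 6.0, 4.0],
       [4.0, 8.0, 12.0]]
norm = quantile_normalize(raw)
```

> This is also why the caveat about gross differences matters: if one sample really does differ globally, forcing identical distributions wipes that signal out.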
>
> I do tend to believe that methods which take into account the peculiarities of the platform are preferable to those that don't, and such platform-aware methods do exist; the trouble is that few systematic comparisons have been conducted, and those mostly on small or unusual datasets.
>
> As you point out, failing to take into account the differences between expression data (sparse transcripts, mostly absent) and genomic DNA (whether genotyping or "epigenotyping" arrays) can be expected to lead to poor results. I'm not a fan of blindly applying anything, hence the suggestion to plot the data first and ask questions thereafter :-)
>
> Cheers,
>
> --t
>
>
>
> On Tue, Jun 19, 2012 at 11:12 AM, Yao Chen <chenyao.bioinfor at gmail.com> wrote:
> > Hi Tim.
> >
> > I didn't mean that we shouldn't normalize methylation data just because there is no standard method. What I wanted to say is that most of the existing normalization methods were derived from expression microarrays and don't fit methylation data. Most of these methods, such as quantile normalization, assume that most genes are not differentially expressed. In DNA methylation data, however, global hypomethylation is observed in many diseases, such as cancer. An improper normalization method would erase the real biological differences.
> >
> > Jack
> >
> > 2012/6/19 Tim Triche, Jr. <tim.triche at gmail.com>
> > > On Tue, Jun 19, 2012 at 7:46 AM, Yao Chen <chenyao.bioinfor at gmail.com> wrote:
> > > > As far as I know, there is no standard normalization for methylation data.
> > >
> > >
> > > As far as I know, there is no standard for microarray or RNA-seq normalization either! But that doesn't mean an investigator should ignore the issue of technical (as opposed to biological) fixed or varying effects in their data, especially when it could materially impact the outcome of a study. lumi offers quantile normalization, minfi and methylumi will do dye-bias normalization, etc.
> > >
> > > For example, GenomeStudio appears to choose a reference array for dye-bias adjustment within each batch of 450k samples, then corrects using the normalization controls so that the chips in the run have the same Cy3:Cy5 bias as the reference. This is less than optimal if you then want to compare against another, separately processed batch. Personally, I feel it's better to start from the IDATs.
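> > > A crude sketch of that kind of reference-chip rescaling (illustrative Python; the chip names and control intensities are invented, and real dye-bias correction works per channel on the actual normalization control probes):

```python
# Invented mean control-probe intensities per chip, one value per channel.
cy3 = {"chip1": 1050.0, "chip2": 925.0, "chip3": 1250.0}
cy5 = {"chip1":  525.0, "chip2": 620.0, "chip3":  415.0}

ratio = {chip: cy3[chip] / cy5[chip] for chip in cy3}
ref = "chip1"  # one chip in the batch is picked as the dye-bias reference

# Rescale each chip's Cy5 channel so that its Cy3:Cy5 control ratio
# matches the reference chip's ratio.
cy5_scale = {chip: ratio[chip] / ratio[ref] for chip in ratio}
corrected_ratio = {chip: cy3[chip] / (cy5[chip] * cy5_scale[chip]) for chip in ratio}
```

> > > The catch is exactly the one above: `ref` lives inside one batch, so two batches corrected against different reference chips are not directly comparable.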
> > >
> > > Another possibility is pernicious batch effects -- something like ComBat seems to work very well for those, usually, although as noted it's always up to the investigator to ensure that they are reporting on biologically (vs. technically) interesting differences.
> > >
> > > See for example http://www.biomedcentral.com/1755-8794/4/84
> > >
> > > > For me, I prefer keeping the raw values and just adjusting for the technical variation. If anyone has a better solution, please let me know.
> > >
> > >
> > >
> > > See above. If the usual MDS plots indicate a supervised effect, one should fix it, preferably on the logit scale with ComBat, SVA, or something else appropriate to the task (i.e. if you're doing unsupervised analyses, a different method might be optimal).
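> > > ComBat itself does empirical-Bayes shrinkage across probes, but the basic idea of removing a fixed batch offset on the logit scale can be sketched like this (illustrative Python with invented betas; a stand-in to show the idea, not a substitute for ComBat or SVA):

```python
import math

def logit2(beta, eps=1e-6):
    """Base-2 logit of beta, clipped away from 0 and 1 so it stays finite."""
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))

# One probe across six samples in two batches; batch 2 carries a
# purely technical upward shift.
betas = [0.20, 0.25, 0.22, 0.40, 0.45, 0.42]
batch = [1, 1, 1, 2, 2, 2]

m = [logit2(b) for b in betas]
grand = sum(m) / len(m)
for g in set(batch):
    idx = [i for i, b in enumerate(batch) if b == g]
    mean_g = sum(m[i] for i in idx) / len(idx)
    for i in idx:
        m[i] += grand - mean_g  # recenter each batch on the grand mean
```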
> > >
> > > thanks,
> > >
> > > --t
> > >
> > >
> > >
> > > > Jack
> > > >
> > > > 2012/6/19 Tim Triche, Jr. <tim.triche at gmail.com>
> > > > > Look up Andrew Jaffe and Rafa Irizarry's paper on "bump hunting" for
> > > > > regional differences. Or run a smooth over it (caveat: I just wrote
> > > > > smoothing "the way I want it" yesterday, after being provoked by a
> > > > > collaborator, so you might have to use lumi).
> > > > >
> > > > > The function "dmrFinder" in the "charm" package is specifically meant for
> > > > > this sort of thing.
> > > > >
> > > > > Also, if you're doing linear tests, be careful with normalization, mask
> > > > > your SNPs and chrX probes, and maybe use M-values (logit(beta)) for the
> > > > > task. The latter is more important for epidemiological datasets than
> > > > > something like cancer, where every single interesting result from M-value
> > > > > testing has been reproduced using untransformed beta values when I ran
> > > > > comparisons (e.g. HELP hg17 methylation differences for IDH1/2 mutants vs.
> > > > > Illumina hm450 differences for IDH1/2 mutants, the complete absence of any
> > > > > differences for TET2 mutants regardless of platform, etc.)
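> > > > > For concreteness, the M-value is just the base-2 logit of beta (a minimal sketch; the `eps` clipping is my own guard against betas of exactly 0 or 1, and real packages handle the boundaries in their own ways):

```python
import math

def m_value(beta, eps=1e-6):
    """M = log2(beta / (1 - beta)); eps keeps the logit finite at the boundaries."""
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))

# beta = 0.5 maps to M = 0, and the transform is symmetric around it,
# which stabilizes variance near the 0/1 boundaries for linear tests.
ms = [m_value(b) for b in (0.2, 0.5, 0.8)]
```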
> > > > >
> > > > > Mark Robinson just chimed in, I see. Probably a good idea to read his
> > > > > reply carefully as well.
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Jun 19, 2012 at 3:57 AM, Gustavo Fernández Bayón
> > > > > <gbayon at gmail.com> wrote:
> > > > >
> > > > > > Hi everybody.
> > > > > >
> > > > > > As a newbie to bioinformatics, I often run into difficulties with the
> > > > > > way biological knowledge mixes with statistics. I come from the Machine
> > > > > > Learning field, and usually have problems with the naming conventions
> > > > > > (well, among several other things, I must admit). Besides, I am not an
> > > > > > expert in statistics, having used only what was strictly necessary to
> > > > > > validate my own work.
> > > > > >
> > > > > > Well, let's try to be more precise. One of the topics I am working on
> > > > > > most right now is the analysis of methylation array data. As you surely
> > > > > > know, the final processed (and normalized) beta values are presented in a
> > > > > > p x n matrix, with p different probes and n different samples or
> > > > > > individuals from which we have obtained the beta values. I am not
> > > > > > currently working with the raw data.
> > > > > >
> > > > > > Imagine, for a moment, that we have identified two regions of probes, A
> > > > > > and B, with a group of nA probes belonging to A, another group (of nB
> > > > > > probes) that belongs to B, and the intersection is empty. Say that we want
> > > > > > to find a way to show there is a statistically significant difference
> > > > > > between the methylation values of both regions.
> > > > > > As far as I have seen in the literature, comparisons (statistical tests)
> > > > > > are always done by comparing the same probe's values between case and
> > > > > > control groups of individuals or samples, for example when trying to find
> > > > > > differentially methylated probes.
> > > > > >
> > > > > > However, if I think of directly comparing all the beta values from region
> > > > > > A (nA * n values) against the ones in region B (nB * n values) with, say,
> > > > > > a t test, I get the suspicion that something is not being done the way it
> > > > > > should be. My knowledge of biology and statistics is still limited and I
> > > > > > cannot explain why, but I have the feeling that there is something
> > > > > > formally wrong with this approach. Am I right?
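> > > > > > To make my worry concrete, here is the naive pooled comparison next to a per-sample summary (illustrative Python with invented beta values; I am not claiming either is the right test):

```python
from statistics import mean

# Invented betas: 3 probes in region A and 2 in region B, over 4 samples.
region_a = [[0.10, 0.12, 0.11, 0.13],
            [0.20, 0.22, 0.21, 0.19],
            [0.15, 0.16, 0.14, 0.17]]
region_b = [[0.60, 0.62, 0.61, 0.63],
            [0.70, 0.72, 0.71, 0.69]]

# Naive pooling treats all nA*n and nB*n values as independent observations,
# even though values from the same sample (column) are correlated.
pooled_a = [v for probe in region_a for v in probe]
pooled_b = [v for probe in region_b for v in probe]

# A way to respect the sampling unit: average each region within a sample,
# then compare the n per-sample summaries, paired by sample.
per_sample_a = [mean(col) for col in zip(*region_a)]
per_sample_b = [mean(col) for col in zip(*region_b)]
diffs = [b - a for a, b in zip(per_sample_a, per_sample_b)]
# `diffs` has one value per sample, so a paired test on it uses n,
# not nA*n + nB*n, observations.
```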
> > > > > >
> > > > > > What I have done in similar experiments is to find differentiated probes
> > > > > > first, and then test the proportion of differentiated probes out of the
> > > > > > total number of them, so that I could assign a p-value showing a
> > > > > > significant effect of the region of reference.
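> > > > > > In code, that proportion test might look like the following (illustrative Python; the counts and the background rate are invented, and this naive binomial test ignores correlation between neighbouring probes):

```python
from math import comb

def binom_sf(k, n, p):
    """Upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Invented counts: 5% of all tested probes came out differentiated
# genome-wide; region A holds 40 probes, 8 of them differentiated.
background_rate = 0.05
n_region, k_region = 40, 8
p = binom_sf(k_region, n_region, background_rate)
# A small p suggests the region is enriched for differentiated probes
# beyond the genome-wide rate.
```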
> > > > > >
> > > > > > Several questions here: what would be a coherent approach to the regions
> > > > > > A and B problem stated above? Is there any property of methylation data I
> > > > > > am not aware of which makes only the per-probe analysis valid? Any
> > > > > > bibliographic references that could help me see the subtleties involved?
> > > > > >
> > > > > > As you can see, these concepts are quite interleaved in my mind, so any
> > > > > > help would be much appreciated.
> > > > > > Regards,
> > > > > > Gustavo
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > ---------------------------
> > > > > > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> > > > > >
> > > > > > _______________________________________________
> > > > > > Bioconductor mailing list
> > > > > > Bioconductor at r-project.org
> > > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > > > Search the archives:
> > > > > > http://news.gmane.org/gmane.science.biology.informatics.conductor
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > A model is a lie that helps you see the truth.
> > > > >
> > > > > Howard Skipper (http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf)
> > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > >
>
>
>