[BioC] gene set enrichment analysis of RNA-Seq data
Gordon K Smyth
smyth at wehi.EDU.AU
Sat Apr 14 03:41:29 CEST 2012
Dear Julie,
Subject sampling (used by GSEA and other software) actually makes quite
strong distributional assumptions, in that the units being permuted are
treated as independent and exchangeable under the null hypothesis. In
particular, this assumes that the units are homoscedastic. However,
RNA-Seq libraries can be of very different sizes, and the counts will
hence be heteroscedastic. This will be so even under the null
hypothesis when the sequencing depths are different.
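
As a rough illustration of this point, the sketch below compares the
approximate standard deviation of the log2 count for one gene in a shallow
and a deep library when nothing is differentially expressed. It assumes a
negative binomial variance and uses a delta-method approximation; the
function name and all numeric values are hypothetical, and this is not how
voom() or GSEA compute anything.

```python
import math


def approx_sd_log2_count(expected_count, bcv):
    """Delta-method approximation to the SD of a log2 count.

    Assumes a negative binomial variance mu + bcv^2 * mu^2, so the squared
    coefficient of variation is 1/mu + bcv^2. Illustrative only.
    """
    squared_cv = 1.0 / expected_count + bcv ** 2
    return math.sqrt(squared_cv) / math.log(2)


# Hypothetical values chosen for illustration.
gene_proportion = 5e-6           # the gene accounts for 5 reads per million
bcv = 0.1                        # biological coefficient of variation
shallow_lib, deep_lib = 1e6, 20e6

sd_shallow = approx_sd_log2_count(gene_proportion * shallow_lib, bcv)
sd_deep = approx_sd_log2_count(gene_proportion * deep_lib, bcv)

print(f"shallow library: SD of log2 count ~ {sd_shallow:.2f}")
print(f"deep library:    SD of log2 count ~ {sd_deep:.2f}")
```

The same gene with the same true expression level is several times noisier
in the shallow library, so the two samples are not exchangeable even though
the null hypothesis holds.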
This is one of the reasons why I prefer parametric gene set methods (roast,
camera, romer), because heteroscedasticity and dependence between the
samples can be accounted for.
Heteroscedasticity will be less of a problem when the library sizes are
similar or if the counts are large. When the biological CV in the data is
relatively large, the E-values from voom() for each gene become roughly
homoscedastic relatively quickly at moderate count sizes.
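
A small numerical sketch of this behaviour, under the same kind of
assumption (negative binomial variance, delta-method approximation on the
log scale): it prints the ratio of log-count variances between a shallow
library and one that is `depth_ratio` times deeper. The function name and
the chosen BCV and count values are hypothetical, and the formula is only
an approximation, not what voom() estimates from real data.

```python
def variance_ratio(expected_count, bcv, depth_ratio):
    """Approximate ratio of log-count variances, shallow vs. deep library.

    expected_count is the expected count in the shallow library; the deep
    library is depth_ratio times larger. The squared CV of a negative
    binomial count is 1/mu + bcv^2, which approximates the variance of the
    log count (the log base cancels in the ratio).
    """
    phi = bcv ** 2
    shallow = 1.0 / expected_count + phi
    deep = 1.0 / (expected_count * depth_ratio) + phi
    return shallow / deep


depth_ratio = 20  # hypothetical 20-fold difference in sequencing depth
for bcv in (0.1, 0.4):
    for expected_count in (5, 50, 500):
        ratio = variance_ratio(expected_count, bcv, depth_ratio)
        print(f"BCV={bcv:.1f}  count={expected_count:>3}  "
              f"variance ratio ~ {ratio:.2f}")
```

A ratio close to 1 means the two libraries are roughly homoscedastic for
that gene. Under these assumptions the ratio falls towards 1 as the counts
grow, and it does so at smaller counts when the BCV is larger.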
The idea of voom() is that the output values can be treated as continuous.
This can never be perfect for low counts, but genes with all low counts
are never going to be significantly DE anyway. In my experience, getting
the mean-variance relationship correct is more important than the exact
distributional law or distinguishing between discrete and continuous.
Best wishes
Gordon
--------------- original message ----------------------
[BioC] gene set enrichment analysis of RNA-Seq data
Julie Leonard julie.leonard at syngenta.com
Fri Apr 13 22:52:27 CEST 2012
Thanks Gordon-
I'll definitely look into using camera. I have a naive question
though. If I use voom() to transform the RNA-Seq count data, does this
transformation change the data from discrete to continuous? And if so,
can I just use the E values in the EList object that is output by
voom() as input to GSEA (Broad version)? I would assume I wouldn't need
the weights because GSEA does not assume a specific distribution (it is
empirically calculated via subject sampling), so I wouldn't need to adjust
the data for heteroscedasticity to fit a "normal" or any other pre-set
distribution?
Thanks,
Julie