[BioC] gene set enrichment analysis of RNA-Seq data

Sat Apr 14 03:41:29 CEST 2012

Dear Julie,

Subject sampling (used by GSEA and other software) actually makes quite 
strong distributional assumptions, in that the units being permuted are 
treated as independent and exchangeable under the null hypothesis.  In 
particular, this assumes that the units are homoscedastic.  However 
RNA-Seq counts for different libraries can be of very different sizes, and 
hence will be heteroscedastic.  And this will be so even under the null 
hypothesis when the sequencing depths are different.
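A minimal simulation sketch of this point, under stated assumptions: a single gene with the same true expression proportion in two libraries (so the null hypothesis holds), purely Poisson sampling noise, and a log2 counts-per-million transform with a 0.5 offset of the kind voom() applies. The library sizes, the proportion, and all function names are hypothetical choices for illustration only.

```python
import math
import random
import statistics


def rpois(lam, rng):
    """Draw one Poisson variate by Knuth's multiplication method (adequate for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def log_cpm(count, lib_size):
    """log2 counts-per-million with a small offset (assumed here to mimic voom's E-values)."""
    return math.log2((count + 0.5) / (lib_size + 1.0) * 1e6)


def simulate_variance(lib_size, proportion, n_sim, rng):
    """Variance of log-cpm across repeated sequencing of the same library."""
    lam = lib_size * proportion  # expected count for this gene
    values = [log_cpm(rpois(lam, rng), lib_size) for _ in range(n_sim)]
    return statistics.variance(values)


rng = random.Random(2012)
proportion = 5e-6  # same true expression level in both libraries (null is true)
var_shallow = simulate_variance(1_000_000, proportion, 20000, rng)   # expected count 5
var_deep = simulate_variance(10_000_000, proportion, 20000, rng)     # expected count 50

print(f"variance of log-cpm, 1M-read library:  {var_shallow:.3f}")
print(f"variance of log-cpm, 10M-read library: {var_deep:.3f}")
```

The shallow library should show a clearly larger variance even though nothing differs biologically, so permuting sample labels would be exchanging units that are not exchangeable.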

This is one of the reasons why I prefer parametric gene set methods (roast, 
camera, romer), because heteroscedasticity and dependence between the 
samples can be unravelled.

Heteroscedasticity will be less of a problem when the library sizes are 
similar or if the counts are large.  When the biological CV in the data is 
relatively large, the E-values from voom() for each gene become roughly 
homoscedastic relatively quickly at moderate count sizes.
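One way to see why, sketched under an assumption not stated in the message: if counts follow a negative binomial model with mean mu and dispersion equal to the squared biological CV, then the squared CV of a count is 1/mu + BCV^2, and by the delta method this approximates the variance of the log-count. The function names and the BCV values below are hypothetical.

```python
def approx_log_variance(mean_count, bcv):
    """Delta-method approximation to var(log count) under an assumed
    negative binomial model: technical part 1/mu plus biological part BCV^2."""
    return 1.0 / mean_count + bcv ** 2


def variance_ratio(bcv, low_count, high_count):
    """How much more variable a low-count gene is than a high-count gene."""
    return approx_log_variance(low_count, bcv) / approx_log_variance(high_count, bcv)


for bcv in (0.1, 0.4):
    for mu in (5, 20, 100, 1000):
        print(f"BCV={bcv:.1f}  mean count={mu:5d}  "
              f"approx var(log count)={approx_log_variance(mu, bcv):.4f}")
    print(f"BCV={bcv:.1f}  variance ratio (count 20 vs 1000) = "
          f"{variance_ratio(bcv, 20, 1000):.2f}")
```

With the larger BCV the biological term dominates at modest counts, so the approximate variance flattens out much sooner than it does when the BCV is small.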

The idea of voom() is that the output values can be treated as continuous. 
This can never be perfect for low counts, but genes with all low counts 
are never going to be significantly DE anyway.  In my experience, getting 
the mean-variance relationship correct is more important than the exact 
distributional law or distinguishing between discrete and continuous.

Best wishes
Gordon

--------------- original message ----------------------
[BioC] gene set enrichment analysis of RNA-Seq data
Julie Leonard julie.leonard at syngenta.com
Fri Apr 13 22:52:27 CEST 2012

Thanks Gordon-

     I'll definitely look into using camera.  I have a naive question 
though.  If I use voom() to transform the RNA-Seq count data, does this 
transformation change the data from discrete to continuous?  And if so, 
can I just use the E values in the EList object returned by 
voom() as input to GSEA (Broad version)?  I would assume I wouldn't need 
the weights b/c GSEA does not assume a specific distribution (it is 
empirically calculated via subject sampling), so I wouldn't need to adjust 
the data for heteroscedasticity to fit a "normal" or any other pre-set 
distribution??

Thanks,
Julie
