[BioC] Up & Downregulated genes using DESeq

Tue Mar 27 22:05:59 CEST 2012

Hi Sunny

On 2012-03-27 21:24, Sunny Yu Liu wrote:
> Great explanation! Thanks!
> Have to clarify that when I mention foldchange 2, it is not log2.
> For differentiated expression genes, if taking high confident hits for
> downstream validation and analysis, it should be more consistent and
> reliable. However, I think false positive rate should increase for hits
> close to the cutoff line, and I am wondering how much this is
> caused/affected by the noise level of the method/data. Is any literature
> discussing about this? Say how the noise of data affects cutoff and
> false positive rate. If so, can cite them here? Thanks!

The whole point of statistical hypothesis testing is to give you
guarantees about the false positive rate. If this is unclear,
remind yourself of the exact meaning of the term 'p value'.

In genomics, we usually work with false discovery rate (FDR) control: If
you adjust your p values with Benjamini-Hochberg (BH) and then cut at,
say, .1, this means that the FDR is <= 10%, i.e., the list of all genes
with padj < .1 should contain no more than 10% false positives. By
"should" I mean that a sound hypothesis testing method must provide this
if its assumptions are fulfilled.
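
To make the BH adjustment concrete, here is a minimal sketch in plain
Python. It is not the DESeq code itself; the function name bh_adjust and
the p values are made up for illustration. It computes adjusted p values
with the usual step-up rule and then applies the padj < .1 cut.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that the adjusted values
    # never decrease as the raw p values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.07, 0.074, 0.205, 0.212, 0.216]
padj = bh_adjust(pvals)

# Genes called significant at an FDR of 10%.
hits = [i for i, q in enumerate(padj) if q < 0.1]
print(hits)
```

In R, p.adjust(p, method="BH") performs the same adjustment.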

Benjamini and Hochberg's original false discovery rate (FDR) is a
property of the whole list of genes with adjusted p value below the
chosen threshold and hence some kind of average probability for a gene
to be a false positive. More recently, other authors have introduced
the concept of "local FDR", which aims to capture that genes in the list
have different probabilities of being a false positive, depending on the
individual gene's signal-to-noise ratio.
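
As a rough illustration of the local FDR idea, the sketch below assumes
a simple two-component mixture: test statistics of unchanged genes
follow N(0,1), those of changed genes follow N(3,1), and 90% of genes
are unchanged. All three numbers are assumptions chosen for the example;
in practice they have to be estimated from the data. The local FDR at a
given statistic z is then the posterior probability that a gene with
that z belongs to the null component.

```python
import math


def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of a normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Assumed, purely illustrative mixture parameters.
PI0 = 0.9        # fraction of genes that are truly unchanged
ALT_MEAN = 3.0   # mean test statistic of truly changed genes


def local_fdr(z):
    """Posterior probability that a gene with statistic z is a null gene."""
    null_part = PI0 * normal_pdf(z)
    alt_part = (1.0 - PI0) * normal_pdf(z, mean=ALT_MEAN)
    return null_part / (null_part + alt_part)


# Genes nearer the cut-off (smaller z) get a larger local FDR.
for z in (2.0, 3.0, 4.0):
    print(f"z = {z:.1f}  local FDR = {local_fdr(z):.3f}")
```

Under these assumptions, the local FDR falls steadily as z grows, which
is the sense in which hits near the cut-off are less trustworthy than
the list-wide FDR suggests.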

I was recently searching for a good review on this that can be 
recommended to non-statistician readers, but did not find anything 
suitable. Maybe somebody else on the list has a recommendation?

About verification: Be careful to distinguish two situations:

(i) In a paper you present a list of differentially expressed genes and 
wish to convince the reader that this list does not contain more than 
10% false positives. To this end you perform verification experiments on 
a selection of genes. Here, it would be cheating to attempt to verify
only high-confidence (low p value) hits, because that does not help to
defend the claim that your significance cut-off is in the right place
unless you also try to verify genes close to the cut-off. (Once you pay
attention to it, you will notice that this "cheat" is used quite
commonly in genomics papers.)

(ii) The purpose of your experiment is a screen to select promising 
genes for detailed follow-up study. Then, you should, of course, try 
your luck with the hits with highest confidence.

I hope these explanations helped.

   Simon