[BioC] Up & Downregulated genes using DESeq
Simon Anders
anders at embl.de
Tue Mar 27 22:05:59 CEST 2012
Hi Sunny,
On 2012-03-27 21:24, Sunny Yu Liu wrote:
> Great explanation! Thanks!
> Have to clarify that when I mention foldchange 2, it is not log2.
> For differentiated expression genes, if taking high confident hits for
> downstream validation and analysis, it should be more consistent and
> reliable. However, I think false positive rate should increase for hits
> close to the cutoff line, and I am wondering how much this is
> caused/affected by the noise level of the method/data. Is any literature
> discussing about this? Say how the noise of data affects cutoff and
> false positive rate. If so, can cite them here? Thanks!
The whole point of statistical hypothesis testing is to give you
guarantees about the false positive rate. If this is unclear, remind
yourself of the exact meaning of the term 'p value'.
In genomics, we usually work with false discovery rate (FDR) control: If
you adjust your p values with Benjamini-Hochberg (BH) and then cut at,
say, .1, this means that the FDR is <= 10%, i.e., the list of all genes
with padj < .1 should contain, in expectation, no more than 10% false
positives. By "should" I mean that a sound hypothesis testing method
must provide this if its assumptions are fulfilled.
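To make the BH adjustment concrete, here is a minimal sketch in plain
Python. The function name bh_adjust and the example p values are made up
for illustration; in practice you would use an established
implementation (for example p.adjust with method="BH" in R) rather than
this one.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p values, in the input order."""
    m = len(pvalues)
    # Indices of the p values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, so that each adjusted value is
    # the minimum of p * m / rank over this rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p values for ten genes (illustration only).
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.07, 0.074, 0.205, 0.212, 0.216]
padj = bh_adjust(pvals)

# Genes called significant at an FDR of 10%.
hits = [i for i, q in enumerate(padj) if q < 0.1]
print(hits)
```

Note that the adjusted value of a gene depends on the whole set of
p values, which is why the FDR is a statement about the list and not
about any single gene in it.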
Benjamini and Hochberg's original false discovery rate (FDR) is a
property of the whole list of genes with adjusted p value below the
chosen threshold and hence a kind of average probability for a gene in
that list to be a false positive. More recently, other authors have
introduced the concept of "local FDR", which aims to capture that the
genes in the list have different probabilities of being false positives,
depending on each individual gene's signal-to-noise ratio.
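The difference between the list-wide FDR and a local FDR can be
illustrated with a toy two-group model. Everything in this sketch is an
assumption made for illustration: test statistics of null genes follow
N(0,1), those of truly changed genes follow N(3,1), 90% of genes are
null, and the test is one-sided. Real local FDR methods have to estimate
these quantities from the data.

```python
import math

PI0 = 0.9        # assumed fraction of null genes (illustration only)
ALT_MEAN = 3.0   # assumed mean z score of truly changed genes


def normal_pdf(z, mean=0.0):
    """Density of a unit-variance normal distribution at z."""
    return math.exp(-0.5 * (z - mean) ** 2) / math.sqrt(2.0 * math.pi)


def normal_upper_tail(z, mean=0.0):
    """P(Z >= z) for a unit-variance normal distribution."""
    return 0.5 * math.erfc((z - mean) / math.sqrt(2.0))


def local_fdr(z):
    """Probability that a gene with exactly this z score is null."""
    null = PI0 * normal_pdf(z)
    alt = (1.0 - PI0) * normal_pdf(z, ALT_MEAN)
    return null / (null + alt)


def tail_fdr(z):
    """Expected fraction of nulls among all genes with a score >= z."""
    null = PI0 * normal_upper_tail(z)
    alt = (1.0 - PI0) * normal_upper_tail(z, ALT_MEAN)
    return null / (null + alt)


for cutoff in (2.0, 3.0, 4.0):
    print(cutoff, round(tail_fdr(cutoff), 4), round(local_fdr(cutoff), 4))
```

Under these assumptions, the local FDR of a gene sitting right at the
cut-off is higher than the FDR of the whole list defined by that
cut-off, which is the formal version of the intuition that hits close
to the threshold are the least reliable ones.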
I was recently searching for a good review on this that can be
recommended to non-statistician readers, but did not find anything
suitable. Maybe somebody else on the list has a recommendation?
About verification: Be careful to distinguish two situations:
(i) In a paper you present a list of differentially expressed genes and
wish to convince the reader that this list does not contain more than
10% false positives. To this end you perform verification experiments on
a selection of genes. Here, it would be cheating to only attempt to
verify high-confidence (low p value) hits, because this does not help
to defend the claim that your significance cut-off is at the right place
if you do not also try to verify genes close to the cut-off. (Once you
pay attention to it, you will notice that this "cheat" is used quite
commonly in genomics papers.)
(ii) The purpose of your experiment is a screen to select promising
genes for detailed follow-up study. Then, you should, of course, try
your luck with the hits with highest confidence.
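The two selection strategies can be sketched as follows. The gene names,
the padj values and both function names are hypothetical, and evenly
spaced picks are only one possible way to cover the list; drawing genes
at random from the significant list would serve situation (i) equally
well.

```python
def pick_for_verification(hits, k):
    """Situation (i): k evenly spaced picks across the ranked hit list,
    always including the best hit and the one closest to the cut-off.
    Assumes 2 <= k <= len(hits)."""
    ranked = sorted(hits, key=lambda h: h[1])
    n = len(ranked)
    return [ranked[(i * (n - 1)) // (k - 1)] for i in range(k)]


def pick_for_screen(hits, k):
    """Situation (ii): simply the k hits with the smallest padj."""
    return sorted(hits, key=lambda h: h[1])[:k]


# Hypothetical significant genes as (name, padj) pairs, all below 0.1.
hits = [("g01", 0.0004), ("g02", 0.002), ("g03", 0.006), ("g04", 0.011),
        ("g05", 0.020), ("g06", 0.034), ("g07", 0.051), ("g08", 0.067),
        ("g09", 0.083), ("g10", 0.097)]

print([name for name, _ in pick_for_verification(hits, 4)])
print([name for name, _ in pick_for_screen(hits, 4)])
```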
I hope these explanations helped.
Simon