[BioC] edgeR i get 377 significant genes where in DESeq i got 0

Wed Mar 14 08:41:12 CET 2012

Hi

On 2012-03-14 00:16, Gordon K Smyth wrote:
> Probably you haven't made a mistake. In our experience, this is the
> typical behaviour of the two packages.
>
> The DESeq people should be applauded for trying something different, but
> commonsense would tell you that setting dispersions equal to the
> "maximum" of two extremes is likely to be conservative, especially when
> there are so few replicates.

Of course, I need to chime in here.

First of all, the poster did indeed make a mistake, see below.

Second, in our experience, the use of the maximum rule, while admittedly 
looking overly conservative, costs surprisingly little power for typical 
data. We performed simulations that convinced at least us that this 
simple scheme is more robust than the WML scheme in edgeR. In our hands, 
the latter failed to control type-I error on simulated data with few 
replicates when the dispersion values were drawn from rather wide 
distributions. Of course, a systematic comparison is still lacking 
and should maybe be done by somebody more unbiased. In my opinion, such 
a comparison should be based on a simulation study that tests how the 
methods deal with simulated data with true dispersion values drawn from 
distributions of different shapes and widths, modeled after real data 
where available.

Now, to Pap's data set:

>> Hi,
>> Assuming i have 2 files:
>> 1's have 1,000,000 reads- one condition
>> 2's have 3,000,000 reads- second condition

Pap has two samples in total, not two replicates per condition, and so 
the whole discussion above is not applicable anyway.

In this case, a proper statistical analysis is not possible. We can try 
to get at least something with workarounds, though.

edgeR used to switch to Poisson mode if presented with data without 
replication, i.e., it assumed zero biological variation, which, of 
course, leads to a large number of hits, which one cannot expect to be
reproducible. Given that there are only 377 hits, this seems to have 
changed, and the edgeR authors will be able to comment on that.
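
To illustrate why a Poisson-mode analysis of unreplicated data gives many 
irreproducible hits, here is a minimal base-R sketch. It is not edgeR's 
code: the Poisson test is approximated by an exact binomial test 
conditional on each gene's total count, and all numbers (gene count, 
mean distribution, dispersion of 0.1) are assumptions chosen for 
illustration. No gene is truly differentially expressed in either data set.

```r
set.seed(1)
n_genes <- 2000                                  # hypothetical number of genes
depth   <- c(1, 3)                               # relative library sizes (1M vs 3M reads)
mu      <- rlnorm(n_genes, meanlog = 5, sdlog = 1)  # assumed true mean per gene
disp    <- 0.1                                   # assumed biological dispersion

# Poisson-mode test: given the total k1 + k2, k1 is binomial with
# success probability depth[1] / sum(depth) under the null hypothesis.
count_poisson_hits <- function(k1, k2, depth, fdr = 0.1) {
  p0 <- depth[1] / sum(depth)
  pval <- mapply(function(a, b) {
    if (a + b == 0) return(1)                    # no counts, no evidence
    binom.test(a, a + b, p = p0)$p.value
  }, k1, k2)
  sum(p.adjust(pval, method = "BH") < fdr)
}

# Data with biological variation (negative binomial), one sample per condition
nb1 <- rnbinom(n_genes, mu = mu * depth[1], size = 1 / disp)
nb2 <- rnbinom(n_genes, mu = mu * depth[2], size = 1 / disp)
hits_nb <- count_poisson_hits(nb1, nb2, depth)

# Data that really are Poisson (no biological variation), for contrast
po1 <- rpois(n_genes, mu * depth[1])
po2 <- rpois(n_genes, mu * depth[2])
hits_pois <- count_poisson_hits(po1, po2, depth)

cat("Hits on overdispersed null data:", hits_nb, "\n")
cat("Hits on truly Poisson null data:", hits_pois, "\n")
```

Every hit on the overdispersed data is a false positive, because the test 
mistakes biological variation for differential expression.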

DESeq offers the method "blind" to deal with data without replicates, 
where it assumes that most genes are not differentially expressed and 
hence estimates the dispersion from a comparison across the two samples. 
Only those genes that "stick out" by showing much stronger differences 
than most genes will be reported.

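This method cannot be combined with the "maximum" rule: the per-gene 
dispersion estimate would then be computed from the very difference 
between the two samples that is being tested, so every gene that is 
"sticking out" would in effect be compared to itself.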

This is why this command here

>> estimateDispersions(cds,method="blind",sharingMode="maximum",fitType="local")

produces a warning informing the user that 'method="blind"' should only 
be used together with 'sharingMode="fit-only"'.

Pap, you may have overlooked this warning. Maybe I should change it to 
an error.

Please try again with 'sharingMode="fit-only"' and let us know what you get.
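
For reference, a no-replicate run with the corrected setting might look 
roughly like the sketch below. The count table is simulated and purely 
hypothetical (gene names, means and the condition labels "A" and "B" are 
assumptions). The code needs the original DESeq package (not DESeq2), 
and 'fitType="local"' additionally needs the locfit package.

```r
library(DESeq)  # the original DESeq package, not DESeq2

# Hypothetical toy data: two samples, one per condition, no replicates
set.seed(1)
n_genes <- 1000
mu <- rlnorm(n_genes, meanlog = 5, sdlog = 1)
counts <- cbind(
  A = rnbinom(n_genes, mu = mu,     size = 10),
  B = rnbinom(n_genes, mu = 3 * mu, size = 10)   # roughly 3x deeper library
)
rownames(counts) <- paste0("gene", seq_len(n_genes))

cds <- newCountDataSet(counts, conditions = c("A", "B"))
cds <- estimateSizeFactors(cds)

# "blind" ignores the condition labels; "fit-only" uses the fitted
# mean-dispersion trend instead of the per-gene maximum.
cds <- estimateDispersions(cds, method = "blind",
                           sharingMode = "fit-only", fitType = "local")

res <- nbinomTest(cds, "A", "B")
sum(res$padj < 0.1, na.rm = TRUE)   # number of genes called at 10% FDR
```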

   Simon