[BioC] edgeR i get 377 significant genes where in DESeq i got 0
Simon Anders
anders at embl.de
Wed Mar 14 08:41:12 CET 2012
Hi
On 2012-03-14 00:16, Gordon K Smyth wrote:
> Probably you haven't made a mistake. In our experience, this is the
> typical behaviour of the two packages.
>
> The DESeq people should be applauded for trying something different, but
> commonsense would tell you that setting dispersions equal to the
> "maximum" of two extremes is likely to be conservative, especially when
> there are so few replicates.
Of course, I need to chime in here.
First of all, the poster did indeed make a mistake, see below.
Second, in our experience, the use of the maximum rule, while admittedly
looking overly conservative, costs surprisingly little power for typical
data. We performed simulations that convinced us, at least, that this
simple scheme is more robust than the WML scheme in edgeR, which, in our
hands, failed to control type-I error on simulated data with few
replicates when the dispersion values were drawn from rather wide
distributions. Of course, a systematic comparison is still lacking
and should maybe be done by somebody more unbiased. In my opinion, such
a comparison should be based on a simulation study that tests how the
methods deal with simulated data with true dispersion values drawn from
distributions of different shapes and widths, modeled after real data
where available.
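To make this kind of simulation concrete, here is a minimal sketch in base R. It is not the simulation referred to above: the function name simulate_counts, the log-normal choice for the dispersion distribution, and all parameter values are assumptions made purely for illustration. No gene is truly differentially expressed in the simulated data, so any hits a method reports on it are false positives.

```r
# Sketch only: simulate counts for a two-group comparison in which each gene's
# true dispersion is drawn from a log-normal distribution of adjustable width.
# All names and parameter values are illustrative assumptions.
simulate_counts <- function(n_genes = 1000, n_rep = 2,
                            disp_meanlog = log(0.1), disp_sdlog = 1,
                            seed = 1) {
  set.seed(seed)
  mu   <- rexp(n_genes, rate = 1 / 200) + 1          # true mean per gene
  disp <- rlnorm(n_genes, disp_meanlog, disp_sdlog)  # true dispersion per gene
  n_samples <- 2 * n_rep
  # rnbinom's 'size' is the inverse dispersion: var = mu + disp * mu^2.
  # mu and disp are recycled down each column, so row i always uses gene i.
  counts <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 1 / disp),
                   nrow = n_genes, ncol = n_samples)
  colnames(counts) <- paste0(rep(c("A", "B"), each = n_rep), seq_len(n_rep))
  list(counts = counts, mu = mu, dispersion = disp)
}

# A larger disp_sdlog gives a wider distribution of true dispersions.
sim <- simulate_counts(disp_sdlog = 1.5)
```

Varying disp_sdlog (and the shape of the distribution itself) and counting the hits each method reports would then give an estimate of how well type-I error is controlled.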
Now, to Pap's data set:
>> Hi,
>> Assuming i have 2 files:
>> 1's have 1,000,000 reads- one condition
>> 2's have 3,000,000 reads- second condition
Pap has two samples in total, not two replicates per condition, and so
the whole discussion above is not applicable anyway.
In this case, a proper statistical analysis is not possible. We can try
to get at least something with workarounds, though.
edgeR used to switch to Poisson mode when presented with data without
replication, i.e., it assumed zero biological variation. This, of
course, leads to a large number of hits that one cannot expect to be
reproducible. Given that there are only 377 hits, this seems to have
changed, and the edgeR authors will be able to comment on that.
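A small base-R sketch may illustrate why assuming zero biological variation produces so many hits. It does not reproduce edgeR's actual Poisson mode; the conditional binomial test, the dispersion value of 0.1, and the distribution of means are all assumptions chosen for illustration. The two samples differ only in sequencing depth (1:3, as in Pap's data) and contain no truly differentially expressed genes.

```r
# Sketch only: two unreplicated samples, no true differential expression,
# but genuine biological variation (negative binomial counts).
set.seed(42)
n_genes <- 2000
mu      <- rexp(n_genes, rate = 1 / 200) + 20   # assumed true means
disp    <- 0.1                                  # assumed true dispersion
depth   <- c(1, 3)                              # relative sequencing depths

y1 <- rnbinom(n_genes, mu = mu * depth[1], size = 1 / disp)
y2 <- rnbinom(n_genes, mu = mu * depth[2], size = 1 / disp)

# Under a Poisson model without differential expression, y1 given y1 + y2
# is Binomial(y1 + y2, p0), with p0 the first sample's share of the depth.
p0 <- depth[1] / sum(depth)
pval <- mapply(function(a, b) {
  if (a + b == 0) return(1)                     # nothing to test
  binom.test(a, a + b, p = p0)$p.value
}, y1, y2)

# Fraction of genes called at the 5% level; it would be about 0.05
# if the Poisson assumption actually held.
frac_hits <- mean(pval < 0.05)
```

Because the counts are overdispersed relative to Poisson, frac_hits comes out far above the nominal 5%, even though every one of these calls is a false positive.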
DESeq offers the method "blind" to deal with data without replicates,
where it assumes that most genes are not differentially expressed and
hence estimates the dispersion from a comparison across the two samples.
Only those genes that "stick out" by showing much stronger differences
than most genes will be reported.
This method cannot be combined with the "maximum" rule, because then
the dispersion of every gene that is "sticking out" would be estimated
from the very difference being tested, i.e., the gene would be compared
to itself.
This is why this command here
>> estimateDispersions(cds,method="blind",sharingMode="maximum",fitType="local")
produces a warning informing the user that 'method="blind"' should only
be used together with 'sharingMode="fit-only"'.
Pap, you may have overlooked this warning. Maybe I should change
it to an error.
Please try again with 'sharingMode="fit-only"' and let us know what you get.
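For reference, the corrected analysis might look roughly like the sketch below. It assumes the original DESeq package is installed and that counts is an integer matrix with one column per sample; the wrapper name run_blind_analysis and the condition labels "A" and "B" are hypothetical, and fitType="local" is simply carried over from the command quoted above.

```r
# Sketch only: unreplicated two-sample comparison with the original DESeq
# package. 'run_blind_analysis' is a hypothetical wrapper name.
run_blind_analysis <- function(counts, conditions = c("A", "B")) {
  cds <- DESeq::newCountDataSet(counts, conditions)
  cds <- DESeq::estimateSizeFactors(cds)
  # method = "blind" ignores the condition labels when estimating dispersion,
  # so it has to be paired with sharingMode = "fit-only".
  cds <- DESeq::estimateDispersions(cds, method = "blind",
                                    sharingMode = "fit-only",
                                    fitType = "local")
  DESeq::nbinomTest(cds, conditions[1], conditions[2])
}
```

Calling run_blind_analysis(counts) should return the usual nbinomTest result table, from which hits can be selected by adjusted p value.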
Simon