[BioC] how edgeR control outliers?
Gordon K Smyth
smyth at wehi.EDU.AU
Sun Mar 4 00:46:00 CET 2012
Dear Yuan,
Data analysis decisions are not made on the basis of one picture, and I
have not seen your other plots. However, the qqnorm plot suggests to me
that you do not actually have outliers, because there are no individual
points that stand out. Rather you have an extraordinarily large degree of
diversity in the tagwise dispersions, as evidenced by a large number of
qqnorm points above the line in the upper half of the plot. From an edgeR
point of view, I would suggest using a smaller value for prior.n. From a
biological point of view, I would wonder whether the two groups you are
comparing are truly homogeneous. I would wonder whether the tagwise
dispersions are reflecting differential expression within groups.
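
The distinction drawn above can be illustrated with a small simulation in base R. This is only a sketch under stated assumptions: the statistics are simulated rather than taken from a real fit, the residual degrees of freedom (4), the number of genes and the spread of the gene-specific scale factors are arbitrary, and `to_z` is a hypothetical helper that converts chi-square statistics to normal z-scores.

```r
# Illustrative simulation: a few true outliers versus broad heterogeneity.
set.seed(3)
n_genes <- 2000
df_resid <- 4   # assumed residual degrees of freedom

# Hypothetical helper: convert chi-square statistics to z-scores, working
# on the log scale of the upper tail so large statistics stay finite.
to_z <- function(stat, df) {
  logp <- pchisq(stat, df = df, lower.tail = FALSE, log.p = TRUE)
  qnorm(logp, lower.tail = FALSE, log.p = TRUE)
}

# Scenario 1: the model fits well except for three isolated outliers
stat1 <- rchisq(n_genes, df = df_resid)
stat1[1:3] <- c(80, 100, 120)

# Scenario 2: no isolated outliers, but every gene has its own scale
# factor, mimicking dispersions that vary more than the model allows
scale2 <- exp(rnorm(n_genes, sd = 0.7))
stat2 <- scale2 * rchisq(n_genes, df = df_resid)

z1 <- to_z(stat1, df_resid)
z2 <- to_z(stat2, df_resid)

# Fraction of genes above z = 2 (about 0.023 for a standard normal)
frac_above <- c(outliers = mean(z1 > 2), heterogeneous = mean(z2 > 2))
frac_above

op <- par(mfrow = c(1, 2))
qqnorm(z1, main = "Few outliers"); qqline(z1)
qqnorm(z2, main = "Heterogeneous"); qqline(z2)
par(op)
```

In the first panel only the three planted points stand apart from the line, whereas in the second a large share of the upper half lies above it without any single point standing out.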
Best wishes
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth
On Thu, 1 Mar 2012, Yuan Tian wrote:
Dear Gordon,
I made the qqplot following the instructions in your last email, and I got
the attached plot. How should we interpret the results? According to the
gof() function with an adjusted p-value cutoff of 0.1, no genes are detected
as outliers, but according to the qqplot, the fit does not seem to be
very good.
Here I used tagwise dispersion values.
[Attachment scrubbed: Screen shot 2012-03-01 at 8.25.38 PM.png (image/png, 28854 bytes)
<https://stat.ethz.ch/pipermail/bioconductor/attachments/20120301/9a6c52ea/attachment.png>]
Yuan
On Mar 1, 2012, at 2:50 PM, Gordon K Smyth wrote:
> Dear Yuan,
>
> The deviance is a standard quantity in generalized linear model theory,
> analogous to the residual sum of squares in ANOVA. It is usually treated
> as chi-square distributed, although this approximation can be rough in some
> cases. See for example:
>
> http://en.wikipedia.org/wiki/Deviance_(statistics)
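
As a rough illustration of the chi-square approximation mentioned above, the following base R sketch simulates Poisson counts for one gene in two groups of three samples and compares the residual deviances with a chi-square distribution on 4 degrees of freedom. The Poisson model, the group sizes, the mean of 50 and the name `sim_dev` are assumptions chosen for illustration; edgeR itself fits negative binomial GLMs.

```r
# Illustrative simulation (not edgeR code): residual deviance of a
# two-group Poisson GLM, compared with its chi-square approximation.
set.seed(1)
group <- factor(rep(c("A", "B"), each = 3))   # 6 samples, 2 groups
n_sim <- 1000

sim_dev <- replicate(n_sim, {
  y <- rpois(length(group), lambda = 50)      # no true group effect
  fit <- glm(y ~ group, family = poisson())
  deviance(fit)                               # residual deviance
})

df_resid <- length(group) - nlevels(group)    # 6 - 2 = 4

# Under the approximation the mean deviance should be close to df_resid
mean(sim_dev)

# Compare empirical and theoretical quantiles
probs <- c(0.5, 0.9, 0.99)
rbind(empirical = quantile(sim_dev, probs),
      chisq     = qchisq(probs, df = df_resid))
```

With counts this large the two sets of quantiles should agree closely; with very small counts the approximation is expected to be rougher.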
>
> Yes, when I said to test for outliers using the gof() function in
>
> https://stat.ethz.ch/pipermail/bioconductor/2012-January/043187.html
>
> I meant that outliers are those with large gof statistics. The
> calculation of p-values to test for outliers is already done for you by
> the gof() function.
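
A minimal base R sketch of what such a test amounts to is given here. It assumes that the goodness-of-fit statistics are compared with a chi-square distribution on the residual degrees of freedom and that the p-values are adjusted by Holm's method with a 0.1 cutoff; the simulated statistics and all variable names are illustrative, and this is not the source code of gof().

```r
# Illustrative only: flag genes whose goodness-of-fit statistic is too
# large for a chi-square distribution on the residual degrees of freedom.
set.seed(2)
df_resid <- 4                         # assumed residual degrees of freedom
stat <- rchisq(1000, df = df_resid)   # simulated well-fitting genes
stat[1:3] <- c(60, 75, 90)            # three planted outliers

# Upper-tail p-values: a large statistic gives a small p-value
p <- pchisq(stat, df = df_resid, lower.tail = FALSE)

# Multiple-testing adjustment (Holm is an assumption made here)
p_adj <- p.adjust(p, method = "holm")
outlier <- p_adj < 0.1

which(outlier)   # indices of the genes flagged as outliers
```

Because the adjustment is applied across all genes, a gene is flagged only when its statistic is extreme relative to the whole set, which is why a plot can look poor overall while no individual gene is called an outlier.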
>
> Figure 2 of the following article provides some plots of gof()
> statistics:
>
> http://nar.oxfordjournals.org/content/early/2012/01/28/nar.gks042
>
> The plots are made by
>
> g <- gof(fit)   # fit is the fitted GLM object
> z <- zscoreGamma(g$gof.statistics,shape=g$df/2,scale=2)   # g$df, not gof$df
> qqnorm(z)
>
> Another very useful diagnostic is to plot the tagwise dispersion against
> abundance. Outliers may appear as large dispersions. In the
> development version of edgeR, there is a function plotBCV() provided to
> do this.
>
> Best wishes
> Gordon
>
>> Date: Wed, 29 Feb 2012 20:09:06 -0800
>> From: Yuan Tian <ytianidyll at ucla.edu>
>> To: Bioconductor mailing list <bioconductor at r-project.org>
>> Subject: [BioC] how edgeR control outliers?
>>
>> Dear all,
>>
>> I'm currently using edgeR to detect differentially expressed genes
>> from an RNA-seq dataset, and I'm also using the gof() function to test for
>> potential outliers. I have two questions regarding outlier detection
>> and would like to have your suggestions.
>>
>> 1) How is an outlier defined? Is it a gene that has a deviance
>> larger than a threshold? How is the deviance contained in the glmfit data
>> calculated?
>>
>> 2) The gof() function assumes that the deviance follows a
>> chi-squared distribution. But what is the statistical basis for this
>> assumption?
>>
>> Thanks!
>>
>> Yuan