[Bioc-sig-seq] Supplying own variance functions and adjusted counts to a DESeq dataset

Sat Jul 16 13:00:12 CEST 2011

[repost, as my original post of yesterday somehow got dropped by the 
mailing list manager]

Hi Sean

On 2011-07-14 21:54, Sean Ruddy wrote:
> I have a RNA-Seq count data set that requires separate offset values for
> each tag and sample. DESeq does not appear to take a matrix of offset values
> (unlike edgeR) in any of its functions so I've carried out the analysis
> manually, ie. calculating a size factor for each tag of each sample,
> adjusting the counts, then proceeding to calculate means and variances of
> the adjusted counts, and finally fitting a curve for each condition to the
> mean-var plot using locfit().
>
> Essentially, I'd like to put these variance functions (or at least all the
> predicted variances) and adjusted counts inside a DESeq object so that I can
> take advantage of the other functions DESeq offers, tests, plots, etc...

We refactored things a bit in the devel version, and it is now easier to 
inject your own variance estimates.

If you now run 'estimateDispersions', it adds columns 'disp_<cond>' 
(where <cond> is the name of a condition, or "pooled" or "blind", 
depending on the "method" argument) to the feature data slot. If you want 
to use your own dispersion estimation scheme, you can just put values 
there, and the testing functions will use them.

However, I understand that you are actually happy with the estimation, 
you just want to pass gene-specific size factors, presumably to correct 
for GC biases. The next planned step in our refactoring effort was to 
offer a slot where you would pass a matrix of values, of the same 
dimensions as the count table, which will be multiplied by the size 
factors each time they are used. From your post, I learned that the 
edgeR authors were again faster than we were ;-) and have already added 
such a feature. As demand for this will increase (e.g. to interface to 
the new 'cqn' package that Hansen, Irizarry and Wu announced in their 
recent preprint), we had better add it, too, I guess.
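
To make the idea concrete, here is a minimal sketch in plain Python 
(this is not DESeq code; the function and variable names are made up 
for illustration) of how per-sample size factors could be multiplied 
into a genes-by-samples matrix of normalization factors, and how the 
resulting per-gene, per-sample factors would then be used to adjust a 
count table:

```python
def effective_factors(size_factors, norm_factors):
    """Combine per-sample size factors with a genes x samples matrix.

    size_factors: one value per sample.
    norm_factors: one row per gene, one column per sample (hypothetical
    input, e.g. a GC-content correction), same shape as the count table.
    """
    n_samples = len(size_factors)
    for row in norm_factors:
        if len(row) != n_samples:
            raise ValueError("each row needs one factor per sample")
    # Element-wise product: factor[i][j] = norm_factors[i][j] * size_factors[j]
    return [[nf * sf for nf, sf in zip(row, size_factors)]
            for row in norm_factors]


def adjust_counts(counts, factors):
    """Divide every count by its gene- and sample-specific factor."""
    return [[k / f for k, f in zip(count_row, factor_row)]
            for count_row, factor_row in zip(counts, factors)]


# Toy example: two genes, two samples.
counts = [[10, 20], [30, 60]]
size_factors = [1.0, 2.0]
norm_factors = [[1.0, 1.0], [2.0, 0.5]]

factors = effective_factors(size_factors, norm_factors)
adjusted = adjust_counts(counts, factors)
```

With all entries of the matrix equal to 1, this reduces to the usual 
division by the per-sample size factors.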

Until then, have a look at the source code of DESeq: You will notice 
that we cleanly separated the interface functions that deal with the 
CountDataSet objects from the calculation functions that just work on 
matrices. So, if you want to use functionality that should be there 
but is hard to use due to the format of the CountDataSet object, you can 
typically call the core function directly. For example, the function 
'estimateAndFitDispersionsFromBaseMeansAndVariances' takes, as its name 
says, base means and variances and returns a mean-dispersion fit.
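
As an illustration of the kind of matrix-level calculation meant here, 
the following sketch (plain Python, not the DESeq implementation; all 
names are hypothetical) computes per-gene base means and variances from 
adjusted counts, plus a simple method-of-moments dispersion. It assumes 
negative-binomial counts with variance s*mu + alpha*(s*mu)^2, so that 
counts divided by their factor s have variance mu/s + alpha*mu^2:

```python
from statistics import mean, variance


def base_means_and_variances(adjusted):
    """Per-gene mean and sample variance (n - 1 denominator) of adjusted counts."""
    means = [mean(row) for row in adjusted]
    variances = [variance(row) for row in adjusted]
    return means, variances


def raw_dispersions(means, variances, factors):
    """Method-of-moments estimate alpha = (var - mu * xi) / mu^2.

    xi is the gene's average of 1 / factor over samples, i.e. the
    shot-noise contribution under the assumption stated above.
    """
    result = []
    for mu, var, factor_row in zip(means, variances, factors):
        if mu <= 0:
            result.append(float("nan"))  # undefined for genes without counts
            continue
        xi = mean([1.0 / f for f in factor_row])
        result.append((var - mu * xi) / mu ** 2)
    return result


# Toy example: one gene, three samples, all factors equal to 1.
adjusted_counts = [[5.0, 15.0, 10.0]]
unit_factors = [[1.0, 1.0, 1.0]]

gene_means, gene_vars = base_means_and_variances(adjusted_counts)
gene_disps = raw_dispersions(gene_means, gene_vars, unit_factors)
```

Such per-gene estimates are noisy and can come out negative for 
individual genes, which is why a curve is fitted through them rather 
than using them directly.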

   Simon