[BioC] Checking RNAseq DESeq pipeline

Wed Mar 7 12:59:41 CET 2012

Dear Nat

the work flow you describe looks OK. I would recommend doing more 
visualisation and QA/QC on the data to make sure that what you are 
looking at is not confounded with something else (like reagent batches 
or protocol variation).

Also, can you please make sure that you are using the most recent 
version of DESeq, in particular for what I propose below: 
http://bioconductor.org/packages/2.10/bioc/html/DESeq.html

Can you try something like

vsd <- getVarianceStabilizedData(cds)
pairs(vsd, pch=".")

library("arrayQualityMetrics")
arrayQualityMetrics(vsd)

Re PCR duplicates - did you use extensive PCR-amplication?
How do the read alignments look along the gene models?
Others may have more insight here, but if standard RNA-Seq was done, and 
read alignment profiles are not excessively dominated by duplicates, I 
think it is OK to use the data without duplicate removel. OTOH, it 
shouldn't make too much of a difference.

Hope this helps,
	Wolfgang

nathalie scripsit 03/06/2012 04:54 PM:
> Hi all,
> I have done an differential expression analysis using DESeq on a set of
> data and got a lot of differential expressed genes at the end. I would
> like people to check the code and the assumptions I made.
>
> A-Samples
> I am analysing 6 samples in total from 3 different cell types (E, S and H),
> S1and S2 ( biological replicates) H1, H2 and H3 ( biological replicate)
> E (no replicate)
> We took from one animal cells called S ( sample S1) we passaged several
> time these cells and isolated samples we called S2 ( sample S2)
> from the cell population S ,we took subclones which we cultivated with
> factors to obtain 3 independant cell lines of the same cell type which
> were derived independently from the S population, called H1 H2 and H3
> (biological replicates).
> AT last, another cell line was derived from clone H1 which we call E1
> (only sample).
>
> B-standard RNA seq was performed in these 6 samples
> AFter sequencing, we aligned the samples using topHat and count reads
> with HTSEQ python script.
> Duplicates haven't been removed in this process. I plan to do one
> analysis with removed duplicates.
>
> C-then once in DESeq I have created my count table and followed the
> vignette-I have test the diff Exp of condition F vs H
>  > conds
> [1] E F F H H H
> Levels: E F H
>  > head(countsTable)
> E F F.1 H H.1 H.2
> 1 789 195 235 1019 938 1048
> 2 1 0 0 11 17 31
> 3 543 325 460 818 883 644
> 4 103 89 103 114 87 96
> 5 281 41 102 328 245 387
> 6 1 2 1 9 8 7
>  > cds <- newCountDataSet( countsTable, conds )
>
>  > sizeFactors(cds)
> E F F.1 H H.1 H.2
> 1.0417242 0.6740896 0.8994357 1.2268668 1.1499580 1.1831515
>
>  > cds<-estimateDispersions(cds)
>  > str( fitInfo(cds) )
> List of 5
> $ perGeneDispEsts: num [1:54665] 0.000954 0.592346 0.023319 0.000245
> 0.055592 ...
> $ dispFunc :function (q)
> ..- attr(*, "coefficients")= Named num [1:2] 0.0581 1.8325
> .. ..- attr(*, "names")= chr [1:2] "asymptDisp" "extraPois"
> ..- attr(*, "fitType")= chr "parametric"
> $ fittedDispEsts : num [1:54665] 0.061 0.2741 0.0612 0.0766 0.0669 ...
> $ df : int 3
> $ sharingMode : chr "maximum"
>
>  >head(fData(cds))
> disp_pooled
> 1 0.06096110
> 2 0.59234620
> 3 0.06124489
> 4 0.07657296
> 5 0.06688364
>
>  > str( fitInfo(cds) )
> List of 5
> $ perGeneDispEsts: num [1:54665] 0.000954 0.592346 0.023319 0.000245
> 0.055592 ...
> $ dispFunc :function (q)
> ..- attr(*, "coefficients")= Named num [1:2] 0.0581 1.8325
> .. ..- attr(*, "names")= chr [1:2] "asymptDisp" "extraPois"
> ..- attr(*, "fitType")= chr "parametric"
> $ fittedDispEsts : num [1:54665] 0.061 0.2741 0.0612 0.0766 0.0669 ...
> $ df : int 3
> $ sharingMode : chr "maximum"
>
>  > res <- nbinomTest( cds, "H", "F" )
>  > head(res)
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval
> 1 1 616.515373 844.007629 275.276988 0.3261546 -1.6163720 6.817493e-06
> 2 2 9.990057 16.650096 0.000000 0.0000000 -Inf 2.662914e-03
> 3 3 594.493135 659.634053 496.781759 0.7531172 -0.4090537 2.522510e-01
> 4 4 99.251992 83.237929 123.273085 1.4809725 0.5665449 1.666241e-01
> 5 5 196.343734 269.163821 87.113605 0.3236453 -1.6275146 6.472665e-05
> 6 6 4.857542 6.736313 2.039386 0.3027452 -1.7238239 2.449359e-01
> padj
> 1 4.775233e-05
> 2 1.088826e-02
> 3 5.047015e-01
> 4 3.642458e-01
> 5 3.807362e-04
> 6 4.934803e-01
>
>
> plotDE <- function( res )
> plot(
> res$baseMean,
> res$log2FoldChange,
> log="x", pch=20, cex=.3,
> col = ifelse( res$padj < .1, "red", "black" ) )
>
>
> I have attached As pdf the plotDE( res ) , hist(res$pval, breaks=100,
> col="skyblue", border="slateblue", main="") and plotDispEsts(cds) for a
> basic idea of the data.
>
>
> I have ~10k genes which show a v low pvalue for DExpression which seem a
> lot.
>
> I have several questions,
> 1-Concerning the experimental design I have described in A , can the
> samples be considered as real bio replicate?
> 2-Would it be "Better" to get rid of duplicated reads prior to HTSeq
> 3-How can I be more stringent on the expression?
>
>
>
> many thanks
> Nat
>
>
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Best wishes
	Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber