[BioC] Checking RNAseq DESeq pipeline
Wolfgang Huber
whuber at embl.de
Wed Mar 7 12:59:41 CET 2012
Dear Nat
the work flow you describe looks OK. I would recommend doing more
visualisation and QA/QC on the data to make sure that what you are
looking at is not confounded with something else (like reagent batches
or protocol variation).
Also, can you please make sure that you are using the most recent
version of DESeq, in particular for what I propose below:
http://bioconductor.org/packages/2.10/bioc/html/DESeq.html
Can you try something like
vsd <- getVarianceStabilizedData(cds)
pairs(vsd, pch=".")
library("arrayQualityMetrics")
arrayQualityMetrics(vsd)
Re PCR duplicates - did you use extensive PCR-amplication?
How do the read alignments look along the gene models?
Others may have more insight here, but if standard RNA-Seq was done, and
read alignment profiles are not excessively dominated by duplicates, I
think it is OK to use the data without duplicate removel. OTOH, it
shouldn't make too much of a difference.
Hope this helps,
Wolfgang
nathalie scripsit 03/06/2012 04:54 PM:
> Hi all,
> I have done an differential expression analysis using DESeq on a set of
> data and got a lot of differential expressed genes at the end. I would
> like people to check the code and the assumptions I made.
>
> A-Samples
> I am analysing 6 samples in total from 3 different cell types (E, S and H),
> S1and S2 ( biological replicates) H1, H2 and H3 ( biological replicate)
> E (no replicate)
> We took from one animal cells called S ( sample S1) we passaged several
> time these cells and isolated samples we called S2 ( sample S2)
> from the cell population S ,we took subclones which we cultivated with
> factors to obtain 3 independant cell lines of the same cell type which
> were derived independently from the S population, called H1 H2 and H3
> (biological replicates).
> AT last, another cell line was derived from clone H1 which we call E1
> (only sample).
>
> B-standard RNA seq was performed in these 6 samples
> AFter sequencing, we aligned the samples using topHat and count reads
> with HTSEQ python script.
> Duplicates haven't been removed in this process. I plan to do one
> analysis with removed duplicates.
>
> C-then once in DESeq I have created my count table and followed the
> vignette-I have test the diff Exp of condition F vs H
> > conds
> [1] E F F H H H
> Levels: E F H
> > head(countsTable)
> E F F.1 H H.1 H.2
> 1 789 195 235 1019 938 1048
> 2 1 0 0 11 17 31
> 3 543 325 460 818 883 644
> 4 103 89 103 114 87 96
> 5 281 41 102 328 245 387
> 6 1 2 1 9 8 7
> > cds <- newCountDataSet( countsTable, conds )
>
> > sizeFactors(cds)
> E F F.1 H H.1 H.2
> 1.0417242 0.6740896 0.8994357 1.2268668 1.1499580 1.1831515
>
> > cds<-estimateDispersions(cds)
> > str( fitInfo(cds) )
> List of 5
> $ perGeneDispEsts: num [1:54665] 0.000954 0.592346 0.023319 0.000245
> 0.055592 ...
> $ dispFunc :function (q)
> ..- attr(*, "coefficients")= Named num [1:2] 0.0581 1.8325
> .. ..- attr(*, "names")= chr [1:2] "asymptDisp" "extraPois"
> ..- attr(*, "fitType")= chr "parametric"
> $ fittedDispEsts : num [1:54665] 0.061 0.2741 0.0612 0.0766 0.0669 ...
> $ df : int 3
> $ sharingMode : chr "maximum"
>
> >head(fData(cds))
> disp_pooled
> 1 0.06096110
> 2 0.59234620
> 3 0.06124489
> 4 0.07657296
> 5 0.06688364
>
> > str( fitInfo(cds) )
> List of 5
> $ perGeneDispEsts: num [1:54665] 0.000954 0.592346 0.023319 0.000245
> 0.055592 ...
> $ dispFunc :function (q)
> ..- attr(*, "coefficients")= Named num [1:2] 0.0581 1.8325
> .. ..- attr(*, "names")= chr [1:2] "asymptDisp" "extraPois"
> ..- attr(*, "fitType")= chr "parametric"
> $ fittedDispEsts : num [1:54665] 0.061 0.2741 0.0612 0.0766 0.0669 ...
> $ df : int 3
> $ sharingMode : chr "maximum"
>
> > res <- nbinomTest( cds, "H", "F" )
> > head(res)
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval
> 1 1 616.515373 844.007629 275.276988 0.3261546 -1.6163720 6.817493e-06
> 2 2 9.990057 16.650096 0.000000 0.0000000 -Inf 2.662914e-03
> 3 3 594.493135 659.634053 496.781759 0.7531172 -0.4090537 2.522510e-01
> 4 4 99.251992 83.237929 123.273085 1.4809725 0.5665449 1.666241e-01
> 5 5 196.343734 269.163821 87.113605 0.3236453 -1.6275146 6.472665e-05
> 6 6 4.857542 6.736313 2.039386 0.3027452 -1.7238239 2.449359e-01
> padj
> 1 4.775233e-05
> 2 1.088826e-02
> 3 5.047015e-01
> 4 3.642458e-01
> 5 3.807362e-04
> 6 4.934803e-01
>
>
> plotDE <- function( res )
> plot(
> res$baseMean,
> res$log2FoldChange,
> log="x", pch=20, cex=.3,
> col = ifelse( res$padj < .1, "red", "black" ) )
>
>
> I have attached As pdf the plotDE( res ) , hist(res$pval, breaks=100,
> col="skyblue", border="slateblue", main="") and plotDispEsts(cds) for a
> basic idea of the data.
>
>
> I have ~10k genes which show a v low pvalue for DExpression which seem a
> lot.
>
> I have several questions,
> 1-Concerning the experimental design I have described in A , can the
> samples be considered as real bio replicate?
> 2-Would it be "Better" to get rid of duplicated reads prior to HTSeq
> 3-How can I be more stringent on the expression?
>
>
>
> many thanks
> Nat
>
>
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Best wishes
Wolfgang
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
More information about the Bioconductor
mailing list