[Bioc-sig-seq] Dealing with pileups/duplicates in RNAseq

Fri Apr 23 18:50:12 CEST 2010

Hi all,

Sorry for abusing the list (and *-seq terminology) as this isn't
really a Bioconductor-related question, but I was curious how you all
deal with "pileups" in RNAseq data. By pileup I mean separate
observations of the same read (i.e., two or more different reads that
map to the exact same genomic locus), a.k.a. duplicate reads.

I'm pretty sure it's common practice to remove them in ChIP-seq
experiments since, I believe, they are usually assumed to be PCR
artifacts. In RNAseq, though, genes vary in their expression level,
and a highly expressed gene can legitimately produce identical reads,
so removing all of them probably isn't a given.

That said, I have been removing them anyway. I think I've seen some
references to keeping only N reads that map to the same place, where
N seems to be chosen arbitrarily and applied globally.
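
To make the "keep at most N" idea concrete, here is a minimal sketch in
Python. It assumes each read is already reduced to a (chrom, strand,
start) tuple; the function name cap_duplicates and its parameter
max_per_locus are made up for illustration and don't come from any
package.

```python
from collections import defaultdict

def cap_duplicates(reads, max_per_locus=1):
    """Keep at most `max_per_locus` reads per (chrom, strand, start).

    `reads` is an iterable of (chrom, strand, start) tuples. This tuple
    layout is an assumption for the sketch, not a real file format.
    """
    seen = defaultdict(int)   # locus -> number of reads kept so far
    kept = []
    for read in reads:
        locus = (read[0], read[1], read[2])
        if seen[locus] < max_per_locus:
            seen[locus] += 1
            kept.append(read)
    return kept

# Three reads stacked at position 100, one read at position 200.
reads = [("chr1", "+", 100)] * 3 + [("chr1", "+", 200)]
capped = cap_duplicates(reads, max_per_locus=2)
```

With max_per_locus=1 this is plain duplicate removal; anything larger
is the global N described above.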

I guess it would make more sense to determine N on a gene-by-gene
basis, perhaps by quantifying each gene's expression from its
uniquely-appearing reads.
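
One possible way to pick a per-gene N, sketched under assumptions of
my own choosing: treat the number of distinct read start positions per
base as a rough Poisson rate, and take N as the smallest stack height
that would be exceeded with probability below some cutoff alpha. The
Poisson model, the names per_gene_cap and poisson_sf, and the default
alpha are all illustrative, not an established method. Note that the
rate from unique starts can never exceed 1 per base, so it will
underestimate very highly expressed genes.

```python
import math

def poisson_sf(k, lam):
    """P(X > k) for X ~ Poisson(lam), computed from the CDF."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i)
              for i in range(k + 1))
    return 1.0 - cdf

def per_gene_cap(n_unique_starts, gene_length, alpha=1e-3, n_max=1000):
    """Smallest N with P(stack height > N) <= alpha for this gene.

    The per-base rate is estimated from distinct start positions only
    (an assumption of this sketch), so duplicates do not inflate it.
    """
    if gene_length <= 0:
        raise ValueError("gene_length must be positive")
    lam = n_unique_starts / gene_length
    n = 1
    while n < n_max and poisson_sf(n, lam) > alpha:
        n += 1
    return n

low_cap = per_gene_cap(10, 1000)    # sparsely covered gene
high_cap = per_gene_cap(900, 1000)  # densely covered gene
```

A sparsely covered gene ends up with a cap of 1, while a densely
covered one is allowed taller stacks before reads get dropped.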

So, I'm just curious if/how you folks are tackling this issue.

Thanks,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact