[Bioc-sig-seq] Low-complexity read filtering/trimming [PolyA removal]

Thu Mar 12 16:50:10 CET 2009

Hi,

I am writing to ask whether there is a usable poly-A removal function for
discarding reads in which every base is an A. From what I understand,
these arise from a constant intensity originating from a speck or from
the edges of the lane.

I will search for the same, but I am also looking for a start-up set of
commands to load the requisite libraries along with ShortRead to get
started on this analysis.
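
A minimal sketch of such a start-up, assuming the reads are in a FASTQ
file. The file name "reads.fastq", the helper 'base_fraction' and the 90%
cutoff are illustrative assumptions, not part of ShortRead. The filtering
itself is plain base R on a character vector, so it runs without any
packages; ShortRead's own 'polynFilter' (see ?srFilter) may cover the same
case directly, depending on the installed version.

```r
# With ShortRead installed, the reads would typically come from something like:
#   library(ShortRead)                  # Biostrings is attached as a dependency
#   fq    <- readFastq("reads.fastq")   # hypothetical file name
#   reads <- as.character(sread(fq))
# A small character vector stands in for them here.
reads <- c("AAAAAAAAAAAAAAAAAAAA",
           "ACGTTGCAAGGCTAACGTGA",
           "AAAAAAAAANAAAAAAAAAA")

# Fraction of each read made up of a single nucleotide (base R only).
base_fraction <- function(x, base = "A") {
  n_base <- nchar(x) - nchar(gsub(base, "", x, fixed = TRUE))
  n_base / nchar(x)
}

# Keep reads that are less than 90% A; the cutoff is an arbitrary choice.
keep     <- base_fraction(reads, "A") < 0.9
filtered <- reads[keep]
```

Using a fraction rather than an exact all-A match also drops reads that
are poly-A apart from one or two miscalled bases.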

Cheers,
Sumit

-----Original Message-----
From: bioc-sig-sequencing-bounces at r-project.org
[mailto:bioc-sig-sequencing-bounces at r-project.org] On Behalf Of Cei
Abreu-Goodger
Sent: Sunday, February 22, 2009 6:23 PM
To: bioc-sig-sequencing at r-project.org
Subject: [Bioc-sig-seq] Low-complexity read filtering/trimming

Hi all,

I've been playing around with some Solexa small-RNA reads using 
ShortRead and Biostrings. I've used the 'trimLRPatterns' function to 
remove adapter sequence, and I've been trying to remove low-complexity 
sequences with 'srFilter'. I would first really like to congratulate all
the people involved for the great work. There are two situations in 
which I would be grateful for some suggestions, though:

1) I have many "low-complexity" reads. Some are simply polyA, polyC, 
etc. But some others are runs of "ATATAT" or "CACACACA", etc. Previously
I would have used "dust" on the command line to filter out this kind of 
read in a fasta file. Any ideas on how to achieve similar functionality 
in the ShortRead world?

2) For some reads I may have an "N-rich" patch inside the read, for
example:
AATAAAGTGCTTACAGTGNNNNTNNATNCAATACCG

I would ideally like to trim off everything starting at the "N-rich" 
part. I was trying to implement something with 'vmatchPattern', but if I
allow for mismatches (for a more flexible search) I will also get hits 
starting before the run of Ns.
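
For the low-complexity reads in 1), one possible approach is a
trinucleotide-based score in the spirit of "dust": count every overlapping
triplet in a read and penalise triplets that recur. The sketch below is a
simplified, whole-read version written in base R; the function name
'dust_like_score' and the cutoff of 1 are assumptions for illustration,
and it is not a re-implementation of the command-line tool. Recent
ShortRead releases also export a 'dustyScore' function, which may be a
better fit if the installed version has it.

```r
# Simplified DUST-style score: sum over triplets of c * (c - 1) / 2,
# divided by (number of triplets - 1). Higher means lower complexity.
dust_like_score <- function(x) {
  vapply(x, function(s) {
    n <- nchar(s)
    if (n < 4) return(0)                      # too short to score
    starts   <- seq_len(n - 2)
    triplets <- substring(s, starts, starts + 2)
    counts   <- table(triplets)
    sum(counts * (counts - 1) / 2) / (length(triplets) - 1)
  }, numeric(1), USE.NAMES = FALSE)
}

reads <- c("ATATATATATATATATATAT",   # dinucleotide repeat
           "CACACACACACACACACACA",   # dinucleotide repeat
           "AAAAAAAAAAAAAAAAAAAA",   # homopolymer
           "ACGTTGCAAGGCTAACGTGA")   # mixed sequence

scores   <- dust_like_score(reads)
filtered <- reads[scores < 1]        # cutoff chosen by eye, not a standard value
```

Because the score grows with read length, a cutoff tuned on 20-mers would
need revisiting for longer reads.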

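For the "N-rich" patch in 2), a sliding window avoids the problem of
mismatch-tolerant hits starting before the Ns: find the first window of a
given width that holds at least a given number of Ns, then cut at the
first N inside that window. This is a base R sketch on character vectors;
'trim_n_rich' and its defaults (width 5, at least 3 Ns) are hypothetical
choices. A DNAStringSet could be passed through as.character() first and
rebuilt with DNAStringSet() afterwards.

```r
trim_n_rich <- function(x, width = 5, min_n = 3) {
  vapply(x, function(s) {
    is_n <- strsplit(s, "", fixed = TRUE)[[1]] == "N"
    n    <- length(is_n)
    if (n < width) return(s)                  # read shorter than one window
    # Number of N's in every window of `width` consecutive bases.
    csum <- c(0, cumsum(is_n))
    win  <- csum[(width + 1):(n + 1)] - csum[1:(n - width + 1)]
    hit  <- which(win >= min_n)
    if (length(hit) == 0) return(s)           # no N-rich window: keep as is
    # Cut at the first N inside the first qualifying window.
    first_n <- hit[1] - 1 + which(is_n[hit[1]:(hit[1] + width - 1)])[1]
    substr(s, 1, first_n - 1)
  }, character(1), USE.NAMES = FALSE)
}

reads <- c("AATAAAGTGCTTACAGTGNNNNTNNATNCAATACCG",   # example from above
           "ACGTNACGTACGTACGT")                      # single N, left untouched
trimmed <- trim_n_rich(reads)
```

Isolated Ns are left in place, so a read with a single uncalled base is not
shortened.
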
Many thanks,

Cei

sessionInfo()

R version 2.9.0 Under development (unstable) (2009-02-13 r47919)
i386-apple-darwin9.6.0

locale:
C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] ShortRead_1.1.39   lattice_0.17-20    BSgenome_1.11.9    Biostrings_2.11.28
[5] IRanges_1.1.38     Biobase_2.3.10

loaded via a namespace (and not attached):
[1] Matrix_0.999375-20 grid_2.9.0

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing