[Bioc-sig-seq] Low-complexity read filtering/trimming [PolyA removal]

Thu Mar 12 18:49:22 CET 2009

Sumit,
The ShortRead package uses a convention where filters can be used to 
weed out unwanted data. One of the filters is the polynFilter, which 
filters out reads with excessive amounts of the selected nucleotides. 
There is an unfortunate bug in polynFilter when only one nucleotide type 
is chosen, but I just fixed it in the svn repository and the fix will become 
available on bioconductor.org in a day or so. Here is an example of 
filtering out reads with 32 or more A's in them using the polynFilter 
function (this operation filtered out 2 reads with the example data):

 > suppressMessages(library(ShortRead))
 > sp <- SolexaPath(system.file("extdata", package="ShortRead"))
 > aln <- readAligned(sp, "s_2_export.txt") # Solexa export file, as example
 > polyAFilt <- polynFilter(threshold = 32, nuc = "A")
 > aln
class: AlignedRead
length: 1000 reads; width: 35 cycles
chromosome: NM NM ... chr5.fa 29:255:255
position: NA NA ... 71805980 NA
strand: NA NA ... + NA
alignQuality: NumericQuality
alignData varLabels: run lane ... y filtering
 > aln[polyAFilt(aln)]
class: AlignedRead
length: 998 reads; width: 35 cycles
chromosome: NM NM ... chr5.fa 29:255:255
position: NA NA ... 71805980 NA
strand: NA NA ... + NA
alignQuality: NumericQuality
alignData varLabels: run lane ... y filtering
 > sessionInfo()
R version 2.9.0 Under development (unstable) (2009-02-23 r47990)
i386-apple-darwin9.6.0

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] ShortRead_1.1.44   lattice_0.17-20    BSgenome_1.11.11   Biostrings_2.11.39
[5] IRanges_1.1.47   

loaded via a namespace (and not attached):
[1] Biobase_2.3.10 grid_2.9.0     hwriter_1.1  
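
To make the idea behind the filter concrete, here is a minimal base-R sketch 
of the same kind of test: count one nucleotide per read and keep only the 
reads below a cutoff. This is an illustration, not ShortRead's 
implementation; the toy reads, the helper count_nuc, and the cutoff of 32 
are all assumptions made for the example.

```r
# Toy reads, 35 bases each (hypothetical data, not the ShortRead example file)
reads <- c(
  polyA   = strrep("A", 35),
  mostlyA = paste0(strrep("A", 32), "CGT"),
  mixed   = "AATAAAGTGCTTACAGTGACGTTGCATGCAATACC"
)

# Hypothetical helper: count occurrences of a single nucleotide in each read
# by comparing the read length before and after deleting that nucleotide.
count_nuc <- function(x, nuc = "A") {
  nchar(x) - nchar(gsub(nuc, "", x, fixed = TRUE))
}

a_counts <- count_nuc(reads, "A")

# Keep reads with fewer than 32 A's, i.e. drop reads with 32 or more
keep <- a_counts < 32
filtered <- reads[keep]
```

With real data the logical vector would be used to subset the object 
holding the reads, just as aln[polyAFilt(aln)] does above.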

Patrick

Middha, Sumit wrote:
> Hi,
>
> I am writing to check whether there is a usable poly-A removal function to
> remove the reads where all bases are A's. From what I understand,
> this happens because of a constant intensity originating from a speck or
> the edges of the lane.
>
> I will search for the same, but I am also looking for a start-up set of
> commands to load the requisite libraries along with ShortRead to get
> started on this analysis.
>
> Cheers,
> Sumit
>
> -----Original Message-----
> From: bioc-sig-sequencing-bounces at r-project.org
> [mailto:bioc-sig-sequencing-bounces at r-project.org] On Behalf Of Cei
> Abreu-Goodger
> Sent: Sunday, February 22, 2009 6:23 PM
> To: bioc-sig-sequencing at r-project.org
> Subject: [Bioc-sig-seq] Low-complexity read filtering/trimming
>
> Hi all,
>
> I've been playing around with some Solexa small-RNA reads using 
> ShortRead and Biostrings. I've used the 'trimLRPatterns' function to 
> remove adapter sequence, and I've been trying to remove low-complexity 
> sequences with 'srFilter'. I would first really like to congratulate all
> the people involved for the great work. There are two situations in 
> which I would be grateful for some suggestions, though:
>
> 1) I have many "low-complexity" reads. Some are simply polyA, polyC, 
> etc. But some others are runs of "ATATAT" or "CACACACA", etc. Previously
> I would have used "dust" on the command line to filter out this kind of 
> read in a fasta file. Any ideas on how to achieve similar functionality 
> in the ShortRead world?
>
> 2) For some reads I may have a "N-rich" patch inside the read, for
> example:
> AATAAAGTGCTTACAGTGNNNNTNNATNCAATACCG
>
> I would ideally like to trim off everything starting at the "N-rich" 
> part. I was trying to implement something with 'vmatchPattern', but if I
> allow for mismatches (for a more flexible search) I will also get hits 
> starting before the run of Ns.
>
> Many thanks,
>
> Cei
>
>
>
> sessionInfo()
>
> R version 2.9.0 Under development (unstable) (2009-02-13 r47919)
> i386-apple-darwin9.6.0
>
> locale:
> C
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
>
> other attached packages:
> [1] ShortRead_1.1.39   lattice_0.17-20    BSgenome_1.11.9    Biostrings_2.11.28
> [5] IRanges_1.1.38     Biobase_2.3.10
>
> loaded via a namespace (and not attached):
> [1] Matrix_0.999375-20 grid_2.9.0
>
>
>