[Bioc-sig-seq] Low-complexity read filtering/trimming

Mon Feb 23 21:59:46 CET 2009

Hi Cei,

Here is a suggestion for question 1).

DUST seems to have been designed to mask regions in long DNA sequences.
They count tri-nucleotide frequencies in a 64-nucleotide sliding windows
so, strictly speaking, the algo doesn't work anymore with short reads.
In addition your problem seems simpler to me, because, if I understand
correctly, you don't want to mask regions within the reads, but just
want to keep or drop the entire read based on its "complexity".
The notion of "complexity" as defined by DUST could be reused though, by
applying trinucleotideFrequency() to the DNAStringSet object containing
the reads, replacing all coefficients in the resulting matrix by its
square, applying rowSums() to it, and finally using the resulting score
to filter the reads (high score meaning low complexity):

   library(ShortRead)
   rfq <- readFastq("some_path", pattern="s_1_sequence.txt")
   dict0 <- sread(rfq)

   tnf <- trinucleotideFrequency(dict0)
   scores <- rowSums(tnf * tnf)

   table(scores)  # choose a cut-off value
   clean_dict <- dict0[scores < cut_off]

There is one technical problem though: the top-level loop in
trinucleotideFrequency() is still written in R so that won't work
for millions of reads. The temporary workaround is to implement
a vcountPDict-based version of trinucleotideFrequency():

   trinucleotideFrequency2 <- function(x)
   {
     all_triplets <- DNAStringSet(mkAllStrings(c("A", "C", "G", "T"), 3))
     pd <- PDict(all_triplets)
     t(vcountPDict(pd, x))
   }

trinucleotideFrequency2() will take about 20 seconds on a
4-million 35-mers DNAStringSet object.

I calculated the scores for a set of 4.47 millions unaligned
35-mers from a single lane of a real Solexa experiment and got:

   > summary(scores)
      Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     22.00   49.00   55.00   66.95   63.00 1089.00

   > dict0[scores >= 200]
     A DNAStringSet instance of length 63331
           width seq
       [1]    35 AAAAAACCAAAACCAAACAAATAAAAACCCCAAAT
       [2]    35 GATAGATAGATAGATAGATAGATAGGGTAGATAGG
       [3]    35 GTGTGTGTGTGTGTGTGTGTATGTGTGTGTGCGTT
       [4]    35 GGTAGGTAGGTAGGTCAGGTAGGTAGGTAGGTAGG
       [5]    35 TGGTTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
       [6]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
       [7]    35 GTGTGTCTGTGTGTCTGTGTGTCTGTGTGTGTCTG
       [8]    35 GTATGTGTGCTTTGTGTGTGTGGTGTGTGTGTGGT
       [9]    35 GAGAGTGTGTGTGTGAGAGAGAGTGTGTGTGTGAG
       ...   ... ...
   [63323]    35 AAAAAAAAAAAAAAAAAAAAACGAAGAAAAAAAAG
   [63324]    35 AAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAACA
   [63325]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
   [63326]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
   [63327]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
   [63328]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
   [63329]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
   [63330]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
   [63331]    35 AAAAAAAGAAAAAAAAAAAAAAAAGAGAAAAAAAA

Note that 1089 is the score obtained by the poly-As, poly-Cs,
etc... (1089 = 33^2, this is the highest possible score for a
35-mer).

H.

Cei Abreu-Goodger wrote:
> Hi all,
> 
> I've been playing around with some Solexa small-RNA reads using 
> ShortRead and Biostrings. I've used the 'trimLRPatterns' function to 
> remove adapter sequence, and I've been trying to remove low-complexity 
> sequences with 'srFilter'. I would first really like to congratulate all 
> the people involved for the great work. There are two situations in 
> which I would be grateful for some suggestions, though:
> 
> 1) I have many "low-complexity" reads. Some are simply polyA, polyC, 
> etc. But some others are runs of "ATATAT" or "CACACACA", etc. Previously 
> I would have used "dust" on the command line to filter out this kind of 
> read in a fasta file. Any ideas on how to achieve similar functionality 
> in the ShortRead world?
> 
> 2) For some reads I may have a "N-rich" patch inside the read, for example:
> AATAAAGTGCTTACAGTGNNNNTNNATNCAATACCG
> 
> I would ideally like to trim of everything starting at the "N-rich" 
> part. I was trying to implement something with 'vmatchPattern', but if I 
> allow for mismatches (for a more flexible search) I will also get hits 
> starting before the run of Ns.
> 
> Many thanks,
> 
> Cei
> 
> 
> 
> sessionInfo()
> 
> R version 2.9.0 Under development (unstable) (2009-02-13 r47919)
> i386-apple-darwin9.6.0
> 
> locale:
> C
> 
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
> 
> other attached packages:
> [1] ShortRead_1.1.39   lattice_0.17-20    BSgenome_1.11.9 
> Biostrings_2.11.28
> [5] IRanges_1.1.38     Biobase_2.3.10
> 
> loaded via a namespace (and not attached):
> [1] Matrix_0.999375-20 grid_2.9.0
> 
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319