[Bioc-sig-seq] Adapter removal

Thu Jul 17 21:38:57 CEST 2008

Krys,
Building up on Martin's suggestion, I also recommend you take a look at 
the new pairwiseAlignment function in the development version of 
Biostrings. Since it uses an O(N^2) dynamic programming algorithm it 
isn't necessarily suited for genome wide analysis, but it is perfectly 
suited for aligning short strings and filtered out short reads by 
adapter strings. The pairwiseAlignment function can fit global, local, 
and ends-free (overlap) alignments with or without quality-based 
substitution penalties and with or without indels.

Below is a sample script run that uses pairwiseAlignment to filter out 
short reads based on the adapter. This script looks at four different 
prefix cutoff levels (8, 16, 24, 36) that you could use to filter your 
data and presents the results you would get when you use the equivalent 
to a perfect match to the first 8 characters as your cutoff.

Patrick

 > library(ShortRead)
 > sp <- SolexaPath("/path/to/solexa/expt")
 > pat <- "s_1_sequence.txt"
 > shortReads <- readFastq(analysisPath(sp), pattern = pat)
 > cleanedShortReads <- clean(shortReads) # no reads with uncalled bases
 > cleanedShortReads
class: ShortReadQ
length: 3506447 reads; width: 35 cycles
 > adapter <- DNAString("GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA")
 > system.time({
+   adapterScores <-
+     pairwiseAlignment(pattern = sread(cleanedShortReads), subject = 
adapter,
+                       patternQuality = 
quality(quality(cleanedShortReads)),
+                       subjectQuality = 99L,
+                       qualityType = "Solexa",
+                       type = "overlap",
+                       scoreOnly = TRUE)
+ })
   user  system elapsed
 92.834   2.040  94.873
 > firstNChar <- c(8, 16, 24, 32)
 > cutoffScores <- c(0, 0, 0, 0)
 > names(cutoffScores) <- firstNChar
 > for (i in seq_len(length(cutoffScores))) {
+ print(system.time({
+   cutoffScores[i] <-
+     max(pairwiseAlignment(pattern = sread(cleanedShortReads),
+                           subject = substring(adapter, 1, firstNChar[i]),
+                           patternQuality = 
quality(quality(cleanedShortReads)),
+                           subjectQuality = 99L,
+                           qualityType = "Solexa",
+                           type = "overlap",
+                           scoreOnly = TRUE))
+ }))
+ }
   user  system elapsed
 16.901   0.480  17.381
   user  system elapsed
 36.886   0.804  37.687
   user  system elapsed
 58.692   1.412  60.117
   user  system elapsed
 75.805   1.812  77.617
 > cutoffScores
       8       16       24       32
15.96356 31.92712 47.89068 63.82726
 > quantile(adapterScores, seq(0.9, 1, 0.005))
      90%     90.5%       91%     91.5%       92%     92.5%       
93%     93.5%
 9.909755 11.882821 13.878695 17.341880 21.739067 35.019755 50.056933 
55.612625
      94%     94.5%       95%     95.5%       96%     96.5%       
97%     97.5%
57.737140 58.951218 59.979969 60.780243 61.931897 62.586998 63.861232 
65.047516
      98%     98.5%       99%     99.5%      100%
66.623512 68.858673 69.453423 69.661290 69.773109
 > t(do.call("rbind", lapply(cutoffScores, function(x) 
table(adapterScores > x))))
            8      16      24      32
FALSE 3206960 3240999 3257733 3400456
TRUE   299487  265448  248714  105991
 > tables(cleanedShortReads[adapterScores >= cutoffScores["8"]], n = 
10)[["top"]]
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGA
                              63558                               62264
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTTGA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGA
                              17832                               15939
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTATA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTTAA
                              11554                                7391
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGTA
                               6452                                6343
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTTTA GATCGGAAGAGCTCGTATGACGTCTTCTGCTTAGA
                               5724                                3878
 > sessionInfo()
R version 2.8.0 Under development (unstable) (2008-07-07 r46041)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods 
[8] base    

other attached packages:
[1] ShortRead_0.1.32  lattice_0.17-10   Biostrings_2.9.42 Biobase_2.1.7   

loaded via a namespace (and not attached):
[1] grid_2.8.0

Martin Morgan wrote:
> "Krys Kelly" <kak28 at cam.ac.uk> writes:
>
>   
>> I have inherited a pipeline for Solexa sequence data using Perl, Bioperl,
>> SSAHA and mySQL.  As an R/Bioconducter user I am interested in ShortRead and
>> BiostringsCinterfaceDemo.
>>
>> However, in the short term I need to use the current pipeline.  The imaging
>> is done by the Sequencing Facility and we get fastq files with the 3'
>> adapter still attached. The adapter removal is currently done by a Perl
>> script which just keeps sequences which match any number of letters in
>> [ACGT] followed by the first 8 letters of the adapter.  This seems pretty
>> crude (e.g. only using 8 letters, not allowing for mismatches, not allowing
>> for the diminishing quality along the length of the read).
>>     
>
> In the very latest Biostrings, and thanks to Herve's skills, you can
>
>   
>> library(ShortRead)
>> sp <- SolexaPath("/path/to/solexa/expt")
>> pat <- "s_1_0001_seq.txt" # or any regexp to read multiple files
>> seq <- readFastq(analysisPath(sp), patreadXStringColumns(baseCallPath(sp), pat,
>>     
> +                           list(NULL, NULL, NULL, NULL, "DNAString"))[[1]]
>   
>> seq <- clean(seq) # no reads with uncalled bases
>>     
>
> (or readFastq on analysisPath(sp), "s_1_sequence.txt", for instance,
> but these are 'filtered' and it's not really clear that that is the
> right place for a careful analysis to start), and then
>
>   
>> tables(seq)$top[1:5]
>>     
> GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGAT 
>                                  186                                  126 
> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGAT 
>                                   71                                   51 
> GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGAA 
>                                   16 
> to remind me what the primer sequence looks like (!) and
>
>   
>> pd <- PDict(seq, max.mismatch=3)
>> subj <- DNAString("GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA")
>> cnts <- countPDict(pd, subj, max.mismatch=3)
>>     
>
> cnts now contains the number of times each read in seq matched the
> primer with up to 3 mismatches (could be more, within reason), of
> which there are
>
>   
>> sum(cnts!=0)
>>     
> [1] 565
>
> seq[cnts==0] gets me the non-primer sequences (though apparently this
> is not entirely satisfactory, see the second entry!)
>
>   
>> tables(seq[cnts==0])$top[1:5]
>>     
> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GATCGGAAGAGCTCGGTATGCCGTCTTCTGCTTAGA 
>                                   71                                   12 
> GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG 
>                                    5                                    5 
> CACCTTTTTCAGTTTTCCTCGCCATATTTCACGTCC 
>                                    4 
>
> sessionInfo() tells me
>
> other attached packages:
> [1] ShortRead_0.1.32  lattice_0.17-10   Biostrings_2.9.42 Biobase_2.1.7    
>
>   
>> Google has not revealed any algorithms or code for this part of the
>> pipeline.  Does anyone know what algorithms are being used or, even better,
>> could anyone point me in the direction of some code?
>>     
>
> I guess you're asking in general. In terms of ELAND, I'm not really
> sure what it does about primers; I think it actually keeps them in,
> relying on failure to align to weed them out. With contaminants (which
> can be specified in a separate file) ELAND aligns the reads to the
> contaminants and discards matches. The code is 'Solexa Open Source' or
> some such, and should be available from your friendly Solexa rep; the
> contaminant stuff is in Gerald/GERALD.pl of the version of the code I
> have access to.
>
> There are some obvious additional problems, e.g., those poly-A
> reads. Tools in Biostrings (e.g., alphabetFrequency) can be used in
> these cases, too.
>
> All of this seems pretty ad hoc, which is not really what you were
> looking for, I suspect.
>
> Martin
>
>   
>> Thanks
>>
>> Krys
>>
>>
>> Dr Krystyna A Kelly
>> Senior Research Associate
>> David Baulcombe Group
>>
>> Department of Plant Sciences
>> University of Cambridge
>> Downing Street
>> Cambridge CB2 3EA
>> United Kingdom
>>
>> Tel: +44 (0)1223 333 915
>> Fax: +44 (0)1223 333953
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>     
>
>