[Bioc-sig-seq] Adapter removal
Patrick Aboyoun
paboyoun at fhcrc.org
Thu Jul 17 21:38:57 CEST 2008
Krys,
Building up on Martin's suggestion, I also recommend you take a look at
the new pairwiseAlignment function in the development version of
Biostrings. Since it uses an O(N^2) dynamic programming algorithm it
isn't necessarily suited for genome wide analysis, but it is perfectly
suited for aligning short strings and filtered out short reads by
adapter strings. The pairwiseAlignment function can fit global, local,
and ends-free (overlap) alignments with or without quality-based
substitution penalties and with or without indels.
Below is a sample script run that uses pairwiseAlignment to filter out
short reads based on the adapter. This script looks at four different
prefix cutoff levels (8, 16, 24, 36) that you could use to filter your
data and presents the results you would get when you use the equivalent
to a perfect match to the first 8 characters as your cutoff.
Patrick
> library(ShortRead)
> sp <- SolexaPath("/path/to/solexa/expt")
> pat <- "s_1_sequence.txt"
> shortReads <- readFastq(analysisPath(sp), pattern = pat)
> cleanedShortReads <- clean(shortReads) # no reads with uncalled bases
> cleanedShortReads
class: ShortReadQ
length: 3506447 reads; width: 35 cycles
> adapter <- DNAString("GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA")
> system.time({
+ adapterScores <-
+ pairwiseAlignment(pattern = sread(cleanedShortReads), subject =
adapter,
+ patternQuality =
quality(quality(cleanedShortReads)),
+ subjectQuality = 99L,
+ qualityType = "Solexa",
+ type = "overlap",
+ scoreOnly = TRUE)
+ })
user system elapsed
92.834 2.040 94.873
> firstNChar <- c(8, 16, 24, 32)
> cutoffScores <- c(0, 0, 0, 0)
> names(cutoffScores) <- firstNChar
> for (i in seq_len(length(cutoffScores))) {
+ print(system.time({
+ cutoffScores[i] <-
+ max(pairwiseAlignment(pattern = sread(cleanedShortReads),
+ subject = substring(adapter, 1, firstNChar[i]),
+ patternQuality =
quality(quality(cleanedShortReads)),
+ subjectQuality = 99L,
+ qualityType = "Solexa",
+ type = "overlap",
+ scoreOnly = TRUE))
+ }))
+ }
user system elapsed
16.901 0.480 17.381
user system elapsed
36.886 0.804 37.687
user system elapsed
58.692 1.412 60.117
user system elapsed
75.805 1.812 77.617
> cutoffScores
8 16 24 32
15.96356 31.92712 47.89068 63.82726
> quantile(adapterScores, seq(0.9, 1, 0.005))
90% 90.5% 91% 91.5% 92% 92.5%
93% 93.5%
9.909755 11.882821 13.878695 17.341880 21.739067 35.019755 50.056933
55.612625
94% 94.5% 95% 95.5% 96% 96.5%
97% 97.5%
57.737140 58.951218 59.979969 60.780243 61.931897 62.586998 63.861232
65.047516
98% 98.5% 99% 99.5% 100%
66.623512 68.858673 69.453423 69.661290 69.773109
> t(do.call("rbind", lapply(cutoffScores, function(x)
table(adapterScores > x))))
8 16 24 32
FALSE 3206960 3240999 3257733 3400456
TRUE 299487 265448 248714 105991
> tables(cleanedShortReads[adapterScores >= cutoffScores["8"]], n =
10)[["top"]]
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGA
63558 62264
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTTGA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGA
17832 15939
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTATA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTTAA
11554 7391
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGTA
6452 6343
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTTTA GATCGGAAGAGCTCGTATGACGTCTTCTGCTTAGA
5724 3878
> sessionInfo()
R version 2.8.0 Under development (unstable) (2008-07-07 r46041)
x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
attached base packages:
[1] tools stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] ShortRead_0.1.32 lattice_0.17-10 Biostrings_2.9.42 Biobase_2.1.7
loaded via a namespace (and not attached):
[1] grid_2.8.0
Martin Morgan wrote:
> "Krys Kelly" <kak28 at cam.ac.uk> writes:
>
>
>> I have inherited a pipeline for Solexa sequence data using Perl, Bioperl,
>> SSAHA and mySQL. As an R/Bioconducter user I am interested in ShortRead and
>> BiostringsCinterfaceDemo.
>>
>> However, in the short term I need to use the current pipeline. The imaging
>> is done by the Sequencing Facility and we get fastq files with the 3'
>> adapter still attached. The adapter removal is currently done by a Perl
>> script which just keeps sequences which match any number of letters in
>> [ACGT] followed by the first 8 letters of the adapter. This seems pretty
>> crude (e.g. only using 8 letters, not allowing for mismatches, not allowing
>> for the diminishing quality along the length of the read).
>>
>
> In the very latest Biostrings, and thanks to Herve's skills, you can
>
>
>> library(ShortRead)
>> sp <- SolexaPath("/path/to/solexa/expt")
>> pat <- "s_1_0001_seq.txt" # or any regexp to read multiple files
>> seq <- readFastq(analysisPath(sp), patreadXStringColumns(baseCallPath(sp), pat,
>>
> + list(NULL, NULL, NULL, NULL, "DNAString"))[[1]]
>
>> seq <- clean(seq) # no reads with uncalled bases
>>
>
> (or readFastq on analysisPath(sp), "s_1_sequence.txt", for instance,
> but these are 'filtered' and it's not really clear that that is the
> right place for a careful analysis to start), and then
>
>
>> tables(seq)$top[1:5]
>>
> GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGAT
> 186 126
> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGAT
> 71 51
> GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGAA
> 16
> to remind me what the primer sequence looks like (!) and
>
>
>> pd <- PDict(seq, max.mismatch=3)
>> subj <- DNAString("GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA")
>> cnts <- countPDict(pd, subj, max.mismatch=3)
>>
>
> cnts now contains the number of times each read in seq matched the
> primer with up to 3 mismatches (could be more, within reason), of
> which there are
>
>
>> sum(cnts!=0)
>>
> [1] 565
>
> seq[cnts==0] gets me the non-primer sequences (though apparently this
> is not entirely satisfactory, see the second entry!)
>
>
>> tables(seq[cnts==0])$top[1:5]
>>
> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GATCGGAAGAGCTCGGTATGCCGTCTTCTGCTTAGA
> 71 12
> GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG
> 5 5
> CACCTTTTTCAGTTTTCCTCGCCATATTTCACGTCC
> 4
>
> sessionInfo() tells me
>
> other attached packages:
> [1] ShortRead_0.1.32 lattice_0.17-10 Biostrings_2.9.42 Biobase_2.1.7
>
>
>> Google has not revealed any algorithms or code for this part of the
>> pipeline. Does anyone know what algorithms are being used or, even better,
>> could anyone point me in the direction of some code?
>>
>
> I guess you're asking in general. In terms of ELAND, I'm not really
> sure what it does about primers; I think it actually keeps them in,
> relying on failure to align to weed them out. With contaminants (which
> can be specified in a separate file) ELAND aligns the reads to the
> contaminants and discards matches. The code is 'Solexa Open Source' or
> some such, and should be available from your friendly Solexa rep; the
> contaminant stuff is in Gerald/GERALD.pl of the version of the code I
> have access to.
>
> There are some obvious additional problems, e.g., those poly-A
> reads. Tools in Biostrings (e.g., alphabetFrequency) can be used in
> these cases, too.
>
> All of this seems pretty ad hoc, which is not really what you were
> looking for, I suspect.
>
> Martin
>
>
>> Thanks
>>
>> Krys
>>
>>
>> Dr Krystyna A Kelly
>> Senior Research Associate
>> David Baulcombe Group
>>
>> Department of Plant Sciences
>> University of Cambridge
>> Downing Street
>> Cambridge CB2 3EA
>> United Kingdom
>>
>> Tel: +44 (0)1223 333 915
>> Fax: +44 (0)1223 333953
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>
>
More information about the Bioc-sig-sequencing
mailing list