[Bioc-sig-seq] Adapter removal
Martin Morgan
mtmorgan at fhcrc.org
Thu Jul 17 18:35:12 CEST 2008
"Krys Kelly" <kak28 at cam.ac.uk> writes:
> I have inherited a pipeline for Solexa sequence data using Perl, Bioperl,
> SSAHA and mySQL. As an R/Bioconducter user I am interested in ShortRead and
> BiostringsCinterfaceDemo.
>
> However, in the short term I need to use the current pipeline. The imaging
> is done by the Sequencing Facility and we get fastq files with the 3'
> adapter still attached. The adapter removal is currently done by a Perl
> script which just keeps sequences which match any number of letters in
> [ACGT] followed by the first 8 letters of the adapter. This seems pretty
> crude (e.g. only using 8 letters, not allowing for mismatches, not allowing
> for the diminishing quality along the length of the read).
In the very latest Biostrings, and thanks to Herve's skills, you can
> library(ShortRead)
> sp <- SolexaPath("/path/to/solexa/expt")
> pat <- "s_1_0001_seq.txt" # or any regexp to read multiple files
> seq <- readFastq(analysisPath(sp), patreadXStringColumns(baseCallPath(sp), pat,
+ list(NULL, NULL, NULL, NULL, "DNAString"))[[1]]
> seq <- clean(seq) # no reads with uncalled bases
(or readFastq on analysisPath(sp), "s_1_sequence.txt", for instance,
but these are 'filtered' and it's not really clear that that is the
right place for a careful analysis to start), and then
> tables(seq)$top[1:5]
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGAT
186 126
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGAT
71 51
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGAA
16
to remind me what the primer sequence looks like (!) and
> pd <- PDict(seq, max.mismatch=3)
> subj <- DNAString("GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA")
> cnts <- countPDict(pd, subj, max.mismatch=3)
cnts now contains the number of times each read in seq matched the
primer with up to 3 mismatches (could be more, within reason), of
which there are
> sum(cnts!=0)
[1] 565
seq[cnts==0] gets me the non-primer sequences (though apparently this
is not entirely satisfactory, see the second entry!)
> tables(seq[cnts==0])$top[1:5]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GATCGGAAGAGCTCGGTATGCCGTCTTCTGCTTAGA
71 12
GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG
5 5
CACCTTTTTCAGTTTTCCTCGCCATATTTCACGTCC
4
sessionInfo() tells me
other attached packages:
[1] ShortRead_0.1.32 lattice_0.17-10 Biostrings_2.9.42 Biobase_2.1.7
> Google has not revealed any algorithms or code for this part of the
> pipeline. Does anyone know what algorithms are being used or, even better,
> could anyone point me in the direction of some code?
I guess you're asking in general. In terms of ELAND, I'm not really
sure what it does about primers; I think it actually keeps them in,
relying on failure to align to weed them out. With contaminants (which
can be specified in a separate file) ELAND aligns the reads to the
contaminants and discards matches. The code is 'Solexa Open Source' or
some such, and should be available from your friendly Solexa rep; the
contaminant stuff is in Gerald/GERALD.pl of the version of the code I
have access to.
There are some obvious additional problems, e.g., those poly-A
reads. Tools in Biostrings (e.g., alphabetFrequency) can be used in
these cases, too.
All of this seems pretty ad hoc, which is not really what you were
looking for, I suspect.
Martin
> Thanks
>
> Krys
>
>
> Dr Krystyna A Kelly
> Senior Research Associate
> David Baulcombe Group
>
> Department of Plant Sciences
> University of Cambridge
> Downing Street
> Cambridge CB2 3EA
> United Kingdom
>
> Tel: +44 (0)1223 333 915
> Fax: +44 (0)1223 333953
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793
More information about the Bioc-sig-sequencing
mailing list