[Bioc-sig-seq] adapter removal
Patrick Aboyoun
paboyoun at fhcrc.org
Sun Jan 18 04:24:56 CET 2009
Joe,
I have been making some modifications to trimLRPatterns both today and
in recent days, so you may need to get the latest version of
Biostrings directly from svn rather than using biocLite from within R.
Once you have a sufficiently recent version, the key is in the
construction of the max.Rmismatch argument. Below are some examples
that achieve the result you are looking for. The man page for
trimLRPatterns has a detailed description of the various types of input
that are accepted by the max.Rmismatch argument.
> suppressMessages(library(Biostrings))
> Rpattern <- "CTGTAGGCACCA"
> subjectSet <-
+ DNAStringSet(c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
+ "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC"))
> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
+ max.Rmismatch = rep(2, 12))
A DNAStringSet instance of length 2
width seq
[1] 22 GCTGGAACCCAGGGTGTTGTAC
[2] 24 GTAAGACCATACTTGGCCGAATGC
> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
+ max.Rmismatch = 0.2)
A DNAStringSet instance of length 2
width seq
[1] 22 GCTGGAACCCAGGGTGTTGTAC
[2] 24 GTAAGACCATACTTGGCCGAATGC
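To make the role of a per-length max.Rmismatch vector more concrete, here
is a rough base-R sketch of the idea. It is not the Biostrings
implementation: trim_right_sketch is a hypothetical helper, and its rule
(try the longest overlap first, where element i of the vector is the
number of mismatches tolerated when the first i letters of Rpattern are
aligned with the last i letters of the read) is an assumption chosen to be
consistent with the output shown above. The man page remains the
authority on the exact semantics.

```r
# Illustrative sketch only; trim_right_sketch is a hypothetical helper,
# not part of Biostrings.
trim_right_sketch <- function(subject, Rpattern, max_mismatch) {
  n_pat <- nchar(Rpattern)
  n_sub <- nchar(subject)
  stopifnot(length(max_mismatch) == n_pat)
  # Try the longest possible overlap first, then shorter ones.
  for (i in rev(seq_len(min(n_pat, n_sub)))) {
    pat_prefix <- strsplit(substr(Rpattern, 1, i), "")[[1]]
    sub_suffix <- strsplit(substr(subject, n_sub - i + 1, n_sub), "")[[1]]
    # Accept this overlap if the mismatch count is within the limit for length i.
    if (sum(pat_prefix != sub_suffix) <= max_mismatch[i]) {
      return(substr(subject, 1, n_sub - i))
    }
  }
  subject  # no acceptable overlap found, so return the read unchanged
}

Rpattern <- "CTGTAGGCACCA"
reads <- c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
           "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC")

# Same limit for every overlap length, as in rep(2, 12) above.
trimmed <- vapply(reads, trim_right_sketch, character(1),
                  Rpattern = Rpattern, max_mismatch = rep(2, 12),
                  USE.NAMES = FALSE)
print(nchar(trimmed))

# In this sketch a negative entry disables that overlap length, so only a
# full-length, exact match is trimmed and the second read is left alone.
full_only <- vapply(reads, trim_right_sketch, character(1),
                    Rpattern = Rpattern, max_mismatch = c(rep(-1, 11), 0),
                    USE.NAMES = FALSE)
print(nchar(full_only))
```

In the sketch, the second read is trimmed because the 10-letter overlap
"CTGTAGGCAC" is allowed its own mismatch limit, rather than only the full
12-letter pattern being tested.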
> sessionInfo()
R version 2.9.0 Under development (unstable) (2009-01-15 r47619)
i386-apple-darwin9.6.0
locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.11.25 IRanges_1.1.34
loaded via a namespace (and not attached):
[1] grid_2.9.0 lattice_0.17-20 Matrix_0.999375-17
Patrick
Quoting joseph franklin <joseph.franklin at yale.edu>:
> Patrick,
>
> This adapter tool looks extremely useful for my purposes: removing
> adapters from smRNA reads to estimate the short template lengths.
> Forgive me if the answer to this is obvious, but everything seems to
> work with trimLRPatterns, except that it doesn't seem to allow the
> Rpattern or Lpattern to slide along the sequence (at least using the
> default settings--see below). Rather it looks only for exact matches,
> that leave no overhang. Thus:
>
>> Rpattern <- "CTGTAGGCACCA"
>
> trims:
>
> [6] 34 GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA
>
> nicely, to:
>
> [6] 22 GCTGGAACCCAGGGTGTTGTAC
>
>
> but a sequence resulting in an Rpattern overhang (here ~2nt):
>
> [90] 34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>
> is not trimmed at all:
>
> [90] 34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>
> What can I do to allow for flexibility at the overhanging end?
>
>
> Again, thanks very much.
> Joe
>
>
> On 14 Jan 2009, at 19:17, Patrick Aboyoun wrote:
>
> I just checked in a trimLRPatterns function to the Bioconductor svn
> repository for BioC 2.4. Its signature is
>
> trimLRPatterns(Lpattern = NULL, Rpattern = NULL, subject,
> max.Lmismatch = 0, max.Rmismatch = 0,
> with.Lindels = FALSE, with.Rindels = FALSE,
> Lfixed = TRUE, Rfixed = TRUE, ranges = FALSE)
>
> As you can infer from the arguments, this function allows the user to
> set the # of mismatches (if with.*indels = FALSE) / edit distance (if
> with.*indels = TRUE) for the left and right flanking "patterns". It
> also allows for IUPAC ambiguity letters in these flanking regions if
> *fixed = FALSE. When ranges = FALSE, trimLRPatterns returns the trimmed
> strings. When ranges = TRUE, it returns the ranges that you can use to
> trim the strings. Here are some examples:
>
>> Lpattern <- "TTCTGCTTG"
>> Rpattern <- "GATCGGAAG"
>> subject <- DNAString("TTCTGCTTGACGTGATCGGA")
>> subjectSet <- DNAStringSet(c("TGCTTGACGGCAGATCGG", "TTCTGCTTGGATCGGAAG"))
>> trimLRPatterns(Lpattern = Lpattern, subject = subject)
> 11-letter "DNAString" instance
> seq: ACGTGATCGGA
>> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern, subject = subjectSet)
> A DNAStringSet instance of length 2
> width seq
> [1] 18 TGCTTGACGGCAGATCGG
> [2] 0
>> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern, subject = subjectSet,
> + ranges = TRUE)
> IRanges object:
> start end width
> 1 1 18 18
> 2 10 9 0
>
> This functionality will be available on bioconductor.org (and
> downloadable via biocLite) in the next day or so, but you can also grab
> Biostrings from svn directly if you need it sooner. It will also feed
> its way into Biostrings documentation and training material before the
> next release of Bioconductor in May.
>
>
> Patrick
>
>
>
> Patrick Aboyoun wrote:
>> David,
>> Following up on Martin's comments, I am putting the finishing
>> touches on a function called trimLRPatterns for the Biostrings
>> package. Its purpose is to trim left and/or right flanking patterns
>> from sequences, so it can strip 5' and/or 3' adapters from your
>> reads. The signature for this function is
>>
>> trimLRPatterns(Lpattern=NULL, Rpattern=NULL, subject, max.Lnedit=0,
>> max.Rnedit=0,
>> with.Lindels=FALSE, with.Rindels=FALSE, Lfixed=TRUE,
>> Rfixed=TRUE,
>> rangesOnly = FALSE)
>>
>> I will be checking this function into the BioC 2.4 code line, which
>> requires using R-devel, sometime today or tomorrow. I will send
>> out an e-mail to this group when I check it in and show a simple
>> example of its usage. I talked with Martin and he will wrap this
>> functionality in the ShortRead layer so you don't have to leave the
>> ShortRead class system when removing adapters from your reads.
>>
>>
>> Cheers,
>> Patrick
>>