[Bioc-sig-seq] adapter removal
Kasper Daniel Hansen
khansen at stat.berkeley.edu
Mon Jan 19 01:23:27 CET 2009
Shouldn't biocLite pick up recent additions to the subversion
repository, provided that you are using R-devel and you install using
pkgType = "source"?
Kasper
On Jan 17, 2009, at 19:24 , Patrick Aboyoun wrote:
> Joe,
> I have been making some modifications to trimLRPatterns both today
> and in recent days, so you may need to get the latest version of
> Biostrings directly from svn rather than using biocLite from within
> R. Once you have a sufficiently recent version, the key is in the
> construction of the max.Rmismatch argument. Below are some examples
> that achieve the result you are looking for. The man page for
> trimLRPatterns has a detailed description of the various types of
> input accepted by the max.Rmismatch argument.
>
>
>> suppressMessages(library(Biostrings))
>> Rpattern <- "CTGTAGGCACCA"
>> subjectSet <-
> + DNAStringSet(c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
> + "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC"))
>> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
> + max.Rmismatch = rep(2, 12))
> A DNAStringSet instance of length 2
> width seq
> [1] 22 GCTGGAACCCAGGGTGTTGTAC
> [2] 24 GTAAGACCATACTTGGCCGAATGC
>> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
> + max.Rmismatch = 0.2)
> A DNAStringSet instance of length 2
> width seq
> [1] 22 GCTGGAACCCAGGGTGTTGTAC
> [2] 24 GTAAGACCATACTTGGCCGAATGC
>> sessionInfo()
> R version 2.9.0 Under development (unstable) (2009-01-15 r47619)
> i386-apple-darwin9.6.0
>
> locale:
> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Biostrings_2.11.25 IRanges_1.1.34
>
> loaded via a namespace (and not attached):
> [1] grid_2.9.0 lattice_0.17-20 Matrix_0.999375-17
>
>
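To make the role of the max.Rmismatch vector concrete, here is a rough base-R sketch of the idea as I read it from the examples above: element i of the vector caps the number of mismatches allowed when the first i letters of Rpattern are compared with the last i letters of the read, and the longest acceptable overlap is trimmed. The helper name trim_right_sketch is made up for illustration, it ignores indels and IUPAC ambiguity letters, and it is not how Biostrings implements trimLRPatterns. The -1 padding used to mimic the default behaviour, and the floor() conversion of a mismatch rate, are assumptions as well.

```r
# Illustrative only: trim_right_sketch() is a hypothetical helper,
# not part of Biostrings, and handles plain mismatches only.
trim_right_sketch <- function(subject, Rpattern, max_mismatch) {
  n <- nchar(Rpattern)
  stopifnot(length(max_mismatch) == n)
  s <- strsplit(subject, "")[[1]]
  p <- strsplit(Rpattern, "")[[1]]
  # Try the longest overlap first, then let the pattern hang further
  # off the right end of the read.
  for (i in rev(seq_len(min(n, length(s))))) {
    tail_i <- s[(length(s) - i + 1):length(s)]
    mismatches <- sum(tail_i != p[seq_len(i)])
    if (mismatches <= max_mismatch[i]) {
      return(substr(subject, 1, length(s) - i))
    }
  }
  subject  # no acceptable overlap: leave the read untouched
}

Rpattern <- "CTGTAGGCACCA"
reads <- c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
           "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC")

# Assumption: a single 0 behaves like "full-length exact match only",
# mimicked here by forbidding every shorter overlap with -1.
full_only <- c(rep(-1, 11), 0)
trim_right_sketch(reads[2], Rpattern, full_only)    # read comes back unchanged

# Flat allowance of 2 mismatches at every overlap length.
trim_right_sketch(reads[1], Rpattern, rep(2, 12))   # "GCTGGAACCCAGGGTGTTGTAC"
trim_right_sketch(reads[2], Rpattern, rep(2, 12))   # "GTAAGACCATACTTGGCCGAATGC"

# Assumption: a rate of 0.2 becomes per-length counts by flooring i * 0.2.
rate_vec <- floor(seq_len(12) * 0.2)
trim_right_sketch(reads[2], Rpattern, rate_vec)     # "GTAAGACCATACTTGGCCGAATGC"
```

One thing the sketch makes visible: with a flat allowance such as rep(2, 12), any 1- or 2-letter tail counts as a match in this toy version, so a read with no adapter at all would still lose its last two letters. A vector that grows with the overlap length avoids that.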
> Patrick
>
>
> Quoting joseph franklin <joseph.franklin at yale.edu>:
>
>> Patrick,
>>
>> This adapter tool looks extremely useful for my purposes: removing
>> adapters from smRNA reads to estimate the short template lengths.
>> Forgive me if the answer to this is obvious, but everything seems to
>> work with trimLRPatterns, except that it doesn't seem to allow the
>> Rpattern or Lpattern to slide along the sequence (at least using the
>> default settings; see below). Rather, it looks only for exact matches
>> that leave no overhang. Thus:
>>
>>> Rpattern <- "CTGTAGGCACCA"
>>
>> trims:
>>
>> [6] 34 GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA
>>
>> nicely, to:
>>
>> [6] 22 GCTGGAACCCAGGGTGTTGTAC
>>
>>
>> but a sequence resulting in an Rpattern overhang (here ~2 nt):
>>
>> [90] 34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>>
>> is not trimmed at all:
>>
>> [90] 34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>>
>> What can I do to allow for flexibility at the overhanging end?
>>
>>
>> Again, thanks very much.
>> Joe
>>
>>
>> On 14 Jan 2009, at 19:17, Patrick Aboyoun wrote:
>>
>> I just checked in a trimLRPatterns function to the Bioconductor svn
>> repository for BioC 2.4. Its signature is
>>
>> trimLRPatterns(Lpattern = NULL, Rpattern = NULL, subject,
>> max.Lmismatch = 0, max.Rmismatch = 0,
>> with.Lindels = FALSE, with.Rindels = FALSE,
>> Lfixed = TRUE, Rfixed = TRUE, ranges = FALSE)
>>
>> As you can infer from the arguments, this function allows the user to
>> set the # of mismatches (if with.*indels = FALSE) / edit distance (if
>> with.*indels = TRUE) for the left and right flanking "patterns". It
>> also allows for IUPAC ambiguity letters in these flanking regions if
>> *fixed = FALSE. When ranges = FALSE, trimLRPatterns returns the
>> trimmed strings. When ranges = TRUE, it returns the ranges that you
>> can use to trim the strings. Here are some examples:
>>
>>> Lpattern <- "TTCTGCTTG"
>>> Rpattern <- "GATCGGAAG"
>>> subject <- DNAString("TTCTGCTTGACGTGATCGGA")
>>> subjectSet <- DNAStringSet(c("TGCTTGACGGCAGATCGG",
>> +                            "TTCTGCTTGGATCGGAAG"))
>>> trimLRPatterns(Lpattern = Lpattern, subject = subject)
>> 11-letter "DNAString" instance
>> seq: ACGTGATCGGA
>>> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern,
>> +                subject = subjectSet)
>> A DNAStringSet instance of length 2
>> width seq
>> [1] 18 TGCTTGACGGCAGATCGG
>> [2] 0
>>> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern,
>> +                subject = subjectSet, ranges = TRUE)
>> IRanges object:
>> start end width
>> 1 1 18 18
>> 2 10 9 0
>>
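The ranges returned with ranges = TRUE are just start/end coordinates, so in principle any substring tool can apply them. A small base-R illustration, with the start and end values copied by hand from the IRanges output above (in a real session you would take them from the returned object rather than typing them in; the variable names here are made up):

```r
# Reads and coordinates copied from the example above; typed in by hand
# purely for illustration.
reads  <- c("TGCTTGACGGCAGATCGG", "TTCTGCTTGGATCGGAAG")
starts <- c(1, 10)
ends   <- c(18, 9)

# substring() is vectorised over start and end; a range whose end is
# smaller than its start yields the empty string.
trimmed <- substring(reads, starts, ends)
trimmed
nchar(trimmed)
```

The second read comes back empty because its range has width 0: the left and right patterns together cover the whole sequence.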
>> This functionality will be available on bioconductor.org (and
>> downloadable via biocLite) in the next day or so, but you can also
>> grab Biostrings from svn directly if you need it sooner. It will also
>> feed its way into Biostrings documentation and training material
>> before the next release of Bioconductor in May.
>>
>>
>> Patrick
>>
>>
>>
>> Patrick Aboyoun wrote:
>>> David,
>>> Following up on Martin's comments, I am putting the finishing
>>> touches on a function called trimLRPatterns for the Biostrings
>>> package. Its purpose is to trim left and/or right flanking
>>> patterns from sequences, so it can strip 5' and/or 3' adapters
>>> from your reads. The signature for this function is
>>>
>>> trimLRPatterns(Lpattern=NULL, Rpattern=NULL, subject,
>>> max.Lnedit=0, max.Rnedit=0,
>>> with.Lindels=FALSE, with.Rindels=FALSE,
>>> Lfixed=TRUE, Rfixed=TRUE,
>>> rangesOnly=FALSE)
>>>
>>> I will be checking this function into the BioC 2.4 code line,
>>> which requires using R-devel, sometime today or tomorrow. I will
>>> send out an e-mail to this group when I check it in and show a
>>> simple example of its usage. I talked with Martin and he will
>>> wrap this functionality in the ShortRead layer so you don't have
>>> to leave the ShortRead class system when removing adapters from
>>> your reads.
>>>
>>>
>>> Cheers,
>>> Patrick
>>>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing