[Bioc-sig-seq] adapter removal
Patrick Aboyoun
paboyoun at fhcrc.org
Mon Jan 19 02:59:41 CET 2009
Kasper,
Yes, but there is a delay of 12 to 36 hours between an svn checkin and a
package being available at bioconductor.org.
Patrick
Quoting Kasper Daniel Hansen <khansen at stat.berkeley.edu>:
> Shouldn't biocLite pick up recent additions to the subversion
> repository, provided that you are using R-devel and you install using
> pkgType = "source"?
>
> Kasper
>
> On Jan 17, 2009, at 19:24 , Patrick Aboyoun wrote:
>
>> Joe,
>> I have been making some modifications to trimLRPatterns both today
>> and in recent days, so you may need to get the latest version of
>> Biostrings directly from svn rather than using biocLite from within
>> R. Once you have a sufficiently recent version, the key is in the
>> construction of the max.Rmismatch argument. Below are some examples
>> that achieve the result you are looking for. The man page for
>> trimLRPatterns has a detailed description of the various types of
>> inputs that are accepted by the max.Rmismatch argument.
>>
>>
>>> suppressMessages(library(Biostrings))
>>> Rpattern <- "CTGTAGGCACCA"
>>> subjectSet <-
>> + DNAStringSet(c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
>> + "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC"))
>>> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
>> + max.Rmismatch = rep(2, 12))
>> A DNAStringSet instance of length 2
>> width seq
>> [1] 22 GCTGGAACCCAGGGTGTTGTAC
>> [2] 24 GTAAGACCATACTTGGCCGAATGC
>>> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
>> + max.Rmismatch = 0.2)
>> A DNAStringSet instance of length 2
>> width seq
>> [1] 22 GCTGGAACCCAGGGTGTTGTAC
>> [2] 24 GTAAGACCATACTTGGCCGAATGC
>>> sessionInfo()
>> R version 2.9.0 Under development (unstable) (2009-01-15 r47619)
>> i386-apple-darwin9.6.0
>>
>> locale:
>> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] Biostrings_2.11.25 IRanges_1.1.34
>>
>> loaded via a namespace (and not attached):
>> [1] grid_2.9.0 lattice_0.17-20 Matrix_0.999375-17
>>
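For readers following the thread, here is a minimal base-R sketch of how the two max.Rmismatch forms used above might be interpreted. It is not the Biostrings implementation. It assumes that element i of the vector is the mismatch limit when the first i letters of Rpattern are aligned with the last i letters of the read, that a single value in (0, 1) is an error rate giving a limit of floor(rate * i), and that the longest acceptable overlap wins. The name trim_right_sketch is made up for illustration.

```r
# Base-R sketch (NOT the Biostrings code) of right-adapter trimming with
# per-overlap-length mismatch limits. trim_right_sketch() is hypothetical.
trim_right_sketch <- function(subject, Rpattern, max_mismatch) {
  nR <- nchar(Rpattern)
  nS <- nchar(subject)
  if (nR == 0 || nS == 0) return(subject)
  # Assumption: a single value in (0, 1) is an error rate that is scaled
  # by the overlap length i and rounded down.
  if (length(max_mismatch) == 1 && max_mismatch > 0 && max_mismatch < 1) {
    max_mismatch <- floor(max_mismatch * seq_len(nR))
  }
  # Otherwise one limit per overlap length is expected.
  stopifnot(length(max_mismatch) == nR)
  # Assumption: try the longest overlap first, so a complete adapter is
  # preferred over a partial one.
  for (i in seq(min(nR, nS), 1)) {
    pat  <- strsplit(substr(Rpattern, 1, i), "")[[1]]
    tail <- strsplit(substr(subject, nS - i + 1, nS), "")[[1]]
    if (sum(pat != tail) <= max_mismatch[i]) {
      return(substr(subject, 1, nS - i))
    }
  }
  subject  # no acceptable overlap: leave the read untrimmed
}

Rpattern <- "CTGTAGGCACCA"
reads <- c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
           "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC")
by_count <- vapply(reads, trim_right_sketch, character(1),
                   Rpattern = Rpattern, max_mismatch = rep(2, 12),
                   USE.NAMES = FALSE)
by_rate  <- vapply(reads, trim_right_sketch, character(1),
                   Rpattern = Rpattern, max_mismatch = 0.2,
                   USE.NAMES = FALSE)
print(nchar(by_count))  # widths 22 and 24, as in the transcript above
print(nchar(by_rate))   # widths 22 and 24 again
```

In this sketch, rep(2, 12) accepts any overlap of one or two letters, so a read with no adapter at all still loses its last two bases. The error-rate form allows no mismatches on overlaps shorter than five letters, so it trims a short tail only when that tail matches the start of the adapter exactly.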
>>
>> Patrick
>>
>>
>> Quoting joseph franklin <joseph.franklin at yale.edu>:
>>
>>> Patrick,
>>>
>>> This adapter tool looks extremely useful for my purposes: removing
>>> adapters from smRNA reads to estimate the short template lengths.
>>> Forgive me if the answer to this is obvious, but everything seems to
>>> work with trimLRPatterns, except that it doesn't seem to allow the
>>> Rpattern or Lpattern to slide along the sequence (at least using the
>>> default settings--see below). Rather, it looks only for exact matches
>>> that leave no overhang. Thus:
>>>
>>>> Rpattern <- "CTGTAGGCACCA"
>>>
>>> trims:
>>>
>>> [6] 34 GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA
>>>
>>> nicely, to:
>>>
>>> [6] 22 GCTGGAACCCAGGGTGTTGTAC
>>>
>>>
>>> but a sequence resulting in an Rpattern overhang (here ~2nt):
>>>
>>> [90] 34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>>>
>>> is not trimmed at all:
>>>
>>> [90] 34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>>> :
>>>
>>> What can I do to allow for flexibility at the overhanging end?
>>>
>>>
>>> Again, thanks very much.
>>> Joe
>>>
>>>
>>> On 14 Jan 2009, at 19:17, Patrick Aboyoun wrote:
>>>
>>> I just checked in a trimLRPatterns function to the Bioconductor svn
>>> repository for BioC 2.4. Its signature is
>>>
>>> trimLRPatterns(Lpattern = NULL, Rpattern = NULL, subject,
>>> max.Lmismatch = 0, max.Rmismatch = 0,
>>> with.Lindels = FALSE, with.Rindels = FALSE,
>>> Lfixed = TRUE, Rfixed = TRUE, ranges = FALSE)
>>>
>>> As you can infer from the arguments, this function allows the user to
>>> set the number of mismatches (if with.*indels = FALSE) or the edit distance (if
>>> with.*indels = TRUE) for the left and right flanking "patterns". It
>>> also allows for IUPAC ambiguity letters in these flanking regions if
>>> *fixed = FALSE. When ranges = FALSE, trimLRPatterns returns the trimmed
>>> strings. When ranges = TRUE, it returns the ranges that you can use to
>>> trim the strings. Here are some examples:
>>>
>>>> Lpattern <- "TTCTGCTTG"
>>>> Rpattern <- "GATCGGAAG"
>>>> subject <- DNAString("TTCTGCTTGACGTGATCGGA")
>>>> subjectSet <- DNAStringSet(c("TGCTTGACGGCAGATCGG", "TTCTGCTTGGATCGGAAG"))
>>>> trimLRPatterns(Lpattern = Lpattern, subject = subject)
>>> 11-letter "DNAString" instance
>>> seq: ACGTGATCGGA
>>>> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern,
>>> + subject = subjectSet)
>>> A DNAStringSet instance of length 2
>>> width seq
>>> [1] 18 TGCTTGACGGCAGATCGG
>>> [2] 0
>>>> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern,
>>> + subject = subjectSet, ranges = TRUE)
>>> IRanges object:
>>> start end width
>>> 1 1 18 18
>>> 2 10 9 0
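As a rough illustration of how the start and end columns printed above could be applied to the untrimmed reads, here is a small sketch. It uses only base R character vectors rather than Biostrings objects, and the start and end values are simply copied from the ranges = TRUE output.

```r
# Base-R sketch: apply start/end ranges to the untrimmed reads.
# The values below are copied from the ranges = TRUE output in the thread.
reads  <- c("TGCTTGACGGCAGATCGG", "TTCTGCTTGGATCGGAAG")
starts <- c(1L, 10L)
ends   <- c(18L, 9L)

# substring() is vectorised over its arguments; a range whose end is
# smaller than its start yields an empty string, i.e. a width-0 result.
trimmed <- substring(reads, starts, ends)
widths  <- ends - starts + 1L

print(trimmed)
print(widths)
```

The second read consists of nothing but the two flanking patterns, which is why its range has width 0 and the trimmed string is empty.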
>>>
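The difference between a mismatch count and an edit distance, which the with.*indels arguments switch between, can be seen with a small base-R sketch. It does not call Biostrings. The helper count_mismatches and the "observed" string are made up for illustration, and adist() from the utils package stands in for whatever edit-distance routine Biostrings uses internally.

```r
# Base-R sketch of the two distance notions; not Biostrings code.
# count_mismatches() is a hypothetical helper for equal-length strings.
count_mismatches <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

adapter  <- "GATCGGAAG"
# Made-up read end: the C of the adapter is missing, so every later
# letter is shifted one position to the left.
observed <- "GATGGAAGT"

# Position-by-position comparison, as when indels are not allowed.
n_mismatch <- count_mismatches(adapter, observed)

# Levenshtein edit distance: one deletion plus one insertion.
n_edit <- as.integer(utils::adist(adapter, observed)[1, 1])

print(c(mismatches = n_mismatch, edit_distance = n_edit))
```

A single missing base costs four mismatches here but only two edits, which is the kind of case where allowing indels makes a flanking pattern easier to recognise.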
>>> This functionality will be available on bioconductor.org (and
>>> downloadable via biocLite) in the next day or so, but you can also grab
>>> Biostrings from svn directly if you need it sooner. It will also find
>>> its way into the Biostrings documentation and training material before the
>>> next release of Bioconductor in May.
>>>
>>>
>>> Patrick
>>>
>>>
>>>
>>> Patrick Aboyoun wrote:
>>>> David,
>>>> Following up on Martin's comments, I am putting the finishing
>>>> touches on a function called trimLRPatterns for the Biostrings
>>>> package. Its purpose is to trim left and/or right flanking
>>>> patterns from sequences, so it can strip 5' and/or 3' adapters
>>>> from your reads. The signature for this function is
>>>>
>>>> trimLRPatterns(Lpattern=NULL, Rpattern=NULL, subject,
>>>> max.Lnedit=0, max.Rnedit=0,
>>>> with.Lindels=FALSE, with.Rindels=FALSE,
>>>> Lfixed=TRUE, Rfixed=TRUE, rangesOnly=FALSE)
>>>>
>>>> I will be checking this function into the BioC 2.4 code line,
>>>> which requires using R-devel, sometime today or tomorrow. I will
>>>> send out an e-mail to this group when I check it in and show a
>>>> simple example of its usage. I talked with Martin and he will
>>>> wrap this functionality in the ShortRead layer so you don't have
>>>> to leave the ShortRead class system when removing adapters from
>>>> your reads.
>>>>
>>>>
>>>> Cheers,
>>>> Patrick
>>>>
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing