[Bioc-sig-seq] adapter removal

Kasper Daniel Hansen khansen at stat.berkeley.edu
Mon Jan 19 01:23:27 CET 2009


Shouldn't biocLite pick up recent additions to the subversion  
repository, provided that you are using R-devel and you install using  
pkgType = "source"?

Kasper

On Jan 17, 2009, at 19:24 , Patrick Aboyoun wrote:

> Joe,
> I have been making some modifications to trimLRPatterns both today
> and in recent days, so you may need to get the latest version of
> Biostrings directly from svn rather than using biocLite from within
> R. Once you have a sufficiently recent version, the key is in the
> construction of the max.Rmismatch argument. Below are some examples
> that achieve the result you are looking for. The man page for
> trimLRPatterns has a detailed description of the various types of
> input accepted by the max.Rmismatch argument.
>
>
>> suppressMessages(library(Biostrings))
>> Rpattern <- "CTGTAGGCACCA"
>> subjectSet <-
> + DNAStringSet(c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
> +                "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC"))
>> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
> +                max.Rmismatch = rep(2, 12))
>  A DNAStringSet instance of length 2
>    width seq
> [1]    22 GCTGGAACCCAGGGTGTTGTAC
> [2]    24 GTAAGACCATACTTGGCCGAATGC
>> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
> +                max.Rmismatch = 0.2)
>  A DNAStringSet instance of length 2
>    width seq
> [1]    22 GCTGGAACCCAGGGTGTTGTAC
> [2]    24 GTAAGACCATACTTGGCCGAATGC
>> sessionInfo()
> R version 2.9.0 Under development (unstable) (2009-01-15 r47619)
> i386-apple-darwin9.6.0
>
> locale:
> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Biostrings_2.11.25 IRanges_1.1.34
>
> loaded via a namespace (and not attached):
> [1] grid_2.9.0         lattice_0.17-20    Matrix_0.999375-17
>
>
> Patrick
>
>
> Quoting joseph franklin <joseph.franklin at yale.edu>:
>
>> Patrick,
>>
>> This adapter tool looks extremely useful for my purposes: removing
>> adapters from smRNA reads to estimate the short template lengths.
>> Forgive me if the answer to this is obvious, but everything seems to
>> work with trimLRPatterns, except that it doesn't seem to allow the
>> Rpattern or Lpattern to slide along the sequence (at least using the
>> default settings; see below). Rather, it looks only for exact
>> matches that leave no overhang. Thus:
>>
>>> Rpattern <- "CTGTAGGCACCA"
>>
>> trims:
>>
>> [6]    34 GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA
>>
>> nicely, to:
>>
>> [6]    22 GCTGGAACCCAGGGTGTTGTAC
>>
>>
>> but a sequence that leaves an Rpattern overhang (here ~2 nt):
>>
>> [90]    34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>>
>> is not trimmed at all:
>>
>> [90]    34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>>                                                      :
>>
>> What can I do to allow for flexibility at the overhanging end?
>>
>>
>> Again, thanks very much.
>> Joe
>>
>>
>> On 14 Jan 2009, at 19:17, Patrick Aboyoun wrote:
>>
>> I just checked in a trimLRPatterns function to the Bioconductor svn
>> repository for BioC 2.4. Its signature is
>>
>> trimLRPatterns(Lpattern = NULL, Rpattern = NULL, subject,
>>               max.Lmismatch = 0, max.Rmismatch = 0,
>>               with.Lindels = FALSE, with.Rindels = FALSE,
>>               Lfixed = TRUE, Rfixed = TRUE, ranges = FALSE)
>>
>> As you can infer from the arguments, this function allows the user to
>> set the number of mismatches (if with.*indels = FALSE) or the edit
>> distance (if with.*indels = TRUE) for the left and right flanking
>> "patterns". It also allows for IUPAC ambiguity letters in these
>> flanking regions if *fixed = FALSE. When ranges = FALSE,
>> trimLRPatterns returns the trimmed strings. When ranges = TRUE, it
>> returns the ranges that you can use to trim the strings. Here are
>> some examples:
>>
>>>  Lpattern <- "TTCTGCTTG"
>>>  Rpattern <- "GATCGGAAG"
>>>  subject <- DNAString("TTCTGCTTGACGTGATCGGA")
>>>  subjectSet <- DNAStringSet(c("TGCTTGACGGCAGATCGG",
>> +                              "TTCTGCTTGGATCGGAAG"))
>>>  trimLRPatterns(Lpattern = Lpattern, subject = subject)
>> 11-letter "DNAString" instance
>> seq: ACGTGATCGGA
>>>  trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern,
>> +                  subject = subjectSet)
>> A DNAStringSet instance of length 2
>>  width seq
>> [1]    18 TGCTTGACGGCAGATCGG
>> [2]     0
>>>  trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern,
>> +                  subject = subjectSet, ranges = TRUE)
>> IRanges object:
>>   start end width
>> 1     1  18    18
>> 2    10   9     0
>>
>> This functionality will be available on bioconductor.org (and
>> downloadable via biocLite) in the next day or so, but you can also
>> grab Biostrings from svn directly if you need it sooner. It will also
>> feed its way into Biostrings documentation and training material
>> before the next release of Bioconductor in May.
>>
>>
>> Patrick
>>
>>
>>
>> Patrick Aboyoun wrote:
>>> David,
>>> Following up on Martin's comments, I am putting the finishing
>>> touches on a function called trimLRPatterns for the Biostrings
>>> package. Its purpose is to trim left and/or right flanking
>>> patterns from sequences, so it can strip 5' and/or 3' adapters
>>> from your reads. The signature for this function is
>>>
>>> trimLRPatterns(Lpattern=NULL, Rpattern=NULL, subject,
>>>                max.Lnedit=0, max.Rnedit=0,
>>>                with.Lindels=FALSE, with.Rindels=FALSE,
>>>                Lfixed=TRUE, Rfixed=TRUE,
>>>                rangesOnly = FALSE)
>>>
>>> I will be checking this function into the BioC 2.4 code line,
>>> which requires using R-devel, sometime today or tomorrow. I will
>>> send out an e-mail to this group when I check it in and show a
>>> simple example of its usage. I talked with Martin and he will
>>> wrap this functionality in the ShortRead layer so you don't have
>>> to leave the ShortRead class system when removing adapters from
>>> your reads.
>>>
>>>
>>> Cheers,
>>> Patrick
>>>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
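
For anyone following the overhang question in this thread: the idea of
letting the right-hand adapter "slide" off the 3' end of a read can be
sketched in base R, without Biostrings. This is only an illustration of
the concept and not how trimLRPatterns is implemented. The function
trim_right_adapter and its arguments max_mismatch and min_overlap are
hypothetical names made up for this sketch. It tries the full-length
adapter first, then progressively shorter adapter prefixes against the
end of the read, and trims at the first overlap with few enough
mismatches.

```r
# Hypothetical helper, NOT part of Biostrings: trim a right-hand adapter
# that may run off the 3' end of the read.
trim_right_adapter <- function(subject, adapter,
                               max_mismatch = 0L, min_overlap = 1L) {
  sub_chars <- strsplit(subject, "", fixed = TRUE)[[1]]
  ada_chars <- strsplit(adapter, "", fixed = TRUE)[[1]]
  n_sub <- length(sub_chars)
  max_overlap <- min(n_sub, length(ada_chars))
  if (max_overlap < min_overlap) return(subject)

  # Longest overlap first, then let the adapter slide off the read end.
  for (k in seq(max_overlap, min_overlap)) {
    suffix <- sub_chars[(n_sub - k + 1L):n_sub]  # last k letters of the read
    prefix <- ada_chars[seq_len(k)]              # first k letters of adapter
    if (sum(suffix != prefix) <= max_mismatch) {
      return(substr(subject, 1L, n_sub - k))
    }
  }
  subject  # no acceptable overlap: leave the read untouched
}

# The two reads and the adapter quoted in this thread.
reads <- c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
           "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC")
trimmed <- vapply(reads, trim_right_adapter, character(1),
                  adapter = "CTGTAGGCACCA", USE.NAMES = FALSE)
print(trimmed)
print(nchar(trimmed))  # 22 and 24, the widths in Patrick's output
```

With min_overlap = 1 and exact matching, any read that happens to end
in the adapter's first letter loses that letter, so a larger
min_overlap is probably sensible on real data. That is an assumption
about typical use, not something tested here.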


