[Bioc-sig-seq] adapter removal
Martin Morgan
mtmorgan at fhcrc.org
Sat Jan 31 05:29:39 CET 2009
Victor Ruotti <ruotti at wisc.edu> writes:
> Maybe a not a straight topic related thing to adapter removal...
> However, do you guys have a simple way to split the fastq files once
> loaded into biostrings?
you can just subset the object you've read in.
> library(ShortRead)
> example(readFastq)
[snip]
> rfq1
class: ShortReadQ
length: 256 reads; width: 36 cycles
> rfq[1:20]
class: ShortReadQ
length: 20 reads; width: 36 cycles
> writeFastq(rfq[1:2], file="/tmp/firstFive.fastq")
> readLines("/tmp/firstFive.fastq")
[1] "@HWI-EAS88_1_1_1_1001_499"
[2] "GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT"
[3] "+HWI-EAS88_1_1_1_1001_499"
[4] "]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS"
[5] "@HWI-EAS88_1_1_1_898_392"
[6] "GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC"
[7] "+HWI-EAS88_1_1_1_898_392"
[8] "]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK"
standard R code could be used to identify five more-or-less equal
chunks (e.g., using as.integer(cut(seq_len(length(rfq)), nchunks))) or
other operations.
Was that what you were thinking of?
Martin
> Say you load a whole lane and now want to split 12M reads and their
> qualities into multiple fastq files from processing with Maq, gmap,
> etc?
> Perhaps this should be done outside or R? Just wanted to see what you
> think...
> See below
> ...
>
> From the Grid Engine Life Sciense SIG
>
> Can you share the program you use to split the reads?
> I'm the process of writing a fasta/fastq/seq/prb splitter. I use Perl
> and thought about starting a bioperl module for next gen stuff.
> They already have a bunch of modules to deal with qualities, so I was
> thinking adding a method to split fastq files for maq/gmap processing
> would be something good to have in bioperl. Great work had also being
> done with biostrings and maybe Martin can comment on this.
>
> As simple as it might sound it would be good to have bioperl and maybe
> biostrings setup for this.
> Will try to post this in the biostrings forum as well.
> Any thoughts/interest on this?
>
> Victor
>
>
> On Jan 18, 2009, at 7:59 PM, Patrick Aboyoun wrote:
>
>> Kasper,
>> Yes, but there is between 12 - 36 delay between an svn checkin and a
>> package being available at bioconductor.org.
>>
>>
>> Patrick
>>
>>
>> Quoting Kasper Daniel Hansen <khansen at stat.berkeley.edu>:
>>
>>> Shouldn't biocLite pick up recent additions to the subversion
>>> repository, provided that you are using R-devel and you install using
>>> pkgType = "source"?
>>>
>>> Kasper
>>>
>>> On Jan 17, 2009, at 19:24 , Patrick Aboyoun wrote:
>>>
>>>> Joe,
>>>> I have been making some modifications to trimLRPatterns both
>>>> today and in recent days, so you may need to get the latest
>>>> version of Biostrings directly from svn rather than using
>>>> biocLite from within R. Once you have a recently sufficient
>>>> version, the key is in the construction of the max.Rmismatch
>>>> argument. Below are some examples they achieve the result you are
>>>> looking for. The man page for trimLRPatterns has a detailed
>>>> description on various types of inputs that are accepted by the
>>>> max.Rmismatch argument.
>>>>
>>>>
>>>>> suppressMessages(library(Biostrings))
>>>>> Rpattern <- "CTGTAGGCACCA"
>>>>> subjectSet <-
>>>> + DNAStringSet(c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
>>>> + "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC"))
>>>>> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
>>>> + max.Rmismatch = rep(2, 12))
>>>> A DNAStringSet instance of length 2
>>>> width seq
>>>> [1] 22 GCTGGAACCCAGGGTGTTGTAC
>>>> [2] 24 GTAAGACCATACTTGGCCGAATGC
>>>>> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
>>>> + max.Rmismatch = 0.2)
>>>> A DNAStringSet instance of length 2
>>>> width seq
>>>> [1] 22 GCTGGAACCCAGGGTGTTGTAC
>>>> [2] 24 GTAAGACCATACTTGGCCGAATGC
>>>>> sessionInfo()
>>>> R version 2.9.0 Under development (unstable) (2009-01-15 r47619)
>>>> i386-apple-darwin9.6.0
>>>>
>>>> locale:
>>>> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>> other attached packages:
>>>> [1] Biostrings_2.11.25 IRanges_1.1.34
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] grid_2.9.0 lattice_0.17-20 Matrix_0.999375-17
>>>>
>>>>
>>>> Patrick
>>>>
>>>>
>>>> Quoting joseph franklin <joseph.franklin at yale.edu>:
>>>>
>>>>> Patrick,
>>>>>
>>>>> This adapter tool looks extremely useful for my purposes: removing
>>>>> adapters from smRNA reads to estimate the short template lengths.
>>>>> Forgive me if the answer to this is obvious, but everything seems
>>>>> to
>>>>> work with trimLRPatterns, except that it doesn't seem to allow the
>>>>> Rpattern or Lpattern to slide along the sequence (at least using
>>>>> the
>>>>> default settings--see below). Rather it looks only for exact
>>>>> matches,
>>>>> that leave no overhang. Thus:
>>>>>
>>>>>> Rpattern <- "CTGTAGGCACCA"
>>>>>
>>>>> trims:
>>>>>
>>>>> [6] 34 GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA
>>>>>
>>>>> nicely, to:
>>>>>
>>>>> [6] 22 GCTGGAACCCAGGGTGTTGTAC
>>>>>
>>>>>
>>>>> but a sequence where resulting in an Rpattern overhang (here ~2nt):
>>>>>
>>>>> [90] 34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>>>>>
>>>>> is not trimmed at all:
>>>>>
>>>>> [90] 34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>>>>> :
>>>>>
>>>>> What can I do to allow for flexibility at the overhanging end?
>>>>>
>>>>>
>>>>> Again, thanks very much.
>>>>> Joe
>>>>>
>>>>>
>>>>> On 14 Jan 2009, at 19:17, Patrick Aboyoun wrote:
>>>>>
>>>>> I just checked in a trimLRPatterns function to the Bioconductor svn
>>>>> repository for BioC 2.4. Its signature is
>>>>>
>>>>> trimLRPatterns(Lpattern = NULL, Rpattern = NULL, subject,
>>>>> max.Lmismatch = 0, max.Rmismatch = 0,
>>>>> with.Lindels = FALSE, with.Rindels = FALSE,
>>>>> Lfixed = TRUE, Rfixed = TRUE, ranges = FALSE)
>>>>>
>>>>> As you can infer from the arguments, this function allows the
>>>>> user to
>>>>> set the # of mismatches (if with.*indels = FALSE) / edit distance
>>>>> (if
>>>>> with.*indels = TRUE) for the left and right flanking "patterns". It
>>>>> also allows for IUPAC ambiguity letters in these flanking regions
>>>>> if
>>>>> *fixed = FALSE. When ranges = FALSE, trimLRPatterns returns the
>>>>> trimmed
>>>>> strings. When ranges = TRUE, it returns the ranges that you can
>>>>> use to
>>>>> trim the strings. Here are some examples:
>>>>>
>>>>>> Lpattern <- "TTCTGCTTG"
>>>>>> Rpattern <- "GATCGGAAG"
>>>>>> subject <- DNAString("TTCTGCTTGACGTGATCGGA")
>>>>>> subjectSet <- DNAStringSet(c("TGCTTGACGGCAGATCGG",
>>>>>> "TTCTGCTTGGATCGGAAG"))
>>>>>> trimLRPatterns(Lpattern = Lpattern, subject = subject)
>>>>> 11-letter "DNAString" instance
>>>>> seq: ACGTGATCGGA
>>>>>> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern, subject =
>>>>> subjectSet)
>>>>> A DNAStringSet instance of length 2
>>>>> width seq
>>>>> [1] 18 TGCTTGACGGCAGATCGG
>>>>> [2] 0
>>>>>> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern, subject =
>>>>> subjectSet,
>>>>> + ranges = TRUE)
>>>>> IRanges object:
>>>>> start end width
>>>>> 1 1 18 18
>>>>> 2 10 9 0
>>>>>
>>>>> This functionality will be available on bioconductor.org (and
>>>>> downloadable via biocLite) in the next day or so, but you can
>>>>> also grab
>>>>> Biostrings from svn directly if you need it sooner. It will also
>>>>> feed
>>>>> its way into Biostrings documentation and training material
>>>>> before the
>>>>> next release of Bioconductor in May.
>>>>>
>>>>>
>>>>> Patrick
>>>>>
>>>>>
>>>>>
>>>>> Patrick Aboyoun wrote:
>>>>>> David,
>>>>>> Following up on Martin's comments, I am putting the finishing
>>>>>> touches on a function called trimLRPatterns for the Biostrings
>>>>>> package. Its purpose is to trim left and/or right flanking
>>>>>> patterns from sequences, so it can strip 5' and/or 3' adapters
>>>>>> from your reads. The signature for this function is
>>>>>>
>>>>>> trimLRPatterns(Lpattern=NULL, Rpattern=NULL, subject,
>>>>>> max.Lnedit=0, max.Rnedit=0,
>>>>>> with.Lindels=FALSE, with.Rindels=FALSE, Lfixed=TRUE,
>>>>>> Rfixed=TRUE,
>>>>>> rangesOnly = FALSE)
>>>>>>
>>>>>> I will be checking this function into the BioC 2.4 code line,
>>>>>> which requires using R-devel, sometime today or tomorrow. I
>>>>>> will send out an e-mail to this group when I check it in and
>>>>>> show a simple example of its usage. I talked with Martin and
>>>>>> he will wrap this functionality in the ShortRead layer so you
>>>>>> don't have to leave the ShortRead class system when removing
>>>>>> adapters from your reads.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Patrick
>>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793
More information about the Bioc-sig-sequencing
mailing list