[Bioc-sig-seq] adapter removal

Thu Jan 8 18:25:09 CET 2009

Dave,
Before you look for an external solution, I recommend you try the 
vcountPattern function in Biostrings. In a training course in November 
2008 we showed how efficiently vcountPattern finds adapter-like reads.

http://bioconductor.org/workshops/2008/SeattleNov08/MatchAlign/MatchAlign.pdf

See pages 7-10.

As a first approximation, I would guess your code would look something like:

> adapter <- DNAString("ACGGATTGTTCAGT")
> prefix <- subseq(myReads, start = 1, width = nchar(adapter))
> suffix <- subseq(myReads, end = width(myReads), width = nchar(adapter))
> # a logical index stays correct even when no read matches the adapter
> hasAdapter <- (vcountPattern(adapter, prefix, max.mismatch = 1) +
                 vcountPattern(adapter, suffix, max.mismatch = 1)) > 0
> nonAdapterReads <- myReads[!hasAdapter]
> adapterReads <- myReads[hasAdapter]
> adapterReads
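
To make the idea concrete, here is a minimal self-contained sketch on toy data. The four reads and the names toyReads and hasAdapter are made up for illustration and are not from the original data. It assumes the Biostrings package is installed and that every read is at least as long as the adapter.

```r
library(Biostrings)

adapter <- DNAString("ACGGATTGTTCAGT")

# Hypothetical toy reads, all 34 nt long like the Solexa reads in this thread
toyReads <- DNAStringSet(c(
  exactPrefix    = paste0("ACGGATTGTTCAGT", strrep("T", 20)),
  mismatchSuffix = paste0(strrep("G", 20), "ACGGATTGTTCAGA"),  # last base differs
  noAdapter      = strrep("C", 34),
  internal       = paste0(strrep("A", 10), "ACGGATTGTTCAGT", strrep("A", 10))
))

# Adapter-length windows at the start and at the end of every read
prefix <- subseq(toyReads, start = 1, width = length(adapter))
suffix <- subseq(toyReads, end = width(toyReads), width = length(adapter))

# TRUE where the adapter matches either window with at most one mismatch
hasAdapter <- (vcountPattern(adapter, prefix, max.mismatch = 1) +
               vcountPattern(adapter, suffix, max.mismatch = 1)) > 0

adapterReads    <- toyReads[hasAdapter]
nonAdapterReads <- toyReads[!hasAdapter]
print(names(adapterReads))
```

Note that the read named "internal" is not flagged: this approach only inspects the two ends of each read, so an adapter sitting in the middle goes undetected.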

Patrick

Dan Bolser wrote:
> 2009/1/8 David A.G <dasolexa at hotmail.com>:
>   
>> Dear list,
>>
>> I have some experience with Bioconductor but am a newbie to this list and to NGS. I am trying to remove some adapters from my Solexa s_N_sequence.txt file using the Biostrings and ShortRead packages and their vignettes.  I managed to read in the text file and save the reads as follows:
>>
>> fqpattern <- "s_4_sequence.txt"
>> f4 <- file.path(analysisPath(sp), fqpattern)
>> fq4 <- readFastq(sp, fqpattern)
>> reads <- sread(fq4)  #"reads" contains more than 4 million 34-length fragments
>>
>> Having the following adapter sequence:
>>
>> adapter <- DNAString("ACGGATTGTTCAGT")
>>
>> I tried to mimic the example in the Biostrings vignette as follows:
>>
>>
>> myAdapterAligns <- pairwiseAlignment(reads, adapter, type = "overlap")
>>
>> but after more than two hours the process is still running.
>>
>> I am running R 2.8.0 on a 64-bit Linux machine (Kubuntu 2.6.24) with 4 GB RAM, and I only have some 30 MB of free RAM left. I found a thread on adapter removal, but it does not clear things up much for me: as far as I understood, the option mentioned in the thread is not appropriate (quote: "though apparently this is not entirely satisfactory, see the second entry!").
>>
>> Is this just a memory issue or am I doing something wrong? Shall I leave the process to run for longer?
>>
>> TIA for your help,
>>
>> Dave
>>     
>
> Hi Dave
>
> I think a standalone C program may be more appropriate for the task
> you are trying to perform. I'm new to NGS myself, but I believe there
> is a lot of software available to do this. I think the convenience of
> using R naturally results in a performance hit on some intensive
> algorithms.
>
> Try asking your question over here:
>
> http://seqanswers.com/
>
>
> or is there a better mailing list?
>
> Cheers,
>
> Dan.
>
>   
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>