[Bioc-sig-seq] adapter removal

Martin Morgan mtmorgan at fhcrc.org
Fri Jan 9 00:04:40 CET 2009


Hi Jim --

"James W. MacDonald" <jmacdon at med.umich.edu> writes:

> As a follow-up question, once you have removed the adapter sequences
> is it then possible to write the ShortReadQ object back out in a fastq
> format for aligning with e.g., bowtie? I know I can use
> write.XstringSet() to output in fasta format, but I would like to feed
> bowtie the quality scores as well.

Actually a 'writeFastq' is on my 'do this week' list. In the mean time
my best (untested) suggestion is along the lines of

n <- length(fq)
lines <- character(n)
lines[(seq_len(n)) * 4 - 3] <- 
   paste("@", as.character(id(fq)), sep="")
lines[(seq_len(n)) * 4 - 2] <- 
   as.character(sread(rfq1))
lines[(seq_len(n)) * 4 - 1] <- 
    paste("+", as.character(id(fq)), sep="")
lines[(seq_len(n)) * 4] <- 
    as.character(quality(quality(fq)))
writeLines(lines, "my.fastq")

This will be unnecessarily memory intensive. 

> Or is the speed reasonably comparable to use the PDict functions in
> Biostrings to do the aligning?

Bowtie is very fast. One potential 'gotcha' is the orientation of
reads throughout the workflow -- Eland _sequence.txt files are the
reads as they come from the machine; MAQ (and I think Bowtie) aligned
files report reads that align to the '-' strand in their reverse
complement sequence, with qualities reversed to match the reverse
compelement read, and with the alignment position recorded at the
'left-most' (rather than 5', say) end. Not that it will be a problem
in the current use case, just something to be aware of as you move
between tools.

Martin

> Best,
>
> Jim
>
>
>
> Martin Morgan wrote:
>> "David A.G" <dasolexa at hotmail.com> writes:
>> 
>>> Dear list,
>>>
>>> I have some experience with Bioconductor but am newbie to this list and to NGS. I am trying to remove some adapters from my solexa s_N_sequence.txt file using Biostrings and ShortRead packages and the vignettes.  I managed to read in the text file and got to save the reads as follows
>>>
>>> fqpattern <- "s_4_sequence.txt"
>>> f4 <- file.path(analysisPath(sp), fqpattern)
>>> fq4 <- readFastq(sp, fqpattern)
>>> reads <- sread(fq4)  #"reads" contains more than 4 million 34-length fragments
>>>
>>> Having the following adapter sequence:
>>>
>>> adapter <- DNAString("ACGGATTGTTCAGT")
>>>
>>> I tried to mimic the example in the Biostring vignette as follows:
>>>
>>>
>>> myAdapterAligns <- pairwiseAlignment(reads, adapter, type = "overlap")
>>>
>>> but after more than two hours the process is still running.
>>>
>>> I am running R 2.8.0 on a 64bit linux machine (Kubuntu 2.6.24) with 4Gb RAM, and I only have some 30Mb free RAM left. I found a thread on adapter removal but does not clear things much to me, since as far as I understood the option mentioned in the thread is not appropriate (quote :(though apparently this is not entirely satisfactory, see the second entry!)).
>>>
>>> Is this just a memory issue or am I doing something wrong? Shall I leave the process to run for longer?
>> As a bit of a follow-up, a reasonable strategy is to figure out how
>> long / how much memory the calculation takes on smaller chunks of
>> data.
>> Also, here I 'clean' (remove reads with 'N's) and then calculate the
>> srdistance to your adapter on a lane with 5978132 reads in about a
>> minute
>> 
>>> fq <- readFastq(sp, "s_1_sequence.txt")
>>> fq
>> class: ShortReadQ
>> length: 5978132 reads; width: 53 cycles
>>> system.time(dist <- srdistance(clean(fq), "ACGGATTGTTCAGT"))
>>    user  system elapsed  60.960   0.000  60.961 I might then
>> tabulate the distances
>> 
>>> table(dist[[1]])
>> and subset fq based on some criterion
>> fq[dist[[1]] > 4]
>> There's a second interface to this, too, using filters, e.g.,
>> fq[srdistanceFilter("ACGGATTGTTCAGT", 4)(fq)]
>> Martin
>> 
>>> TIA for your help,
>>>
>>> Dave
>>>
>>> _________________________________________________________________
>>> Show them the way! Add maps and directions to your party invites. 
>>>
>>> 	[[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> 
>
> -- 
> James W. MacDonald, M.S.
> Biostatistician
> Hildebrandt Lab
> 8220D MSRB III
> 1150 W. Medical Center Drive
> Ann Arbor MI 48109-5646
> 734-936-8662

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793



More information about the Bioc-sig-sequencing mailing list