[Bioc-sig-seq] adapter removal

Fri Jan 9 00:12:01 CET 2009

Here's another "quick and dirty" ShortReadQ-fastq exporter. It's very slow...

exportFastaQ <- function(shortReadQObject, fileName) {
    ## dump ShortReadQ object data to FASTQ format
    entries <- length(shortReadQObject)
    i <- 1
    while (i <= entries){
        cat(append("@", as.character(id(shortReadQObject)[[i]])), sep="",
file=fileName, append=TRUE)
        cat("\n", file=fileName, append=TRUE)
        cat(as.character(sread(shortReadQObject)[i]), sep="",
file=fileName, append=TRUE)
        cat("\n", file=fileName, append=TRUE)
        cat(append("+", as.character(id(shortReadQObject)[[i]])), sep="",
file=fileName, append=TRUE)
        cat("\n", file=fileName, append=TRUE)
        cat(as.character(quality(shortReadQObject)[[i]]), sep="",
file=fileName, append=TRUE)
        cat("\n", file=fileName, append=TRUE)
                i <- i + 1
        }
}

Sincerely,
Tyler William H Backman
Bioinformatics Analyst
Institute for Integrative Genome Biology (IIGB)
Department of Botany and Plant Sciences
E-mail: tyler.backman at ucr.edu
1007 Noel T. Keen Hall
University of California
Riverside, CA 92521

> Hi Jim --
>
> "James W. MacDonald" <jmacdon at med.umich.edu> writes:
>
>> As a follow-up question, once you have removed the adapter sequences
>> is it then possible to write the ShortReadQ object back out in a fastq
>> format for aligning with e.g., bowtie? I know I can use
>> write.XstringSet() to output in fasta format, but I would like to feed
>> bowtie the quality scores as well.
>
> Actually a 'writeFastq' is on my 'do this week' list. In the mean time
> my best (untested) suggestion is along the lines of
>
> n <- length(fq)
> lines <- character(n)
> lines[(seq_len(n)) * 4 - 3] <-
>    paste("@", as.character(id(fq)), sep="")
> lines[(seq_len(n)) * 4 - 2] <-
>    as.character(sread(rfq1))
> lines[(seq_len(n)) * 4 - 1] <-
>     paste("+", as.character(id(fq)), sep="")
> lines[(seq_len(n)) * 4] <-
>     as.character(quality(quality(fq)))
> writeLines(lines, "my.fastq")
>
> This will be unnecessarily memory intensive.
>
>> Or is the speed reasonably comparable to use the PDict functions in
>> Biostrings to do the aligning?
>
> Bowtie is very fast. One potential 'gotcha' is the orientation of
> reads throughout the workflow -- Eland _sequence.txt files are the
> reads as they come from the machine; MAQ (and I think Bowtie) aligned
> files report reads that align to the '-' strand in their reverse
> complement sequence, with qualities reversed to match the reverse
> compelement read, and with the alignment position recorded at the
> 'left-most' (rather than 5', say) end. Not that it will be a problem
> in the current use case, just something to be aware of as you move
> between tools.
>
> Martin
>
>> Best,
>>
>> Jim
>>
>>
>>
>> Martin Morgan wrote:
>>> "David A.G" <dasolexa at hotmail.com> writes:
>>>
>>>> Dear list,
>>>>
>>>> I have some experience with Bioconductor but am newbie to this list
>>>> and to NGS. I am trying to remove some adapters from my solexa
>>>> s_N_sequence.txt file using Biostrings and ShortRead packages and the
>>>> vignettes.  I managed to read in the text file and got to save the
>>>> reads as follows
>>>>
>>>> fqpattern <- "s_4_sequence.txt"
>>>> f4 <- file.path(analysisPath(sp), fqpattern)
>>>> fq4 <- readFastq(sp, fqpattern)
>>>> reads <- sread(fq4)  #"reads" contains more than 4 million 34-length
>>>> fragments
>>>>
>>>> Having the following adapter sequence:
>>>>
>>>> adapter <- DNAString("ACGGATTGTTCAGT")
>>>>
>>>> I tried to mimic the example in the Biostring vignette as follows:
>>>>
>>>>
>>>> myAdapterAligns <- pairwiseAlignment(reads, adapter, type = "overlap")
>>>>
>>>> but after more than two hours the process is still running.
>>>>
>>>> I am running R 2.8.0 on a 64bit linux machine (Kubuntu 2.6.24) with
>>>> 4Gb RAM, and I only have some 30Mb free RAM left. I found a thread on
>>>> adapter removal but does not clear things much to me, since as far as
>>>> I understood the option mentioned in the thread is not appropriate
>>>> (quote :(though apparently this is not entirely satisfactory, see the
>>>> second entry!)).
>>>>
>>>> Is this just a memory issue or am I doing something wrong? Shall I
>>>> leave the process to run for longer?
>>> As a bit of a follow-up, a reasonable strategy is to figure out how
>>> long / how much memory the calculation takes on smaller chunks of
>>> data.
>>> Also, here I 'clean' (remove reads with 'N's) and then calculate the
>>> srdistance to your adapter on a lane with 5978132 reads in about a
>>> minute
>>>
>>>> fq <- readFastq(sp, "s_1_sequence.txt")
>>>> fq
>>> class: ShortReadQ
>>> length: 5978132 reads; width: 53 cycles
>>>> system.time(dist <- srdistance(clean(fq), "ACGGATTGTTCAGT"))
>>>    user  system elapsed  60.960   0.000  60.961 I might then
>>> tabulate the distances
>>>
>>>> table(dist[[1]])
>>> and subset fq based on some criterion
>>> fq[dist[[1]] > 4]
>>> There's a second interface to this, too, using filters, e.g.,
>>> fq[srdistanceFilter("ACGGATTGTTCAGT", 4)(fq)]
>>> Martin
>>>
>>>> TIA for your help,
>>>>
>>>> Dave
>>>>
>>>> _________________________________________________________________
>>>> Show them the way! Add maps and directions to your party invites.
>>>>
>>>> 	[[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> Hildebrandt Lab
>> 8220D MSRB III
>> 1150 W. Medical Center Drive
>> Ann Arbor MI 48109-5646
>> 734-936-8662
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M2 B169
> Phone: (206) 667-2793
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>