[Bioc-sig-seq] Read big fastq files in chunks

Anita Lerch anita.lerch at fmi.ch
Tue Sep 27 11:39:05 CEST 2011


Hi,
I should filter a big fastq file (HiSeq ca. >30'000'000, which does not
fit in memory) for low qualities e.g. to many N in the sequence.

I plan to read the file in chunks, filter the ShortReads and write them
into a new file.
This is no problem for fasta files, see the example:

filename <- "data/DmS2DRSC_RNA_seqs_subsample.fa"
eof <- FALSE
append <- FALSE
cycle <- 1L
while(!eof){
  chunk <- readFasta(filename, nrec=nrec, skip=(cycle-1)*nrec)
  nFilt <- nFilter(5)
  writeFasta(chunk[nFilt(chunk)],
             sprintf("%s/filtered_%s", 
                     dirname(filename), basename(filename)),
             append=append)
  append <- TRUE
  cycle <- cycle + 1
  if(length(chunk) == 0L)
      eof <- TRUE
}

If I try to do the same for fastq file e.g.

filename <- "data/test_data.fq"
eof <- FALSE
mode <- "w"
cycle <- 1L
while(!eof){
  chunk <- readFastq(filename)
  nFilt <- nFilter(5)
  writeFastq(chunk[nFilt(chunk)],
             sprintf("%s/filtered_%s", 
                     dirname(filename), basename(filename)),
             mode=mode)
  mode <- "a"
  cycle <- cycle + 1
  if(length(chunk) == 0L)
      eof <- TRUE
  }

I facing some problems when I read the ShortReads:
- read.DNAStringSet ignores the quality sequence. 
- readFastq can't read in chunks
- FastqSampler choose a random sample of ShortRead

Of course I can do a workaround e.g writing my own fastq parser, but I
prefer a clean solution or to solve the problem.
So my questions, is there a clean solution I'm not aware of it?
If I'm right and there is this problem, what is the plan of the
community to solve this problem or is this already work in progress?
Can I help to solve this problem e.g. rewriting the readFastq()
function.

Thanks for the answer.
Greetings
Anita

> sessionInfo()
R Under development (unstable) (2011-09-20 r57033)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ShortRead_1.11.37    latticeExtra_0.6-18  RColorBrewer_1.0-5   Rsamtools_1.5.65    
[5] lattice_0.19-33      Biostrings_2.21.9    GenomicRanges_1.5.47 IRanges_1.11.26     

loaded via a namespace (and not attached):
 [1] Biobase_2.13.9       BiocInstaller_1.1.27 BSgenome_1.21.3      grid_2.14.0         
 [5] hwriter_1.3          QuasR_0.1.1          RCurl_1.6-10         rtracklayer_1.13.13 
 [9] tools_2.14.0         XML_3.4-3            zlibbioc_0.1.7    
 
-- 
Anita Lerch
Friedrich Miescher Institute
Maulbeerstrasse 66
WRO-1066.P22
4058 Basel
Phone: +41 (0)61 697 5172
Email: anita.lerch at fmi.ch



More information about the Bioc-sig-sequencing mailing list