[Bioc-sig-seq] Parallel version of the Biostrings::read.DNAStringSet and write.XStringSet functions ?

Thu Mar 4 00:28:20 CET 2010

Quoting Sirisha Sunkara <SSunkara at lbl.gov>:

> Hello,
>
> A newbie question: is there a parallel version available to work on   
> large fasta files, for these functions already?

Hi Sirisha --

In general no, these (and other R) functions are not parallelized. The
usual strategy would be to write a script that operates on one file or
other 'chunk' of data, and then use one of snow ('easiest'), multicore
(best for multiple cores on a linux computer), or Rmpi (computation
distributed across clusters) to do a version of 'lapply' (e.g.,
mclapply from multicore, mpi.parLapply from Rmpi) that is distributed
across cores / nodes.
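
As a rough sketch of that chunk-and-lapply pattern (not a tested recipe
for your data): the example below assumes the 'parallel' package that
ships with R from 2.14.0 onward, which provides mclapply as multicore
does. The file names and the countBases worker are hypothetical
stand-ins; with Biostrings the worker would call read.DNAStringSet on
each file instead of readLines.

```r
library(parallel)  # assumption: R >= 2.14.0; provides mclapply, as multicore does

# Hypothetical example data: three small FASTA files in a temporary directory.
chunkDir <- tempfile("fasta_chunks")
dir.create(chunkDir)
files <- file.path(chunkDir, sprintf("chunk%d.fasta", 1:3))
for (f in files) {
    writeLines(c(">seq1", "ACGTACGT", ">seq2", "GGCC"), f)
}

# Hypothetical per-file worker: total number of bases in one FASTA file.
# Header lines (starting with '>') are dropped before counting characters.
countBases <- function(file) {
    lines <- readLines(file)
    sum(nchar(lines[!grepl("^>", lines)]))
}

# mclapply forks worker processes, which is not supported on Windows,
# so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One call to the worker per file, distributed across the cores.
result <- mclapply(files, countBases, mc.cores = cores)
names(result) <- basename(files)
```

Each worker only ever sees one file, so the same countBases function
can be handed to a snow or Rmpi 'lapply' variant without changes.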

For read/write.*StringSet, the basic limitation is disk i/o, and you  
might investigate where your data resides relative to the computer  
doing the analysis, e.g., data on a networked file system can have  
significant latency. Also, parallelizing on a single machine (e.g.,  
multiple cores) means that the resources of that machine are used by  
several processes, so one might expect to quickly run into memory or
i/o throughput limitations.
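
One crude way to see whether a read is waiting on the disk rather than
the CPU is to compare elapsed time with CPU time from system.time. This
is only a sketch: the temporary file is a hypothetical stand-in for your
data, readLines stands in for read.DNAStringSet, and a small file that
the operating system has already cached will show almost no wait.

```r
# Hypothetical stand-in for a real FASTA file: 1000 short records.
fasta <- tempfile(fileext = ".fasta")
writeLines(rep(c(">seq", "ACGTACGTACGT"), 1000), fasta)

# Time a single read of the file.
timing <- system.time(lines <- readLines(fasta))

# CPU time used by this R process (user + system) versus wall-clock time.
cpu <- timing[["user.self"]] + timing[["sys.self"]]
wait <- timing[["elapsed"]] - cpu

# A 'wait' that is large relative to 'cpu' suggests the read is limited
# by i/o, in which case extra cores on the same machine are unlikely to help.
print(c(cpu = cpu, elapsed = timing[["elapsed"]], wait = wait))
```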

Martin

>
> Thank You,
>
> --sirisha
>
>> sessionInfo()
> R version 2.10.1 (2009-12-14)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Biostrings_2.14.12 IRanges_1.4.11
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.6.1