[Bioc-sig-seq] Parallel version of the Biostrings::read.DNAStringSet and write.XStringSet functions ?
mtmorgan at fhcrc.org
Thu Mar 4 00:28:20 CET 2010
Quoting Sirisha Sunkara <SSunkara at lbl.gov>:
> Hello,
>
> A newbie question: is there a parallel version available to work on
> large fasta files, for these functions already?
Hi Sirisha --
In general no, these (and other R) functions are not parallelized. The
usual strategy is to write a script that operates on one file or
other 'chunk' of data, and then use one of snow ('easiest'), multicore
(best for multiple cores on a single linux computer), or Rmpi
(computation distributed across a cluster) to run a version of
'lapply' (e.g., mclapply, mpi.parLapply) that is distributed across
cores / nodes.
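As a rough sketch of the 'one file per task' idea, something along
these lines might work. It assumes mclapply as provided by the
'parallel' package bundled with more recent R releases (the function
originated in multicore), so it relies on forking and is meant for
linux / Unix-alikes. The worker summarizeFasta and the two tiny FASTA
files are made up for illustration; the worker uses only base R so
the example runs on its own, and in a real script it is where
read.DNAStringSet and the actual analysis would go.

```r
library(parallel)  # assumption: supplies mclapply (originally in 'multicore')

## Hypothetical per-file worker: summarizes one FASTA file with base R.
## A real worker would read the file with Biostrings and analyze it.
summarizeFasta <- function(path) {
    lines <- readLines(path)
    isHeader <- grepl("^>", lines)
    c(records = sum(isHeader),             # number of '>' header lines
      bases = sum(nchar(lines[!isHeader])))  # total sequence characters
}

## Two tiny made-up FASTA files so the sketch runs as-is.
files <- c(tempfile(fileext = ".fa"), tempfile(fileext = ".fa"))
writeLines(c(">seq1", "ACGT", ">seq2", "GGCC"), files[1])
writeLines(c(">seq3", "TTTTAA"), files[2])

## lapply-like call; each file is handled by a forked child process.
res <- mclapply(files, summarizeFasta, mc.cores = 2)
names(res) <- basename(files)
print(res)
```

Each task reads its own file, so nothing is shared between workers
except the disk, which is where the limitation discussed next comes in.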
For read/write.*StringSet, the basic limitation is disk i/o, and you
might investigate where your data resides relative to the computer
doing the analysis, e.g., data on a networked file system can have
significant latency. Also, parallelizing on a single machine (e.g.,
multiple cores) means that the resources of that machine are shared by
several processes, so one might expect to quickly run into memory or
i/o throughput limitations.
Martin
>
> Thank You,
>
> --sirisha
>
>> sessionInfo()
> R version 2.10.1 (2009-12-14)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Biostrings_2.14.12 IRanges_1.4.11
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.6.1
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>