[Rd] Change in the RNG implementation?
Hervé Pagès
hpages at fhcrc.org
Mon Oct 22 08:02:50 CEST 2012
Hi Duncan, Martin,
Thanks for your answers.
For my real case I was generating millions of random positions
on a genome.
I compared the performance of sample() (which calls sample.int())
between R-2.15.1 and R-devel. For me it performs better in R-2.15.1:
almost 3x faster (about 5.2 s vs 14.9 s elapsed) and with lower peak
memory use (max used Vcells 1122.6 Mb vs 1388.8 Mb):
With R-2.15.1:
> set.seed(33)
> system.time(random_chrom_pos <- sample(199000666L, 95000777L))
   user  system elapsed
  4.964   0.268   5.242
> gc()
           used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   137285   7.4     350000   18.7    350000   18.7
Vcells 47633785 363.5  154735917 1180.6 147135703 1122.6
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
With R-devel:
> set.seed(33)
> system.time(random_chrom_pos <- sample(199000666L, 95000777L))
   user  system elapsed
 14.532   0.296  14.854
> gc()
           used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   145525   7.8     350000   18.7    350000   18.7
Vcells 47644082 363.5  152959996 1167.0 182023372 1388.8
> sessionInfo()
R Under development (unstable) (2012-10-02 r60861)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
FWIW, both my R-2.15.1 and R-devel were configured with
--disable-byte-compiled-packages; otherwise I used all the
defaults. My system is a standard Ubuntu 12.04 installation
with no custom settings or tweaks.
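
A minimal sketch of how the measurement above could be wrapped in a
small function, so that the same script can be sourced in both R
versions and the numbers compared. The helper name time_sample() and
the fields it returns are assumptions made for illustration, not
something used in this thread, and the call at the end uses a much
smaller n than the real case so that it runs quickly.

```r
# Hypothetical helper for illustration; not part of base R.
time_sample <- function(n, size, seed = 33L) {
  set.seed(seed)
  invisible(gc(reset = TRUE))            # reset the "max used" counters
  timing <- system.time(x <- sample.int(n, size))
  mem <- gc()                            # matrix with rows Ncells and Vcells
  list(
    r_version   = R.version.string,
    elapsed     = timing[["elapsed"]],
    max_used_mb = sum(mem[, ncol(mem)]), # last column is "max used" in Mb
    n_drawn     = length(x)
  )
}

# Small illustrative call; the real case used n = 199000666, size = 95000777.
res <- time_sample(2000000L, 1000000L)
str(res)
```

Resetting the counters with gc(reset = TRUE) before the timed call
keeps the "max used" figure tied to the sample.int() call rather than
to whatever ran earlier in the session.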
Thanks,
H.
On 10/20/2012 12:50 PM, Martin Maechler wrote:
>>>>>> Duncan Murdoch <murdoch.duncan at gmail.com>
>>>>>> on Fri, 19 Oct 2012 19:26:39 -0400 writes:
>
> > On 12-10-19 7:04 PM, Hervé Pagès wrote:
> >> Hi,
> >>
> >> Looks like the implementation of random number generation changed in
> >> R-devel with respect to R-2.15.1.
> >>
> >> With R-2.15.1:
> >>
> >> > set.seed(33)
> >> > sample(49821115, 10)
> >> [1] 22217252 19661919 24099911 45779422 42043111 25774933 21778053 17098516
> >> [9]   773073  5878451
> >>
> >> With recent R-devel:
> >>
> >> > set.seed(33)
> >> > sample(49821115, 10)
> >> [1] 22217252 19661919 24099912 45779425 42043115 25774935 21778056 17098518
> >> [9]   773073  5878452
> >>
> >> This is on a 64-bit Ubuntu system.
> >>
> >> Is this change intended? I didn't see anything in the NEWS file.
> >>
> >> A potential problem with this is that it will break unit tests
> >> for algorithms that make use of RNG.
> >>
> >> Another more practical problem (at least for me) is the following:
> >> Bioconductor package maintainers are sometimes working hard on the
> >> development version of their package to improve the performance of
> >> some key functions. Comparing performance between BioC release
> >> (based on R-2.15) and devel (based on R-devel) often requires big
> >> input data that is randomly generated, because it's easier than
> >> working with real data. Typically a small script is written that
> >> takes care of loading the required packages, generating the input
> >> data, and running a simple analysis. The same script is sourced in
> >> R-2.15 and R-devel, and performance and results are compared.
> >>
> >> Not being able to generate exactly the same input in the script is
> >> a problem. It can be worked around by generating the input once,
> >> serializing it, and using load() in the script, but that makes things
> >> more complicated and the script is not a standalone script anymore
> >> (cannot be passed around without also passing around the big .rda
> >> file).
> >>
> >> Thanks,
> >> H.
> >>
>
> > I think it was mentioned in the NEWS:
>
> > \code{sample.int()} has some support for \eqn{n \ge
> > 2^{31}}{n >= 2^31}: see its help for the limitations.
>
> > A different algorithm is used for \code{(n, size, replace = FALSE,
> > prob = NULL)} for \code{n > 1e7} and \code{size <= n/2}. This
> > is much faster and uses less memory, but does give different results.
>
> So, to reiterate: The RNG has not been changed at all,
> but sample() has, for extreme cases (large n) like yours.
>
> > I don't think the old algorithm is available, but perhaps it could be
> > made available by an optional parameter.
>
> I do think we should ideally add such an option or probably
> rather allow the more thorough way of either using
> RNGversion(..) or something similar to set sample()'s behavior
> to exactly as previously.
> Doing this "globally" is really needed, as sample() may be called from a
> function (from a function from a function) that is not in the
> programmer's hands, and so the programmeR could not even
> set the new optional argument if he found out that he had to.
>
> Honestly, I'm surprised Hervé found a real case where the
> difference is visible.
>
> Martin
>
>
> > Duncan Murdoch
>
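
The workaround described in the quoted message (generate the input
once, serialize it, and load() it in the script) could look roughly
like the sketch below. The file name and its location under tempdir()
are assumptions made for illustration; in practice the .rda file would
live somewhere both R installations can read.

```r
# Step 1 (run once, e.g. under the release version of R):
# generate the random input and serialize it.
rda_file <- file.path(tempdir(), "random_chrom_pos.rda")  # hypothetical path
set.seed(33)
random_chrom_pos <- sample(49821115L, 10L)
original <- random_chrom_pos          # kept only to check the round trip
save(random_chrom_pos, file = rda_file)

# Step 2 (in the benchmark script, under any R version):
# restore the object instead of regenerating it.
rm(random_chrom_pos)
load(rda_file)                        # recreates 'random_chrom_pos'
stopifnot(identical(random_chrom_pos, original))
```

This gives identical input regardless of how sample() behaves in each
version, at the cost noted in the quoted message: the script is no
longer standalone, because the .rda file has to travel with it.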
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319