[Rd] [RFC] A case for freezing CRAN

Tom Short tshort.rlists at gmail.com
Fri Mar 21 17:16:12 CET 2014


For me, the most important aspect is being able to reproduce my own
work. Some other tools offer interesting approaches to managing
packages:

* NPM -- The Node Package Manager for Node.js loads a local copy of
all packages and dependencies. This helps ensure reproducibility and
avoids dependency issues. Different projects in different directories
can then use different package versions.

* Julia -- Julia's package manager is based on git, so users should
have a local copy of all package versions they've used. Theoretically,
you could use separate git repos for different projects, and merge as
desired.

I've thought about putting my local R library into a git repository.
Then, I could clone that into a project directory and use
.libPaths(".Rlibrary")  in a .Rprofile file to set the library
directory to the clone. In addition to handling package versions, this
might be nice for installing packages that are rarely used (my library
directory tends to get cluttered if I start trying out packages).
Another addition could be a local script that starts a specific
version of R.

For now, I don't have much incentive to do this. For the packages that
I use, R's been pretty good to me with backwards compatibility.

I do like the idea of a CRAN mirror that's under version control.




On Tue, Mar 18, 2014 at 4:24 PM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
> This came up again recently with an irreproducible paper. Below an
> attempt to make a case for extending the r-devel/r-release cycle to
> CRAN packages. These suggestions are not in any way intended as
> criticism on anyone or the status quo.
>
> The proposal described in [1] is to freeze a snapshot of CRAN along
> with every release of R. In this design, updates for contributed
> packages treated the same as updates for base packages in the sense
> that they are only published to the r-devel branch of CRAN and do not
> affect users of "released" versions of R. Thereby all users, stacks
> and applications using a particular version of R will by default be
> using the identical version of each CRAN package. The bioconductor
> project uses similar policies.
>
> This system has several important advantages:
>
> ## Reproducibility
>
> Currently r/sweave/knitr scripts are unstable because of ambiguity
> introduced by constantly changing cran packages. This causes scripts
> to break or change behavior when upstream packages are updated, which
> makes reproducing old results extremely difficult.
>
> A common counter-argument is that script authors should document
> package versions used in the script using sessionInfo(). However even
> if authors would manually do this, reconstructing the author's
> environment from this information is cumbersome and often nearly
> impossible, because binary packages might no longer be available,
> dependency conflicts, etc. See [1] for a worked example. In practice,
> the current system causes many results or documents generated with R
> no to be reproducible, sometimes already after a few months.
>
> In a system where contributed packages inherit the r-base release
> cycle, scripts will behave the same across users/systems/time within a
> given version of R. This severely reduces ambiguity of R behavior, and
> has the potential of making reproducibility a natural part of the
> language, rather than a tedious exercise.
>
> ## Repository Management
>
> Just like scripts suffer from upstream changes, so do packages
> depending on other packages. A particular package that has been
> developed and tested against the current version of a particular
> dependency is not guaranteed to work against *any future version* of
> that dependency. Therefore, packages inevitably break over time as
> their dependencies are updated.
>
> One recent example is the Rcpp 0.11 release, which required all
> reverse dependencies to be rebuild/modified. This updated caused some
> serious disruption on our production servers. Initially we refrained
> from updating Rcpp on these servers to prevent currently installed
> packages depending on Rcpp to stop working. However soon after the
> Rcpp 0.11 release, many other cran packages started to require Rcpp >=
> 0.11, and our users started complaining about not being able to
> install those packages. This resulted in the impossible situation
> where currently installed packages would not work with the new Rcpp,
> but newly installed packages would not work with the old Rcpp.
>
> Current CRAN policies blame this problem on package authors. However
> as is explained in [1], this policy does not solve anything, is
> unsustainable with growing repository size, and sets completely the
> wrong incentives for contributing code. Progress comes with breaking
> changes, and the system should be able to accommodate this. Much of
> the trouble could have been prevented by a system that does not push
> bleeding edge updates straight to end-users, but has a devel branch
> where conflicts are resolved before publishing them in the next
> r-release.
>
> ## Reliability
>
> Another example, this time on a very small scale. We recently
> discovered that R code plotting medal counts from the Sochi Olympics
> generated different results for users on OSX than it did on
> Linux/Windows. After some debugging, we narrowed it down to the XML
> package. The application used the following code to scrape results
> from the Sochi website:
>
> XML::readHTMLTable("http://www.sochi2014.com/en/speed-skating", which=2, skip=1)
>
> This code was developed and tested on mac, but results in a different
> winner on windows/linux. This happens because the current version of
> the XML package on CRAN is 3.98, but the latest mac binary is 3.95.
> Apparently this new version of XML introduces a tiny change that
> causes html-table-headers to become colnames, rather than a row in the
> matrix, resulting in different medal counts.
>
> This example illustrates that we should never assume package versions
> to be interchangeable. Any small bugfix release can have side effects
> altering results. It is impossible to protect code against such
> upstream changes using CMD check or unit testing. All R scripts and
> packages are really only developed and tested for a single version of
> their dependencies. Assuming anything else makes results
> untrustworthy, and code unreliable.
>
> ## Summary
>
> Extending the r-release cycle to CRAN seems like a solution that would
> be easy to implement. Package updates simply only get pushed to the
> r-devel branches of cran, rather than r-release and r-release-old.
> This separates development from production/use in a way that is common
> sense in most open source communities. Benefits for R include:
>
> - Regular R users (statisticians, researchers, students, teachers) can
> share their homemade scripts/documents/packages and rely on them to
> work and produce the same results within a given version of R, without
> manual efforts to manage package versions.
>
> - Package authors can publish breaking changes to the devel branch
> without causing major disruption or affecting users and/or
> maintainers. Authors of depending packages have a timeframe to sync
> their package with upstream changes before the next release.
>
> - CRAN maintainers can focus quality control and testing efforts on
> the devel branch around the time of the code freeze. No need for
> crisis management when a package update introduces some severe
> breaking changes. Users of released versions are unaffected.
>
>
> [1] http://journal.r-project.org/archive/2013-1/ooms.pdf
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



More information about the R-devel mailing list