[R-pkgs] ff version 2.2.0
Jens Oehlschlägel
jens.oehlschlaegel at truecluster.com
Fri Oct 1 17:36:04 CEST 2010
Dear R community,
The next release of package ff is available on CRAN. With the kind help of Brian Ripley, it now supports the Win64 and Sun versions of R. It has three major functional enhancements:
a) new fast in-memory sorting and ordering functions (single-threaded)
b) ff now supports on-disk sorting and ordering of ff vectors and ffdf dataframes
c) ff integer vectors now can be used as subscripts of ff vectors and ffdf dataframes
a) is achieved by careful implementation of NA-handling and exploiting context information
b) although ff objects are permanently stored on disk, sorting and ordering them can be faster than the standard routines in R
c) applying an order to ff vectors and ffdf dataframes is substantially slower than in pure R because it involves disk-access AND sorting index positions (to avoid random access).
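The access pattern behind c) can be illustrated in plain R. This is only a sketch of the idea, not the code ff uses internally: the vector x stands in for data on disk and chunk for one batch of the order being applied. Both names are made up for this example.

```r
x <- c(50, 10, 40, 30, 20)   # stand-in for values stored on disk
o <- order(x)                # the order to apply: 2 5 4 3 1

chunk <- o[1:4]              # one batch of requested positions
p <- order(chunk)            # sort the index positions within the batch
vals <- x[chunk[p]]          # read in ascending position order (no random access)

out <- numeric(length(chunk))
out[p] <- vals               # put the values back into the requested order

stopifnot(identical(out, x[chunk]))
```

The extra order(chunk) step is the "sorting index positions" cost mentioned above: it is paid once per batch in exchange for sequential reads.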
There is still room for improvement; however, the current status should already be useful. I ran some comparisons with SAS (see end of mail):
- both could sort data of German census size (81e6 rows) on a 3GB notebook
- ff sorts and orders faster on single columns
- sorting big multicolumn-tables is faster in SAS
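To put these bullets into numbers, the following R sketch recomputes a few speed ratios from the 81e6-row timings in the P.P.S. at the end of this mail. The figures are copied from those tables; the object names are illustrative only.

```r
# Timings in seconds at 81e6 rows, copied from the tables in the P.P.S.
single <- rbind(
  ff  = c(rinteger = 8.54,  rdouble = 44.13),
  SAS = c(rinteger = 44.94, rdouble = 46.56)
)
multi <- rbind(
  SAS = c(`1` = 64.78,  `20` = 408.72),
  ff  = c(`1` = 167.52, `20` = 884.03)
)

# Ratio > 1 means ff was faster than SAS when sorting a single column
sas_over_ff_single <- single["SAS", ] / single["ff", ]

# Ratio > 1 means SAS was faster than ff when sorting the multicolumn table
ff_over_sas_multi <- multi["ff", ] / multi["SAS", ]

print(round(sas_over_ff_single, 2))
print(round(ff_over_sas_multi, 2))
```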
Win64 binaries and version 2.2.1 supporting Sun should appear on CRAN within the next few days. For the impatient: check out revision 67 or higher from R-Forge.
Non-Windows users: please note that you need to set appropriate values for options 'ffbatchbytes' and 'ffmaxbytes' yourself.
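A minimal sketch of setting both options by hand. The byte values below are assumptions chosen for illustration, not recommendations; choose limits that fit the RAM of your machine.

```r
# Example values only (assumed): adjust to the RAM available on your system
options(
  ffbatchbytes = 16 * 2^20,    # RAM limit used in chunked processing (16 MB here)
  ffmaxbytes   = 512 * 2^20    # larger RAM limit for procedures like sorting (512 MB here)
)

# Check what is currently set
getOption("ffbatchbytes")
getOption("ffmaxbytes")
```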
Note that virtual window support is deprecated now because it leads to overly complex code. Let us know if you urgently need this and why.
Feedback, ideas and contributions appreciated. To those who offered code during the last months: please forgive us that integrating and documenting was not possible with this release.
Jens & Daniel
P.S. NEWS
CHANGES IN ff VERSION 2.2.0
NEW FEATURES
o ff now supports the 64 bit Windows and Sun versions of R
(thanks to Brian Ripley)
o ff now supports sorting and ordering of ff vectors and dataframes
(see ramsort, ffsort, ffdfsort, ramorder, fforder, ffdforder)
o ff now supports ff vectors as subscripts of ff objects
(currently positive integers only, booleans are planned)
o New option 'ffmaxbytes' which allows certain ff procedures like sorting
to use a larger RAM limit than 'ffbatchbytes' allows in chunked processing.
Such a higher limit is useful for (single-R-process) sorting compared to
some multi-R-process chunked processing. It is a good idea to reduce
'ffmaxbytes' on slaves or avoid ff sorting there completely.
o New generic 'pagesize' with method 'pagesize.ff' which returns the
current pagesize as defined on opening the ff object.
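A minimal usage sketch of two of the sorting and ordering functions named above, assuming package ff is installed. The tiny vector is made up for illustration; real use cases are vectors too large for RAM. The code is guarded so that it does nothing when ff is not available.

```r
if (requireNamespace("ff", quietly = TRUE)) {
  library(ff)

  x <- ff(c(3L, 1L, 2L))   # small on-disk integer vector (illustrative data)

  s <- ffsort(x)           # sorted copy, itself an ff vector
  o <- fforder(x)          # ordering permutation, stored as an ff integer vector

  print(s[])               # [] reads the ff vector into RAM
  print(o[])
}
```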
USER VISIBLE CHANGES
o [.ff now returns with the same vmode as the ff-object
o Certain operations are faster now because we worked around
unnecessary copying triggered by many of R's assignment functions.
For example reading a factor from a (well-cached) file is now 20%
faster and thus as fast as just creating this factor in-RAM using
levels()<- and class()<- assignments.
(consider this tuning temporary, hoping for a generic fix in base R)
o ff() can now open files larger than .Machine$integer.max elements
(but gives access only to the first .Machine$integer.max elements)
o ff now has default pattern NULL translating to the pattern in 'filename'
(and only to the previous default 'ff' if no filename is given)
o ff now sets the pattern in synch with a requested 'filename'
o clone.ff now always creates a file consistent with the previous pattern
o clone.ff now always creates a finalizer consistent with the file location
o clone.ffdf has a new argument 'nrow' which allows creating an empty copy
with a different number of rows (currently requires 'initdata=NULL')
o clone.default now deep-copies lists and atomic vectors
DEPRECATED
o virtual window support is deprecated. Let us know if you urgently need this and why.
BUG FIXES
o read.table.ffdf now also works if transFUN filters and returns fewer rows
BUG FIXES at 2.1.4
o [<-.ffdf no longer calculates the number of elements in an ffdf
which could lead to an integer overflow
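The kind of overflow meant here can be reproduced in base R. The dimensions below are made up for illustration, and this is not the actual fix applied in ff; it only shows why multiplying row and column counts as integers is risky.

```r
nr <- 50000L   # hypothetical number of rows
nc <- 50000L   # hypothetical number of columns

# Integer arithmetic yields NA (with a warning) beyond .Machine$integer.max
n_int <- suppressWarnings(nr * nc)
print(is.na(n_int))

# Doing the arithmetic in double precision avoids the overflow
n_dbl <- as.double(nr) * nc
print(n_dbl > .Machine$integer.max)
```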
BUG FIXES at 2.1.3
o ffsafe now always closes ffdf objects - also partially closed ones
o ffsafe no longer passes arguments 'add' and 'move' to 'save'
o ffsafe and friends now work around the fact that under Windows getwd()
can report the same path in upper and lower case versions.
CHANGES IN bit VERSION 1.1.5
NEW FEATURES
o new utility functions setattr() and setattributes() allow setting attributes
by reference (unlike attr()<- and attributes()<-, i.e. without copying the object)
o new utility unattr() returns copy of input with attributes removed
USER VISIBLE CHANGES
o certain operations like creating a bit object are even faster now: need
half the time and RAM through the use of setattr() instead of attr()<-
o [.bit now decorates its logical return vector with attr(,'vmode')='boolean',
i.e. we retain the information that there are no NAs.
BUG FIXES
o .onLoad() no longer calls installed.packages() which substantially
improves startup time (thanks to Brian Ripley)
P.P.S. Below are some timings in seconds at 3e6, 9e6, 27e6 and 81e6 elements from a Lenovo 410s notebook
(3GB RAM, i5 m520, 2 real cores, 4 hyperthreaded cores, SSD drive, Windows7 32bit)
Legend for software
ram: new in-ram inplace operations receiving enough RAM to optimize for speed, not for memory
ff: new on-disk operations limiting RAM for this operation at ~500MB
R: timings from standard sort() and order()
SAS: timings from SAS 9.2 allowing for multithreaded sorting
Legend for type of random data
rboolean: bi-boolean with 50% FALSE and TRUE
rlogical: tri-boolean with 33% NA, FALSE and TRUE
rubyte: integers from 0..255
rbyte: 33% NA and 67% -127..127
rushort: integers from 0..65535
rshort: 33% NA and 67% -32767..32767
ruinteger: 50% NA and 50% integers
rinteger: random integers
rusingle: 50% NA and 50% singles
rsingle: random singles
rudouble: 50% NA and 50% doubles
rdouble: doubles
rfactor: factor with 64 levels of length 66 (being different at bytes 65 and 66)
rchar: 64 strings of length 66 (being different at bytes 65 and 66)
Legend for abbreviations
OOM: out of memory
OOD: out of disk
NT: not timed because too slow
NA: not available
Results for sorting a single column
=====================================
, , 3e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     0.02     0.03   0.02  0.04    0.02   0.02      0.17     0.11     0.66    0.36     0.66    0.36    0.03     NA
ff      0.25     0.33   0.22  0.25    0.28   0.26      0.38     0.30     1.02    0.65     0.92    0.67    0.39     NA
R         NA     0.35     NA    NA      NA     NA      0.83     0.54       NA      NA     1.28    0.90   64.83  51.20
SAS       NA       NA     NA    NA      NA     NA      1.61     1.32       NA      NA     1.57    1.29      NA  17.01
, , 9e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     0.04     0.07   0.03  0.08    0.03   0.07      0.50     0.31     1.88    0.97     1.87    0.97    0.04     NA
ff      0.72     0.93   0.61  0.73    0.84   0.75      1.08     0.86     2.68    1.62     2.57    1.67    0.78     NA
R         NA     0.90     NA    NA      NA     NA      2.84     1.78       NA      NA     3.51    2.12      NA     NT
SAS       NA       NA     NA    NA      NA     NA      4.99     3.90       NA      NA     4.91    4.48      NA  62.76
, , 27e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     0.10     0.24   0.09  0.23    0.11   0.23      1.58     1.00     6.06    3.15     6.00    3.23    0.16     NA
ff      2.19     2.98   1.92  2.21    2.56   2.31      3.22     2.68     8.49    5.18     8.10    5.35    2.58     NA
R         NA     2.72     NA    NA      NA     NA      9.69     5.80       NA      NA    12.34    6.97      NA     NT
SAS       NA       NA     NA    NA      NA     NA     17.02    12.67       NA      NA    17.05   14.07      NA 176.63
, , 81e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     0.27     0.67   0.28  0.67    0.33   0.72      5.58     3.23       NA      NA       NA      NA    0.49     NA
ff      6.56     9.06   5.93  6.88    8.52   7.15     10.70     8.54    51.35   28.98    70.20   44.13    7.91     NA
R        OOM      OOM    OOM   OOM     OOM    OOM       OOM      OOM      OOM     OOM      OOM     OOM     OOM    OOM
SAS       NA       NA     NA    NA      NA     NA     61.45    44.94       NA      NA    63.14   46.56      NA    OOD
Results for calculating the order on a single column
====================================================
, , 3e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     0.05     0.07   0.04  0.07    0.09   0.11      0.92     0.53     1.46    0.81     1.31    0.64    0.06     NA
ff      0.14     0.19   0.77  0.58    0.87   0.67      1.04     0.60     1.66    0.81     1.43    0.85    0.74     NA
R         NA     3.23     NA    NA      NA     NA      4.57     4.07       NA      NA     5.27    4.61    4.59 193.75
SAS       NA       NA     NA    NA      NA     NA      1.86     1.48       NA      NA     1.63    1.39      NA  16.83
, , 9e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     0.16     0.21   0.17  0.20    0.30   0.28      3.07     1.61     4.24    2.16     4.22    2.19    0.19     NA
ff      0.48     0.51   2.45  1.84    2.91   2.15      3.38     1.92     4.72    2.48     4.54    2.45    1.91     NA
R         NA    12.31     NA    NA      NA     NA     17.02    15.56       NA      NA    16.96   15.47      NT     NT
SAS       NA       NA     NA    NA      NA     NA      6.71     5.97       NA      NA     6.25    5.41      NA  59.27
, , 27e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     0.51     0.67    0.5  0.69    0.92   0.94      9.89     5.31    15.13    7.69    15.15    7.70    0.58     NA
ff      1.33     1.51    7.6  5.77    9.25   6.79     10.72     6.12    15.98    8.53    15.96    8.92    5.80     NA
R         NA    46.37     NA    NA      NA     NA     65.57    59.17       NA      NA    63.74   58.37      NT     NT
SAS       NA       NA     NA    NA      NA     NA     21.41    18.77       NA      NA    20.22   18.84      NA 182.74
, , 81e6
    rboolean rlogical rubyte rbyte rushort rshort ruinteger rinteger rusingle rsingle rudouble rdouble rfactor  rchar
ram     1.49     2.03    1.5  2.06    3.15   2.98     34.33    17.89       NA      NA       NA      NA    1.90     NT
ff      3.98     4.65   22.9 17.42   30.33  21.82     36.68    20.36    77.16   49.55   125.01   59.27   17.39     NT
R        OOM      OOM    OOM   OOM     OOM    OOM       OOM      OOM      OOM     OOM      OOM     OOM     OOM    OOM
SAS       NA       NA     NA    NA      NA     NA     86.24    70.32       NA      NA    84.40   68.66      NA     NA
Results for sorting all columns of a table with m columns of random double data (without NAs)
=============================================================================================
, , 3e6
ncol      1      2      5     10     20
SAS    1.65   1.83   3.71   6.90  14.06
ff     1.97   2.37   3.75   6.21  10.86
R      4.70   5.67   5.65   6.46   8.06
, , 9e6
ncol      1      2      5     10     20
SAS    5.18   6.70  14.02  19.25  41.65
ff     6.38   7.96  12.12  19.58  45.43
R     18.86  19.20  20.58    OOM    OOM
, , 27e6
ncol      1      2      5     10     20
SAS   17.79  19.52  35.03  83.30 142.09
ff    22.68  25.79  46.25  87.55 157.62
R     65.56    OOM    OOM    OOM    OOM
, , 81e6
ncol      1      2      5     10     20
SAS   64.78  83.39 143.59 242.23 408.72
ff   167.52 220.03 324.03 502.42 884.03
R       OOM    OOM    OOM    OOM    OOM