[Rd] Deep copy of factor levels?
Kirill Müller
kirill.mueller at ivt.baug.ethz.ch
Mon Mar 17 10:13:37 CET 2014
Hi
It seems that selecting an element of a factor copies its levels
(Ubuntu 13.04, R 3.0.2). Below is the output of a script that creates a
factor with 10000 elements and then calls as.list() on it. The resulting
list uses more than 700 MB, and inspecting the levels of individual list
elements suggests that they are distinct objects: the STRSXP addresses
differ, although the cached CHARSXP entries they point to are shared.
Perhaps some performance could be gained by copying the levels
"by reference", but I don't know R internals well enough to tell whether
that is possible. Is there a particular reason for creating a full copy
of the factor levels?
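Independent of the answer on the internals, one user-level way to avoid the per-element levels is to keep the integer codes in the list and store the levels once. The following is only a minimal sketch under that assumption; `element_as_factor` is a hypothetical helper made up for illustration, not part of base R or plyr.

```r
x <- factor(seq_len(1e4))

# Keep the integer codes per element and the levels once, instead of
# one single-element factor (each with its own levels attribute).
codes <- as.list(as.integer(x))   # as.integer() drops the levels attribute
lv    <- levels(x)                # a single character vector of levels

# Rebuild element i as a factor only when it is actually needed.
# `element_as_factor` is a hypothetical helper, not an existing API.
element_as_factor <- function(i) factor(lv[codes[[i]]], levels = lv)

# The rebuilt element should match what subsetting the factor returns.
same <- identical(element_as_factor(3), x[3])
```

Whether this helps in a case like rbind.fill depends on how often the elements have to be turned back into factors.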
This came up while looking at the performance of rbind.fill (in the
plyr package) with factors: https://github.com/hadley/plyr/issues/206 .
Best regards
Kirill
> gc()
          used (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells  325977 17.5    1074393  57.4  10049951  536.8
Vcells 4617168 35.3   87439742 667.2 204862160 1563.0
> system.time(x <- factor(seq_len(1e4)))
   user  system elapsed
  0.008   0.000   0.007
> system.time(xx <- as.list(x))
   user  system elapsed
  4.263   0.000   4.322
> gc()
            used  (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells    385991  20.7    1074393  57.4  10049951  536.8
Vcells 104672187 798.6  112367694 857.3 204862160 1563.0
> .Internal(inspect(levels(xx[[1]])))
@387f620 16 STRSXP g1c7 [MARK,NAM(2)] (len=10000, tl=0)
  @144da4e8 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "1"
  @144da518 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "2"
  @27d1298 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "3"
  @144da548 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "4"
  @144da578 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "5"
  ...
> .Internal(inspect(levels(xx[[2]])))
@1b38cb90 16 STRSXP g1c7 [MARK,NAM(2)] (len=10000, tl=0)
  @144da4e8 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "1"
  @144da518 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "2"
  @27d1298 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "3"
  @144da548 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "4"
  @144da578 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "5"
  ...
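The Vcells growth in the transcript above is consistent with every list element holding its own length-10000 levels vector. A back-of-envelope check, assuming a 64-bit build where each entry of a character vector is one 8-byte pointer (the variable names are only for illustration):

```r
# Rough check of the Vcells growth reported by gc() in the transcript.
n          <- 1e4   # number of elements, and also number of levels
cell_bytes <- 8     # assumption: 64-bit build, one 8-byte pointer per level

# If each of the n list elements carries its own length-n levels vector:
est_cells <- n * n
est_mb    <- est_cells * cell_bytes / 2^20

# Vcells "used" after and before as.list(), copied from the transcript.
obs_cells <- 104672187 - 4617168
obs_mb    <- obs_cells * cell_bytes / 2^20

round(c(estimated = est_mb, observed = obs_mb), 1)
```

Both figures come out at roughly 763 MB, so nearly all of the extra memory would be pointer arrays for the duplicated levels vectors. The strings themselves are cached and shared, as the identical CHARSXP addresses show.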