[Rd] Parsing and deparsing of escaped unicode characters
Jeroen Ooms
jeroenooms at gmail.com
Mon Jul 28 10:47:57 CEST 2014
In both R and JSON (and many other languages), Unicode characters can
be escaped using a backslash followed by a lowercase "u" and a four-digit
hex code. However, when deparsing a character vector in R on Windows,
non-Latin characters are instead escaped as "<U+" followed by their
four-digit hex code and ">":
> x <- "I like \u5BFF\u53F8"
> cat(x)
I like 寿司
> src <- deparse(x)
> cat(src)
"I like <U+5BFF><U+53F8>"
The same thing happens on Linux when we disable UTF-8:
Sys.setlocale("LC_ALL", "C")
x <- "I like \u5BFF\u53F8"
nchar(x) #9, seems OK
cat(deparse(x))
"I like <U+5BFF><U+53F8>"
As a result, the deparsed code does not parse() back into the proper
Unicode characters. I am currently using a regular expression to convert
the output of deparse() into something that parse() (and JSON) supports:
utf8conv <- function(x) {
  gsub("<U\\+([0-9A-F]{4})>", "\\\\u\\1", x)
}
> src <- utf8conv(src)
> y <- parse(text=src)[[1]]
> identical(x, y)
[1] TRUE
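The four-digit pattern in the workaround only covers code points up to U+FFFF. As a minimal sketch, assuming that characters beyond that range are deparsed as "<U+" followed by eight hex digits (an assumption not verified in this post), the substitution could be extended as follows. The name utf8conv2 is hypothetical.

```r
# Hypothetical helper (not part of base R): rewrite the <U+XXXX> markers
# produced by deparse() as backslash escapes that parse() accepts.
utf8conv2 <- function(x) {
  # Assumed eight-digit form for code points above U+FFFF -> \UXXXXXXXX
  x <- gsub("<U\\+([0-9A-F]{8})>", "\\\\U\\1", x)
  # Four-digit form for code points up to U+FFFF -> \uXXXX
  gsub("<U\\+([0-9A-F]{4})>", "\\\\u\\1", x)
}

# The deparsed text from the example, written out literally so that the
# result does not depend on the current locale.
src <- "\"I like <U+5BFF><U+53F8>\""
out <- utf8conv2(src)
```

Note that any regex-based conversion of this kind will also rewrite a string that happens to contain the literal text "<U+5BFF>", so it is not a fully safe round trip.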
However, this is suboptimal because the extra regular expression pass
adds considerable overhead for large text. Several things are unclear to me:
- Why does deparse() use a different escape notation than parse()? Is
there a way to make deparse() output \uXXXX for Unicode instead?
- Why does deparse() on Windows escape these characters in the first
place, rather than keeping the actual character when the locale supports it?
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base