[R] A read.table mystery (data for Framemaker Mac)
Emmanuel Charpentier
charpent at bacbuc.dyndns.org
Thu Oct 11 22:42:41 CEST 2007
Dear list,
I have to read some clinical data a file coming from Filemaker on
Macintosh (Ugh ! But it could be worse and come from Excel...).
Exporting via Excel is out of question since the file has 467 columns
and 121 lines (+ headers), which is out of reach of Excel. So I received
an "mer"" files, which is what Filemaker exports as a text file.
It seems to be a semicolon delimited file, with all data item enclosed
in double quotes, and header items not delimited.
So, first attempt is :
In bash :
# Convert from apparent Latin-1 to UTF-8
charpent at yod:/media/EMTEC/MARS-Point2007$ cat export.mer | iconv -f
latin1 -t utf8 > export-utf8.mer
# Check size (line count)
charpent at yod:/media/EMTEC/MARS-Point2007$ wc export-utf8.mer 122
8632 311252 export-utf8.mer
In R :
foo<-read.table("export-utf8.mer", header=TRUE, sep=";", dec=",",
check.names=TRUE, strip.white=TRUE, encoding="UTF-8")
/* Looong wait ! Almost 1 minute on a 3,2 GHz dual core... */
Warning message:
In read.table("export-utf8.mer", header = TRUE, sep = ";", dec = ",", :
readTableHeader a trouvé une ligne finale incomplète dans
'export-utf8.mer'
> dim(foo)
[1] 0 468
Ouch ! One column too much, and no lines...
Trying to print foo seems to start R in an endless loop. According to
emacs (ESS is my favorite interface to R), R tries to print more than 1
million lines (empty) before I give up and kill emacs.
Second attempt : separate headers from data :
In bash :
# Same conversion, discarding the first line
charpent at yod:/media/EMTEC/MARS-Point2007$ cat export.mer | iconv -f
latin1 -t utf8 | tail -n +2 > data.utf8
# check size
charpent at yod:/media/EMTEC/MARS-Point2007$ wc data.utf8 121 7581
301942 data.utf8
# Convert header line, and stack headers by replacing the separator (;)
# by a newline
charpent at yod:/media/EMTEC/MARS-Point2007$ cat export.mer | iconv -f
latin1 -t utf8 | head -n 1 | sed -e "s/;/\n/g" > headers.utf8
# Check size
charpent at yod:/media/EMTEC/MARS-Point2007$ wc headers.utf8
467 1517 9310 headers.utf8
In R :
> bar<-read.table("data.utf8", header=FALSE, sep=";", dec=",",
strip.white=TRUE, encoding="UTF-8")
> dim(bar)
[1] 121 467
/* Fine ... */
> hdrs<-read.table("headers.utf8", header=FALSE, sep=";",
strip.white=TRUE, encoding="UTF-8")
> dim(hdrs)
[1] 467 1
/* Also fine. Let's use these names : */
> names(bar)<-make.names(hdrs[,1])
This is quite fast, and the resulting data frame *seems* to be fine and
dandy.
I have found a pragmatic solution, which (more or less) solves my
problem. But I'm frustrated at understanding the root of the problem.
Of course, I also tried to read the header line alone as a semicolon
separated file of one line : it fails with a warning about an
unterminated final line and returns a data frame with 467 columns and 0
line.
And, by the way (let's try to be Ripley'd for something else than
forgetting to read the FAQ) :
> R.Version()
$platform
[1] "i486-pc-linux-gnu"
$arch
[1] "i486"
$os
[1] "linux-gnu"
$system
[1] "i486, linux-gnu"
$status
[1] ""
$major
[1] "2"
$minor
[1] "6.0"
$year
[1] "2007"
$month
[1] "10"
$day
[1] "03"
$`svn rev`
[1] "43063"
(This is the version packaged for Ubuntu grabbed from the relevant R
repository).
The R FAQ was unhelpful in this special case. The import/export guide
suggests to fiddle with delimiters and quoting, which I did, to no avail.
Searching for read.table in the searchable archives of R help gives
about 900 références for 2007. An honest effort to read the headers and
seemingly relevant messages did not shed any light.
My best clue so far is that the data items are enclosed in double
quotes, while items of the header line are not (just separated by
semicolons) ; I do not know how to coax read.table to read *that*.
I would appreciate any hint as to the source of the problem and a
possible solution (I try to find an as automated as possible way,
because I think I will have to read successive versions of this data
file many times...).
Sincerely yours,
Emmanuel Charpentier
More information about the R-help
mailing list