[Rd] read.table with more cols than headers
Gordon Smyth
smyth at wehi.EDU.AU
Wed Aug 2 04:06:02 CEST 2006
I am trying to understand the behaviour of read.table() reading
delimited files (with header=TRUE and fill=TRUE) when there are more
(possibly spurious) columns than headings. I give below four small
data files, all of which have one or two extra columns added to one
line. Reading the first file produces an error message, the second
produces a column of NA, the third adds an extra row, the fourth
ignores the extra columns with no message and no NA. Most
unintuitive! Here are my attempts to understand this, with questions
interpolated.
The behaviour on the first file seems self-explanatory. The number
of headings determines the number of columns, and extra data columns
are not allowed. (On the other hand, the help ?read.table says that
the number of columns is determined from the first five rows, which
suggests that the header line is not the only determiner. If
headers, when present, are indeed the only determiner, perhaps this
should be mentioned in the help. Are headers actually equivalent to
specifying the same set of names using the col.names argument?)
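As a rough check of the "first five rows" rule, count.fields() reports the number of fields on each line of a file. The sketch below only mimics the documented rule and is not read.table()'s actual code. It assumes the contents of files 1 and 3 are written to temporary files, and the helper name fields_in_head is made up for illustration.

```r
# Hypothetical helper: maximum field count over the first n lines
# (header included). This mimics the documented rule only.
fields_in_head <- function(lines, n = 5) {
  f <- tempfile(fileext = ".txt")
  on.exit(unlink(f))
  writeLines(lines, f)
  max(head(count.fields(f, sep = ","), n))
}

file1 <- c("X,Y", "a,2", "b,4,,", "c,6")
file3 <- c("X,Y", "a,2", "b,4", "c,6", "d,8", "e,10,,", "f,12")

fields_in_head(file1)  # the 4-field line is among the first five lines
fields_in_head(file3)  # the 4-field line is line 6, outside the window
```

If this reading is right, file 1 fails because four fields are seen against two headings, while file 3 is set up with two columns before the long line is ever read.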
For the second file, the first column is being taken as row
names. This agrees with the help which says if "the header line has
one less entry than the number of columns, the first column is taken
to be the row names". OK, perhaps not the ideal solution for this
data file, but clearly documented behaviour.
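The row-names rule for file 2 can be reproduced in a few lines. This is a minimal sketch assuming the contents of file 2 are written to a temporary file; the object names f and d are arbitrary.

```r
f <- tempfile(fileext = ".txt")
writeLines(c("X,Y", "a,2", "b,4,", "c,6"), f)

# One line has three fields and the header has two entries, so the
# first data column is taken to be the row names.
d <- read.csv(f)
rownames(d)   # taken from the first data column
names(d)      # the two header entries label the remaining columns
unlink(f)
```

The heading X therefore ends up labelling the numbers and Y the empty third column, which is where the column of NA comes from.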
In the third file, the extra columns are being taken to be a new
row. This seems wrong, because the help says that cases correspond
to lines. There is no suggestion in the documentation that a line of
the file could contain multiple cases. This is the result I have
most trouble with. I guess I could prevent this behaviour with flush=TRUE.
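A minimal sketch of the flush=TRUE idea, assuming the contents of file 3 are written to a temporary file. The flush argument is passed through read.csv() to read.table() and on to scan(), where it is documented as discarding the rest of a line once the requested fields have been read. The names d_default and d_flush are arbitrary.

```r
f <- tempfile(fileext = ".txt")
writeLines(c("X,Y", "a,2", "b,4", "c,6", "d,8", "e,10,,", "f,12"), f)

d_default <- read.csv(f)                # extra fields may spill into a new row
d_flush   <- read.csv(f, flush = TRUE)  # rest of each line is discarded

nrow(d_default)
nrow(d_flush)
unlink(f)
```

Note that flush=TRUE silently drops the extra fields, so it hides the malformed line rather than reporting it.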
File 4 is curious. Here the number of columns has been determined,
using the first 5 rows of the file, to be two. The extra column on
line 6 can't change this, so the first column doesn't become row
names. But in that case, shouldn't the extra column found on line 6
produce an error message, same as for file 1?
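Since no message is given for file 4, one way to spot such a line is to count fields over the whole file rather than only the first five lines. This is a sketch assuming the contents of file 4 are written to a temporary file; the names n and bad are arbitrary.

```r
f <- tempfile(fileext = ".txt")
writeLines(c("X,Y", "a,2", "b,4", "c,6", "d,8", "e,10,", "f,12"), f)

n <- count.fields(f, sep = ",")  # number of fields on every line
bad <- which(n != n[1])          # lines whose count differs from the header's
bad
unlink(f)
```

Running a check like this before read.table() would flag the trailing separator that is otherwise ignored.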
Specifying colClasses to be a vector of length more than 2 when
reading file 3 will produce a result similar to file 4, but with a
warning. It is not clear to me why colClasses should have an
influence, since it doesn't change the determination of the number of
columns. Why a warning here, but an error for file 1 and no message
for file 4?
Any comments gratefully received.
Gordon
test1.txt:
X,Y
a,2
b,4,,
c,6
test2.txt:
X,Y
a,2
b,4,
c,6
test3.txt:
X,Y
a,2
b,4
c,6
d,8
e,10,,
f,12
test4.txt:
X,Y
a,2
b,4
c,6
d,8
e,10,
f,12
> read.csv("test1.txt")
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
> read.csv("test2.txt")
  X  Y
a 2 NA
b 4 NA
c 6 NA
> read.csv("test3.txt")
  X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6   NA
7 f 12
> read.csv("test4.txt")
  X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6 f 12
> read.csv("test3.txt",colClasses=c(NA,NA))
  X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6   NA
7 f 12
> read.csv("test3.txt",colClasses=c(NA,NA,NA,NA))
  X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6 f 12
Warning message:
cols = 2 != length(data) = 4 in: read.table(file = file, header =
header, sep = sep, quote = quote,
> sessionInfo()
R version 2.4.0 Under development (unstable) (2006-07-25 r38698)
i386-pc-mingw32
locale:
LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;LC_MONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.1252
attached base packages:
[1] "methods" "stats" "graphics" "grDevices"
"utils" "datasets" "base"