[Rd] read.table with more cols than headers

Wed Aug 2 04:06:02 CEST 2006

I am trying to understand the behaviour of read.table() reading 
delimited files (with header=TRUE and fill=TRUE) when there are more 
(possibly spurious) columns than headings.  I give below four small 
data files, all of which have one or two extra columns added to one 
line.  Reading the first file produces an error message, the second 
produces a column of NA, the third adds an extra row, the fourth 
ignores the extra columns with no message and no NA.  Most 
unintuitive!  Here are my attempts to understand this, with questions 
interpolated.

The behaviour on the first file seems self-explanatory.  The number 
of headings determines the number of columns, and extra data columns 
are not allowed.  (On the other hand, the help ?read.table says that 
the number of columns is determined from the first five rows, which 
suggests that the header line is not the only determiner.  If 
headers, when present, are indeed the only determiner, perhaps this 
should be mentioned in the help.  Are headers actually equivalent to 
specifying the same set of names using the col.names argument?)
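
To make the column-count question concrete, here is a minimal sketch that runs count.fields() on file 1, supplied inline through a text connection. It assumes, as the help describes, that the number of header fields is compared with the widest of the first five lines. The variable names are illustrative only.

```r
# File 1 from this post, one string per line.
test1 <- c("X,Y", "a,2", "b,4,,", "c,6")

# Count the comma-separated fields on every line.
con <- textConnection(test1)
n_fields <- count.fields(con, sep = ",")
close(con)

n_header <- n_fields[1]             # fields in the header line
n_cols   <- max(head(n_fields, 5))  # widest of the first five lines

# A surplus of exactly one column can be absorbed as row names;
# a larger surplus matches the error reported for file 1 in this post.
surplus <- n_cols - n_header

print(n_fields)
print(surplus)
```

Running the same counts on the other three files shows where the long line falls relative to the first five lines, which seems to be what separates the four outcomes.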

For the second file, the first column is being taken as row 
names.  This agrees with the help which says if "the header line has 
one less entry than the number of columns, the first column is taken 
to be the row names".  OK, perhaps not the ideal solution for this 
data file, but clearly documented behaviour.

In the third file, the extra columns are being taken to be a new 
row.  This seems wrong, because the help says that cases correspond 
to lines.  There is no suggestion in the documentation that a line of 
the file could contain multiple cases.  This is the result I have 
most trouble with.  I guess I could prevent this behaviour with flush=TRUE.
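
The flush=TRUE idea can be tried with a short sketch like the one below. It assumes the text= argument of read.csv(), which recent versions of R provide, so that file 3 can be given inline. The object names are made up for the illustration.

```r
# File 3 from this post: the long line is line 6, outside the first five.
test3 <- "X,Y\na,2\nb,4\nc,6\nd,8\ne,10,,\nf,12\n"

# Default behaviour: the surplus fields on line 6 spill into an extra row.
spilled <- read.csv(text = test3)

# flush = TRUE is passed on to read.table(), so anything after the last
# expected field on a line is discarded instead of starting a new row.
flushed <- read.csv(text = test3, flush = TRUE)

print(nrow(spilled))
print(nrow(flushed))
```

If flush behaves as documented, the second data frame should have one row per data line of the file.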

File 4 is curious.  Here the number of columns has been determined, 
using the first 5 rows of the file, to be two.  The extra column on 
line 6 can't change this, so the first column doesn't become row 
names.  But in that case, shouldn't the extra column found on line 6 
produce an error message, same as for file 1?

Specifying colClasses to be a vector of length more than 2 when 
reading file 3 will produce a result similar to file 4, but with a 
warning.  It is not clear to me why colClasses should have an 
influence, since it doesn't change the determination of the number of 
columns.  Why a warning here, but an error for file 1 and no message 
for file 4?
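
One way to make the result independent of where the long line falls is sketched below, as an assumption about what a workaround could look like rather than as intended usage. It counts fields over the whole file and supplies col.names of that length, relying on the documented rule that a longer col.names sets the number of columns. The "extra" column names are hypothetical placeholders, and text= assumes a recent version of R.

```r
# File 3 from this post, one string per line.
lines3 <- c("X,Y", "a,2", "b,4", "c,6", "d,8", "e,10,,", "f,12")

# Widest line anywhere in the file, not only in the first five lines.
con <- textConnection(lines3)
width <- max(count.fields(con, sep = ","))
close(con)

# Keep the real header names and pad them with placeholder names.
header <- strsplit(lines3[1], ",", fixed = TRUE)[[1]]
padded <- c(header, paste0("extra", seq_len(width - length(header))))

# Skip the header line and supply the padded names explicitly, so every
# field of the long line lands in a column of its own.
wide <- read.csv(text = paste(lines3, collapse = "\n"),
                 header = FALSE, skip = 1, col.names = padded)

print(dim(wide))
print(names(wide))
```

The spurious columns then show up explicitly and can be inspected or dropped after reading.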

Any comments gratefully received.
Gordon

test1.txt:
X,Y
a,2
b,4,,
c,6

test2.txt:
X,Y
a,2
b,4,
c,6

test3.txt:
X,Y
a,2
b,4
c,6
d,8
e,10,,
f,12

test4.txt:
X,Y
a,2
b,4
c,6
d,8
e,10,
f,12

 > read.csv("test1.txt")
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
         more columns than column names
 > read.csv("test2.txt")
   X  Y
a 2 NA
b 4 NA
c 6 NA
 > read.csv("test3.txt")
   X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6   NA
7 f 12
 > read.csv("test4.txt")
   X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6 f 12
 > read.csv("test3.txt",colClasses=c(NA,NA))
   X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6   NA
7 f 12
 > read.csv("test3.txt",colClasses=c(NA,NA,NA,NA))
   X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6 f 12
Warning message:
cols = 2 != length(data) = 4 in: read.table(file = file, header = header, sep = sep, quote = quote,

 > sessionInfo()
R version 2.4.0 Under development (unstable) (2006-07-25 r38698)
i386-pc-mingw32

locale:
LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;LC_MONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.1252

attached base packages:
[1] "methods"   "stats"     "graphics"  "grDevices" "utils"     "datasets"  "base"