[R] Reading .csv file under linux

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Jan 23 00:14:52 CET 2008


On Wed, 23 Jan 2008, David Scott wrote:

> On Tue, 22 Jan 2008, Prof Brian Ripley wrote:
>
>> On Wed, 23 Jan 2008, David Scott wrote:
>> 
>>> 
>>> I have encountered a problem with reading a .csv file on a Linux box. I
>>> can read the file on my Windows machine (under XP), but on the Linux box it
>>> gives:
>>> 
>>>> patients <- read.csv("../Patients.csv", header = FALSE,
>>> +                      col.names = patientsNames)
>>> Error in type.convert(data[[i]], as.is = as.is[i], dec = dec,
>>> na.strings = character(0)) :
>>>   invalid multibyte string
>>> Calls: read.csv -> read.table -> type.convert
>>> Execution halted
>>> 
>>> I am running R 2.6.1 on both machines. I tried on another Linux box
>>> running 2.5.1 and got the same problem.
>>> 
>>> I am guessing it is something to do with the character encoding. On the
>>> linux box I have
>>> 
>>> LANG=en_US.UTF-8
>> 
>> So what encoding is the .csv file in?  Consider the example at the end of 
>> ?file
>>
>>     ## examples of use of encodings
>>     cat(x, file = file("foo", "w", encoding="UTF-8"))
>>     # read a 'Windows Unicode' file including names
>>     A <- read.table(file("students", encoding="UCS-2LE"))
>> 
>> and adapt accordingly (encoding = "CP1252" is the most likely value if this 
>> works in English-language Windows).
>> 
>
>
> Thanks Brian for the super-quick, super-helpful reply. The encoding you 
> suggested worked.
>
> I found a workaround myself too: I guessed that some plus/minus signs might 
> be the problem, replaced them, and could then read in the file.
> That is just a kludge, so I am using the encoding specification.
>
> I am a total dunce when it comes to encodings though. How do you find the 
> encoding of a file?

You ask the person who gave it to you.  You can't in general tell, and 
e.g. ISO-8859-1 and ISO-8859-2 are only distinguishable by someone who can 
read the contents (if it is a human language).  If you have just the odd 
symbol (e.g. degree sign or plus/minus) you can be completely stuck.
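
A minimal sketch of that ambiguity, assuming a single byte 0xB1 stands in 
for the "odd symbol" (the byte value is chosen for illustration only and is 
not taken from the file discussed above).  Both conversions succeed, so 
nothing in the data itself says which one is right:

```r
# One byte, 0xB1, as it might appear in a file of unknown encoding.
x <- rawToChar(as.raw(0xB1))

# The same byte is a valid character in both encodings, so neither
# conversion fails and nothing flags the wrong choice.
as_latin1 <- iconv(x, from = "ISO-8859-1", to = "UTF-8")  # plus/minus sign
as_latin2 <- iconv(x, from = "ISO-8859-2", to = "UTF-8")  # a with ogonek

print(c(latin1 = as_latin1, latin2 = as_latin2))
```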

'file' on Linux can usually guess whether a file is UTF-8 or ISO-8859-?, but 
not, of course, what the ? is.  Its guesses are based on statistical patterns 
and are good for text but not so good for data.
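
Within R, one thing that can be checked is whether the bytes are valid 
UTF-8 at all.  The following is only a sketch under stated assumptions: the 
temporary file is a hypothetical stand-in for Patients.csv, CP1252 is 
assumed as the sender's encoding, and validUTF8() needs R >= 3.3.0, which 
is newer than the versions mentioned in this thread.  A failed check says 
the file is not UTF-8; it does not say what the encoding is.

```r
# Hypothetical test file: "5±2" with the plus/minus sign stored as the
# single CP1252 byte 0xB1 (the real Patients.csv is not available here).
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("1,5"), as.raw(0xB1), charToRaw("2\n")), tmp)

# Read the lines as-is and check whether each is valid UTF-8.
raw_lines <- readLines(tmp, warn = FALSE)
is_utf8 <- validUTF8(raw_lines)   # FALSE here: a lone 0xB1 is not valid UTF-8

# If the check fails, convert from the encoding the sender reports
# (CP1252 is an assumption here) before parsing.
if (!all(is_utf8)) {
  raw_lines <- iconv(raw_lines, from = "CP1252", to = "UTF-8")
}
patients <- read.csv(text = raw_lines, header = FALSE,
                     stringsAsFactors = FALSE)
```

When the encoding is already known, passing file("../Patients.csv", 
encoding = "CP1252") to read.csv, as in the ?file example quoted above, 
does the conversion while reading.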

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


