[R] WG: AW: Another problem with encoding

Peter Dalgaard P.Dalgaard at biostat.ku.dk
Wed Jan 2 16:55:12 CET 2008


Matthias Wendel wrote:
> Hello, Peter,
> 	I tried it out: iconv(names(attributes(spss[,'Y6'])[[1]][14]), "UTF-8", "LATIN1", sub='byte') yielded 
>
> [1] "<c4>rzte Chirurgie" 
>
> and c4 corresponds in most encodings to Ä. What can I do next? I wonder whether there is a more comfortable way than replacing each
> occurrence of <..> with the appropriate character.
>   
Not sure what you want here. Isn't it just the reverse conversion,
iconv(...., from="latin1", to="UTF-8")?
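
A minimal sketch of that reverse conversion, assuming the label really is latin1 (or Windows-1252, which agrees with latin1 for Ä). The names `label` and `label_utf8` are made up for illustration, and the string is built from raw bytes so that the example does not depend on the encoding of the file it is pasted into:

```r
# 0xC4 is "Ä" in latin1; the rest of the label is plain ASCII.
label <- rawToChar(as.raw(c(0xc4, utf8ToInt("rzte Chirurgie"))))

# Reverse conversion: read the bytes as latin1 and re-encode them as UTF-8.
label_utf8 <- iconv(label, from = "latin1", to = "UTF-8")

Encoding(label_utf8)        # "UTF-8"
charToRaw(label_utf8)[1:2]  # c3 84, the two-byte UTF-8 form of "Ä"
```

If the real data turn out to be in some other single-byte encoding, only the `from` argument would need to change.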

Notice that c4 is not Ä in UTF8:

> iconv("Ä", to="ascii", sub="byte")
[1] "<c3><84>"

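In fact, a c4 byte followed by "r" is not anything in UTF-8 (c4 can only start a two-byte sequence, and "r" is not a continuation byte), hence the "invalid UTF-8 string" message.
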
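
To see the difference at the byte level, here is a small sketch. It assumes a current R, since validUTF8() was added to base R well after this thread; the variable names are hypothetical:

```r
# "Är" as it is stored in latin1: one byte per character.
latin1_str <- rawToChar(as.raw(c(0xc4, 0x72)))

# "Är" as it is stored in UTF-8: "Ä" takes two bytes (c3 84).
utf8_str <- rawToChar(as.raw(c(0xc3, 0x84, 0x72)))

validUTF8(latin1_str)  # FALSE: c4 must be followed by a continuation byte
validUTF8(utf8_str)    # TRUE
```

A check like this is one way to tell whether a label read by read.spss() is already UTF-8 or still needs converting.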
> Regards,
> Matthias
>
> -----Original Message-----
> From: Peter Dalgaard [mailto:p.dalgaard at biostat.ku.dk] 
> Sent: Tuesday, January 1, 2008 20:21
> To: Matthias Wendel
> Subject: Re: AW: [R] Another problem with encoding
>
> Matthias Wendel wrote:
>   
>> Happy new year and my apologies, Peter. Here are the missing facts:
>> I'm reading in an SPSS file, doing some calculations, and writing the
>> results to an XML file. The XML file is UTF-8 encoded, and so should be the results and their labels (e.g. Ärzte Chirurgie).
>> Here is part of the R session:
>>
> As a matter of principle: Requests for more information are not offers that I will solve your problems personally. Stay on the list!
>
> The characters seem to travel OK in email, so latin1 is a guess. Have you tried the sub="byte" argument to iconv()?
>
>
>>> Sys.getlocale()
>> [1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
>>
>>> spss[,'Y6']
>>>     
>>>       
>>   [1]  6  3  8 11  8  9  6  8  3  5 10 15 NA  9  8  3  8 16  6  6 NA 10  5  2  7  7  6 16  7 15  7 10 12
>>  [34]  8  7 12 12 16  7  6  8  8 15  6 NA  8 99  7 12  8  9 16  7 16  8  7  7  1 15 12  8  7 10  7  8  7
>>  [67]  8  9  8  6  6  8  6 16 11  5 11 11  1 11  3  7  7 10 10 10  6 11 16 NA  1  3  2 10 99 10  3  3  9
>> [100]  7 16 99 16  1 10  2 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 NA 10 16 16 NA  6 10  5 11
>> [133] 11  1  1  1  1 16  1 16  1  1  1  1  6  6  6 16  8 16 16 16 16  5  6 10 99 11 11 10  6  6  1  1  6
>> [166]  1 11 11 16  9 11 16  6  8  8 16 16  8  6 16 16 12 12 12 12 12 12 12 16  9 16 15 12 12 15 10 16 15
>> [199]  4  1  2 14  4  4  2  5 NA  1  5  5  7  9  5 12 12 NA 16 12 12 12 12 12 12 12 12 12 99 NA 12 12 NA
>> [232]  1 16  1  7 11  5  6  7  1 13  6  8 16  2  1  5 16 16  9  8  8  8  7 16  8  8  2  8  5  4  6 14  5
>> [265] 14  8  8 14  4  4  8 14  8 14  6  2  3 14  3 16  5 15 15 15 15 15 15 15 15 15 15 15 13 13 13 13 13
>> [298] 13 13 13 13 13 13 13 13 15  6 NA 12  3  9  9 NA 10 16
>> attr(,"value.labels")
>>                           Verwaltung Servicegesellschaft Waldfriede (SKW) 
>>                                   16                                   15 
>>            Kurzzeitpflege Waldfriede                        Sozialstation 
>>                                   14                                   13 
>>                  Krankenpflegeschule              Med. Technischer Dienst 
>>                                   12                                   11 
>>                            Pflege OP                      Funktionsdienst 
>>                                   10                                    9 
>>                   Pflege Gynäkologie                     Pflege Chirurgie 
>>                                    8                                    7 
>>                        Pflege Innere            Ärzte Anästhesie, Röntgen 
>>                                    6                                    5 
>>                    Ärzte Gynäkologie                      Ärzte Chirurgie 
>>                                    4                                    3 
>>                         Ärzte Innere         Patientenberatung/-betreuung 
>>                                    2                                    1 
>>   
>>     
>>> names(attributes(spss[,'Y6'])[[1]][14])
>>>     
>>>       
>> [1] "Ärzte Chirurgie"
>>   
>>     
>>> iconv(names(attributes(spss[,'Y6'])[[1]][14]), "UTF-8", "LATIN1")
>>>     
>>>       
>> [1] NA
>>   
>>     
>>> utf8ToInt(names(attributes(spss[,'Y6'])[[1]][14]))
>>>     
>>>       
>> Error in utf8ToInt(names(attributes(spss[, "Y6"])[[1]][14])) : 
>>   invalid UTF-8 string
>>   
>>
>> Cheers,
>> Matthias
>>
>>
>> -----Original Message-----
>> From: Peter Dalgaard [mailto:p.dalgaard at biostat.ku.dk] 
>> Sent: Monday, December 31, 2007 10:45
>> To: Matthias Wendel
>> Cc: r-help at stat.math.ethz.ch
>> Subject: Re: [R] Another problem with encoding
>>
>> Matthias Wendel wrote:
>>   
>>     
>>> Hi
>>>     I've imported an SPSS file using read.spss. One variable has values
>>> like 'Ärzte'. I thought this was UTF-8 encoded, but it is not (as the results of iconv and utf8ToInt suggest). Is there any way to
>>> find out how these SPSS values are encoded?
>>>
>> You are assuming a bit much of your readers.
>>
>> What exactly are you doing? Is it a value, a value label, or perhaps a variable name? How do the results of read.spss look on the R
>> side? How did you apply iconv and utf8ToInt? What is your locale?
>>
>> I mean, we could try and guess all those details, but you are the one with the hard info, and the motivation...
>>


-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
