[R] How to remove non-UTF-8 characters from a string
Prof Brian Ripley
ripley at stats.ox.ac.uk
Fri Oct 26 16:44:48 CEST 2007
That is not a well-defined concept. To define 'character' you need to
know the encoding, since that determines how the bytes are split into
characters. So only whole strings can be UTF-8 or not. You can say which
bytes in a stream of bytes would be valid in UTF-8, but if not all of them
are, then it would almost certainly be incorrect to interpret any of them
as UTF-8.
You can find out whether a stream of bytes is valid in a UTF-8 locale by
calling nchar(x, "c", allowNA=TRUE) and testing for NA elements in the result.
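
A minimal sketch of that check, for illustration only. The sample vector
and the names x, n, invalid and x_ok are made up here, and marking the
strings with Encoding() is an added assumption so that the result does not
depend on the session's locale. Note that it flags whole strings, not
individual bytes, in line with the point above.

```r
# Two byte sequences: plain ASCII, and "fa<E7>ile", where 0xE7 is
# c-cedilla in Latin-1 but cannot stand alone as a character in UTF-8.
x <- c("abc", rawToChar(as.raw(c(0x66, 0x61, 0xe7, 0x69, 0x6c, 0x65))))

# Assumption: declare the strings to be UTF-8, so the check is made
# against UTF-8 rather than against whatever the current locale is.
Encoding(x) <- "UTF-8"

# Count characters; strings that are not valid UTF-8 give NA
# instead of an error because allowNA = TRUE.
n <- nchar(x, type = "chars", allowNA = TRUE)

# TRUE marks elements that cannot be interpreted as UTF-8.
invalid <- is.na(n)

# Keep only the strings that are valid as a whole.
x_ok <- x[!invalid]
```

Elements flagged in invalid are probably in some other encoding (an XLS
export is quite likely to be Latin-1 or a Windows code page), so converting
them from their real encoding is usually safer than deleting bytes.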
On Fri, 26 Oct 2007, Bos, Roger wrote:
> All,
>
> I am trying to post text from an XLS spreadsheet to my wiki and I need to
> remove any characters that are not UTF-8. Is there an easy gsub command
> that can do this?
>
> (I previously sent this same email to r-sig-gui. That was a mistake and
> I apologize for the duplication.)
>
> Thanks, Roger J. Bos
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595