[Rd] [PATCH] Improve utf8clen and remove utf8_table4

Duncan Murdoch murdoch.duncan at gmail.com
Sun Mar 19 13:38:18 CET 2017


On 19/03/2017 2:31 AM, Sahil Kang wrote:
> Given a char `c' which should be the start byte of a utf8 character,
> the utf8clen function returns the byte length of the utf8 character.
>
> Before this patch, the utf8clen function would return either:
>      * 1 if `c' was an ASCII character or a UTF-8 continuation byte
>      * An int in the range [2, 6] indicating the byte length of the utf8
> character
>
> With this patch, the utf8clen function will now return either:
>      * -1 if `c' is not a valid utf8 start byte
>      * The byte length of the utf8 character (the number of leading 1's,
> really)
>
> I believe returning -1 for continuation bytes makes utf8clen less
> error-prone.
> The utf8_table4 array is no longer needed and has been removed.
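
For readers without the patch at hand, the return values described above
can be sketched roughly as follows. This is an illustration only, not the
code from the patch: the name utf8clen_sketch is hypothetical, and
rejecting the bytes 0xFE and 0xFF (seven or eight leading 1 bits) is an
assumption, since the description only mentions lengths up to 6.

```c
/* Hypothetical sketch of the patched semantics; not R's actual source. */
int utf8clen_sketch(char c)
{
    unsigned char u = (unsigned char) c;
    int ones = 0;

    /* Count the leading 1 bits of the byte, most significant bit first. */
    while (ones < 8 && (u & (0x80u >> ones)))
        ones++;

    if (ones == 0) return 1;   /* 0xxxxxxx: ASCII, a one-byte character */
    if (ones == 1) return -1;  /* 10xxxxxx: continuation byte, not a start byte */
    if (ones > 6)  return -1;  /* 0xFE, 0xFF: assumed invalid in this sketch */
    return ones;               /* 110xxxxx .. 1111110x: lengths 2 to 6 */
}
```

Note that an ASCII byte has no leading 1 bits but still has length 1, so
"the number of leading 1's" holds only for multi-byte start bytes.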

utf8clen is used internally by R in more than a dozen places, and is 
likely used in packages as well.  Have you checked that this change in 
semantics won't break any of those uses?
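
To make the concern concrete: a caller written against the old contract
can advance a pointer by the return value unconditionally, because that
value was always at least 1. With a -1 return, the same loop would step
backwards. The sketch below is hypothetical and not taken from R's
sources; clen_patched stands in for the patched function, and
count_chars_guarded shows the kind of guard each call site would need.

```c
#include <stddef.h>

/* Hypothetical stand-in for the patched semantics: -1 for a byte that is
   not a valid start byte, otherwise the sequence length (1 to 6). */
static int clen_patched(char c)
{
    unsigned char u = (unsigned char) c;
    int ones = 0;
    while (ones < 8 && (u & (0x80u >> ones)))
        ones++;
    if (ones == 0) return 1;
    if (ones == 1 || ones > 6) return -1;
    return ones;
}

/* Count characters in a NUL-terminated string. An unguarded loop would
   simply do `s += clen_patched(*s)`, which is safe only while the return
   value is guaranteed to be positive. */
size_t count_chars_guarded(const char *s)
{
    size_t n = 0;
    while (*s) {
        int len = clen_patched(*s);
        if (len < 1)
            len = 1;  /* treat a stray byte as one unit, as the old return value did */
        /* Advance at most len bytes, never past the terminating NUL. */
        for (int i = 0; i < len && *s; i++)
            s++;
        n++;
    }
    return n;
}
```

Whether a given call site should skip the stray byte, stop, or signal an
error depends on that caller, which is why each use needs checking.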

Duncan Murdoch


