[R] Matching names with non-English characters
Spencer Graves
spencer.graves at structuremonitoring.com
Mon May 13 18:05:50 CEST 2013
Hello:
How can one match names containing non-English characters that
appear differently in different but related data files? For example, I
have data on Raúl Grijalva, who represents the third district of Arizona
in the US House of Representatives. This first name appears as "Raúl"
in data read from one file and "Raul" from another.
The ideal would convert both "Raúl" and "Raul" to "Raul". A
reasonable alternative would identify the non-English characters and
match on everything else ("^Ra" and "l$" in this case). The files all
contain state and district, so "AZ-3" could be part of the solution.
However, the file also contains data on Grijalva's predecessor in that
office, Ben Quayle, so "AZ-3" is not enough.
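Both ideas above can be sketched in base R. This is only a minimal illustration, not a tested solution: the helper names strip_accents and wildcard_pattern are hypothetical, the character map covers only the accents in my examples, and the strings are assumed to be UTF-8. (iconv with to = "ASCII//TRANSLIT" is another option, but its output is platform-dependent.)

```r
# Hypothetical helper: replace a small, explicit set of accented letters
# with unaccented ASCII counterparts. Assumes the input is UTF-8.
strip_accents <- function(x) {
  chartr("áéíóúüñÁÉÍÓÚÜÑ", "aeiouunAEIOUUN", x)
}

# Hypothetical helper: build a regular expression in which every
# non-ASCII character of the name is replaced by ".", which matches
# any single character. Regex metacharacters in the name are not escaped.
wildcard_pattern <- function(x) {
  paste0("^", gsub("[^\\x01-\\x7F]", ".", x, perl = TRUE), "$")
}

# Convert the accented form to plain ASCII.
strip_accents("Raúl")

# Match the unaccented form against a pattern built from the accented one.
grepl(wildcard_pattern("Raúl"), "Raul")
```

The explicit map is deterministic but has to be extended by hand for each new accented letter; the wildcard pattern needs no map but can match unrelated names that differ in only that one position.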
Thanks,
Spencer
p.s. My current data contains other similar cases, e.g.:
Recipient          District
Raúl Grijalva      AZ House 3
Tony Cárdenas      CA House 29
Linda Sánchez      CA House 38
Raúl Labrador      ID House 1
André Carson       IN House 7
Bob Menéndez       NJ Senate
Ben Ray Luján      NM House 3
José Serrano       NY House 15
Nydia Velázquez    NY House 7
Rubén Hinojosa     TX House 15
These names all appear differently in another file I have. I've
written an ugly function that can identify "nonstandard characters".
I'm confident I can solve this problem. However, I'm adding things like
this to the Ecdat package, and it would be more useful for others if I
made better use of other capabilities in R.
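To make the question concrete, here is a rough sketch of the kind of matching I have in mind, combining an ASCII version of the name with the district so that Grijalva and Quayle stay distinct. The second data frame (file2) and the helper ascii_key are made up for illustration, the character map is deliberately incomplete, and the strings are assumed to be UTF-8.

```r
# A few of the names listed above; assumed to be UTF-8.
file1 <- data.frame(
  Recipient = c("Raúl Grijalva", "Raúl Labrador", "André Carson"),
  District  = c("AZ House 3", "ID House 1", "IN House 7"),
  stringsAsFactors = FALSE
)

# Hypothetical second file: same people spelled without accents,
# plus Grijalva's predecessor in the same district.
file2 <- data.frame(
  Name     = c("Raul Grijalva", "Ben Quayle", "Raul Labrador", "Andre Carson"),
  District = c("AZ House 3", "AZ House 3", "ID House 1", "IN House 7"),
  stringsAsFactors = FALSE
)

# Flag names that contain at least one non-ASCII character.
has_non_ascii <- grepl("[^\\x01-\\x7F]", file1$Recipient, perl = TRUE)

# Hypothetical helper: map a few accented letters to ASCII.
ascii_key <- function(x) chartr("áéíóúüñ", "aeiouun", x)

file1$key <- ascii_key(file1$Recipient)
file2$key <- ascii_key(file2$Name)

# Join on the ASCII name and the district together.
matched <- merge(file1, file2, by = c("key", "District"))
```

Joining on both columns keeps Ben Quayle out of the result even though he shares "AZ House 3" with Grijalva.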