Robin Stocker
2/21/2006 1:54:00 PM
Xavier Noria wrote:
> I wrote this method
>
>   def self.normalize_for_sorting(s)
>     return nil unless s
>     norm = s.downcase
>     norm.tr!('ÁÉÍÓÚ', 'aeiou')
>     norm.tr!('ÀÈÌÒÙ', 'aeiou')
>     norm.tr!('ÄËÏÖÜ', 'aeiou')
>     norm.tr!('ÂÊÎÔÛ', 'aeiou')
>     norm.tr!('áéíóú', 'aeiou')
>     norm.tr!('àèìòù', 'aeiou')
>     norm.tr!('äëïöü', 'aeiou')
>     norm.tr!('âêîôû', 'aeiou')
>     norm
>   end
>
> to normalize strings for sorting. This script is UTF-8, everything is
> UTF-8 in my application, $KCODE is 'u'.
>
> But it does not work, examples:
>
> Andrés -> andruos
> López -> luupez
> Pérez -> puorez
>
> I tried to "force" it with Iconv.conv('UTF-8', 'ASCII', 'aeiou') to no
> avail. Any ideas?
>
> -- fxn
Hi,
My guess is that #tr treats its arguments as strings of bytes rather
than characters. Accented characters take more than one byte in UTF-8,
so #tr ends up mapping individual bytes and doesn't do what you would
expect. (It's not even tr's fault: how is it supposed to know that two
bytes actually represent a single character?)
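To make the byte-versus-character mismatch concrete, here is a small
sketch. The variable name e_acute is only illustrative; the string is
built with pack('U') so that it is UTF-8 encoded regardless of how this
message itself was transmitted.

```ruby
# Build "é" (U+00E9) as a UTF-8 string.
e_acute = [0xE9].pack('U')

# Viewed as bytes, the single character occupies two positions.
bytes = e_acute.unpack('C*')
p bytes              # => [195, 169]

# Viewed as Unicode code points, it is just one character.
code_points = e_acute.unpack('U*')
p code_points.length # => 1
```

A byte-oriented method sees two independent bytes here, so a translation
table of five two-byte characters lines up against 'aeiou' incorrectly.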
The solution is not to use #tr!, but #gsub!. It isn't as short, but at
least it's right ;)
  norm.gsub!('ä', 'a')
  norm.gsub!('ë', 'e')
  # and so on...
And because that is against DRY (Don't Repeat Yourself), I would
recommend storing the mapping as a hash:
  accents = { 'ä' => 'a', 'ë' => 'e', ... }
  accents.each do |accent, replacement|
    norm.gsub!(accent, replacement)
  end
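Put together, the hash-based loop above could look roughly like the
following. This is only a sketch: the ACCENT_GROUPS and ACCENTS
constants are names I made up, the table covers just the vowels from
your original method, and I wrote it as a plain top-level method rather
than a class method. It assumes the source file and the strings are
UTF-8.

```ruby
# Hypothetical table: each plain vowel, followed by the accented forms
# from the original method that should collapse onto it.
ACCENT_GROUPS = {
  'a' => %w[á à ä â Á À Ä Â],
  'e' => %w[é è ë ê É È Ë Ê],
  'i' => %w[í ì ï î Í Ì Ï Î],
  'o' => %w[ó ò ö ô Ó Ò Ö Ô],
  'u' => %w[ú ù ü û Ú Ù Ü Û]
}

# Flatten the groups into an accent => replacement hash.
ACCENTS = {}
ACCENT_GROUPS.each do |plain, accented_forms|
  accented_forms.each { |accented| ACCENTS[accented] = plain }
end

def normalize_for_sorting(s)
  return nil unless s
  norm = s.downcase # returns a copy, so the caller's string is untouched
  ACCENTS.each do |accent, replacement|
    # A String pattern is matched as a whole sequence, never byte by byte.
    norm.gsub!(accent, replacement)
  end
  norm
end

puts normalize_for_sorting('Andrés')
puts normalize_for_sorting('López')
puts normalize_for_sorting('Pérez')
```

The uppercase forms stay in the table because #downcase may leave
non-ASCII letters alone, depending on the Ruby version.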
Regards,
Robin Stocker