Asp Forum - character substitution using tr

Max Williams

4/22/2008 9:46:00 AM

I'm using a method that I found on the acts_as_ferret site:

http://projects.jkraemer.net/acts_as_ferret/#UT...

which is intended to strip accents out of strings, turning for example
"La Bohème" into "La Boheme". Here's the method:

def strip_diacritics(s)
  # latin1 subset only
  s.tr("ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
       "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
    gsub(/Æ/, "AE").
    gsub(/Ð/, "Eth").
    gsub(/Þ/, "THORN").
    gsub(/ß/, "ss").
    gsub(/æ/, "ae").
    gsub(/ð/, "eth").
    gsub(/þ/, "thorn")
end

However, it's breaking for me: è is turned into "yy". I think this is
to do with the number of bytes used: the first string passed to tr()
uses 2 bytes per character while the second uses 1 byte per character:

"ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ".size
=> 110

"AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy".size
=> 55

Assuming this is the problem, can anyone tell me how to get around it?
I know next to nothing about character encoding: I tried converting both
translation strings to UTF-8 with String#toutf8, but that didn't make any
difference.

thanks, max
--
Posted via http://www.ruby-....

4 Answers

Jan Dvorak

4/22/2008 4:44:00 PM

On Tuesday 22 April 2008 11:46:23 Max Williams wrote:
> I'm using a method that I found on the acts_as_ferret site:
>
> http://projects.jkraemer.net/acts_as_ferret/#UT...
>
> which is intended to strip accents out of strings, turning for example
> "La Bohème" into "La Boheme". Here's the method:
>
> def strip_diacritics(s)
>   # latin1 subset only
>   s.tr("ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
>        "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
>     gsub(/Æ/, "AE").
>     gsub(/Ð/, "Eth").
>     gsub(/Þ/, "THORN").
>     gsub(/ß/, "ss").
>     gsub(/æ/, "ae").
>     gsub(/ð/, "eth").
>     gsub(/þ/, "thorn")
> end
>

With Ruby 1.9 your code works fine without modifications; with Ruby 1.8
and its support for Unicode (or lack thereof), it might be quite a
problem to get it working.

> Assuming this is the problem, can anyone tell me how to get around it?
> I know next to nothing about character encoding: i tried converting both
> translation strings to utf8 with String#toutf8, but that didn't make any
> difference.

UTF-8 is a variable-length encoding: the 7-bit ASCII range (which
includes a-zA-Z) is stored unchanged at 1 byte per character, and
everything else is encoded as 2 to 4 bytes per character. Both of the
strings are therefore valid UTF-8, but Ruby 1.8's tr can't operate at
the character level, only at the byte level.
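
To make the byte-versus-character distinction concrete, here is a small
sketch. It assumes a Ruby newer than 1.8 (1.9 or later), where
String#bytesize and String#force_encoding exist. Tagging the strings as
binary is only a way of imitating 1.8's byte-level tr on a newer Ruby;
the helper name as_bytes is made up for this example.

```ruby
# encoding: utf-8
# Sketch only: needs Ruby 1.9+ (String#bytesize, String#force_encoding).

from = "ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ"
to   = "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy"

# Hypothetical helper: a binary-tagged copy, so tr sees raw bytes.
def as_bytes(str)
  str.dup.force_encoding(Encoding::BINARY)
end

char_count = from.length     # counts characters
byte_count = from.bytesize   # counts bytes; each accented letter here takes 2

# Character-aware tr: the accented letter lines up with its plain partner.
char_level = "è".tr(from, to)

# Byte-level tr pairs up bytes, not characters. The from-string is twice
# as long in bytes as the to-string, so tr pads the to-string with its
# last character ("y"), and both bytes of è end up mapped to that padding.
byte_level = as_bytes("è").tr(as_bytes(from), as_bytes(to))

puts "#{char_count} characters, #{byte_count} bytes"
puts "character-level tr: #{char_level}"
puts "byte-level tr:      #{byte_level}"
```

The counts should match the 55 and 110 from the original post, and the
byte-level result should be the same "yy" that Max is seeing.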

Jan

Max Williams

4/22/2008 5:06:00 PM

Jan Dvorak wrote:

> With Ruby 1.9 your code works fine without modifications; with Ruby 1.8
> and its support for Unicode (or lack thereof), it might be quite a
> problem to get it working.
>

Ah... I'm a bit scared to change our project over to Ruby 1.9 (I didn't
know there was a 1.9) to solve this problem. I ended up just picking
the most commonly used accents and doing individual gsubs on the strings
to swap them out. Feels dirty but it works.
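
A rough sketch of that workaround, for anyone in the same spot. The
mapping is only an illustrative guess at "commonly used accents" (the
actual list isn't given in this thread), and COMMON_ACCENTS and
strip_common_accents are made-up names. Each gsub matches the whole
multi-byte sequence for one accented letter, which is why this holds up
even where tr works byte by byte.

```ruby
# encoding: utf-8
# Hypothetical selection of accented letters; extend as needed.
COMMON_ACCENTS = [
  ["à", "a"], ["â", "a"], ["ä", "a"],
  ["ç", "c"],
  ["è", "e"], ["é", "e"], ["ê", "e"], ["ë", "e"],
  ["î", "i"], ["ï", "i"],
  ["ô", "o"], ["ö", "o"],
  ["ù", "u"], ["û", "u"], ["ü", "u"]
]

def strip_common_accents(s)
  # One gsub per accented letter, threading the result through each step.
  COMMON_ACCENTS.inject(s) { |result, (from, to)| result.gsub(from, to) }
end

puts strip_common_accents("La Bohème")
```

Anything not in the list passes through untouched, so this only covers
the letters you remember to add.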

Thanks a lot for the info!
max

--
Posted via http://www.ruby-....

Sebastian Hungerecker

4/22/2008 5:28:00 PM

Max Williams wrote:
> However, it's breaking for me: è is turned into "yy".

It works if you require 'jcode' first.
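
For reference, a minimal sketch of what that looks like. The version
guard is an addition here so the same file also loads on Ruby 1.9 and
later, where jcode no longer exists and tr is already character-aware.
Setting $KCODE to "u" is an assumption: as far as I know jcode takes its
cue from $KCODE, and Rails of that era sets it for you, so you may not
need that line.

```ruby
# encoding: utf-8
if RUBY_VERSION < "1.9"
  # Assumption: jcode consults $KCODE, so declare the strings as UTF-8.
  $KCODE = "u"
  require "jcode"   # makes String#tr multibyte-aware on Ruby 1.8
end

def strip_diacritics(s)
  # latin1 subset only
  s.tr("ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
       "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
    gsub(/Æ/, "AE").
    gsub(/Ð/, "Eth").
    gsub(/Þ/, "THORN").
    gsub(/ß/, "ss").
    gsub(/æ/, "ae").
    gsub(/ð/, "eth").
    gsub(/þ/, "thorn")
end

puts strip_diacritics("La Bohème")
```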

HTH,
Sebastian
--
NP: Depeche Mode - The Things You Said
Jabber: sepp2k@jabber.org
ICQ: 205544826

Max Williams

4/23/2008 8:49:00 AM

Sebastian Hungerecker wrote:
> Max Williams wrote:
>> However, it's breaking for me: è is turned into "yy".
>
> It works if you require 'jcode' first.
>
> HTH,
> Sebastian

Perfect, thanks! That's much more palatable than upgrading ruby.

cheers
max
--
Posted via http://www.ruby-....

comp.lang.ruby
