comp.lang.ruby

Re: remove non-ASCII characters in a string

RubyTalk@gmail.com

1/22/2006 10:44:00 PM

What is considered an ASCII char? I used the
http://www.lookupt... chart, from ! to ~, or 33 .. 127

class String
  def remove_nonascii(replacement)
    n=self.split("")
    self.slice!(0..self.size)
    n.each{|b|
      if b[0].to_i< 33 || b[0].to_i>127 then
        self.concat(replacement)
      else
        self.concat(b)
      end
    }
    self.to_s
  end
end
require 'test/unit'

class TestAsciify < Test::Unit::TestCase
  def test_asciify
    assert_equal "Iñtërnâtiônàlizætiøn".remove_nonascii("?"),
      "I?t?rn?ti?n?liz?ti?n"
    assert_equal "Mötorhead".remove_nonascii("(removed)"), "M(removed)torhead"
  end
end


On 1/22/06, Ezra Zygmuntowicz <ezmobius@gmail.com> wrote:
> I have a need for something like this as well. But I need to
> replace the chars with something plain ascii besides a placeholder.
> Any ideas how to do that?
>
> I ended up finding the escape codes for all the chars like "\322"
> and friends so I could replace say curly quotes with standard quotes
> and stuff like that.
>
> I will play with your code a bit and see if I can make it do what I
> want. Thanks for sharing it though.
>
> Cheers-
> -Ezra
>
>
>
> On Jan 22, 2006, at 10:41 AM, Levin Alexander wrote:
>
> > Hi,
> >
> > I needed a method to convert a piece of text to plain ASCII and
> > replace all non-ASCII chars with a placeholder. I could not find
> > anything in the stdlib so I wrote one.
> >
> > I'd love to hear your comments. (or pointers to existing libraries for
> > this task)
> >
> > -Levin
> >
> >
> > #!/usr/bin/ruby
> >
> > require 'iconv'
> >
> > class String
> >
> >   # removes all characters which are not part of ascii
> >   # and replaces them with +replacement+
> >   #
> >   # +replacement+ is supposed to be the same encoding as +source+
> >   #
> >   def asciify(replacement = "?", target = "ASCII", source = "UTF-8")
> >     intermediate = "UCS-4"
> >     pack_format = "N*"
> >     i = Iconv.new(intermediate, source)
> >
> >     u16s = i.iconv(self)
> >     repl = i.iconv(replacement).unpack(pack_format)
> >
> >     s = u16s.unpack(pack_format).collect { |codepoint|
> >       codepoint < 128 ? codepoint : repl
> >     }.flatten.pack(pack_format)
> >
> >     return Iconv.new(target, intermediate).iconv(s)
> >   end
> > end
> >
> > if __FILE__ == $0
> >   require 'test/unit'
> >
> >   class TestAsciify < Test::Unit::TestCase
> >     def test_asciify
> >       assert_equal "Iñtërnâtiônàlizætiøn".asciify, "I?t?rn?ti?n?liz?ti?n"
> >       assert_equal "Mötorhead".asciify("(removed)"), "M(removed)torhead"
> >     end
> >   end
> > end
>
> -Ezra Zygmuntowicz
> WebMaster
> Yakima Herald-Republic Newspaper
> http://yakima...
> ezra@yakima-herald.com
> blog: http://b...
>
>
>
>
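
A note for readers on current Ruby: the iconv library used in the quoted code was removed from the standard library in Ruby 2.0. The following is only a minimal sketch of the same codepoint-filtering idea using unpack("U*") instead. It assumes the input is valid UTF-8, and the method name asciify_codepoints is hypothetical, not something from the thread.

```ruby
# Sketch only: replace every codepoint >= 128 with the replacement text.
# Assumes +str+ and +replacement+ are valid UTF-8 (unpack("U*") raises otherwise).
def asciify_codepoints(str, replacement = "?")
  repl = replacement.unpack("U*")          # replacement as an array of codepoints
  str.unpack("U*").flat_map { |codepoint|  # one integer per character
    codepoint < 128 ? [codepoint] : repl
  }.pack("U*")                             # back to a UTF-8 string
end

puts asciify_codepoints("M\u00F6torhead", "(removed)")
```

Like the quoted version, this keeps tabs and line breaks because everything below 128 passes through unchanged.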


9 Answers

Levin Alexander

1/22/2006 11:11:00 PM

On 1/22/06, ruby talk <rubytalk@gmail.com> wrote:
> What is considered an ascii char? i used the
> http://www.lookupt... chart from ! to ~ or 33 .. 127

I use everything <127 because I want to preserve tabs and linebreaks

> def remove_nonascii(replacement)

This does not work if the source text is UTF-8 encoded. On my machine:

  str = "ö" #=> "\303\266"
  str.remove_nonascii #=> "??"

-Levin
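
To make the point above concrete, here is a small sketch of why a byte-by-byte filter turns one UTF-8 character into two placeholders. It assumes Ruby 1.9 or later, where String#length counts characters rather than bytes; the variable names are only illustrative.

```ruby
# "ö" (U+00F6) written with an escape so the example does not depend on
# the encoding of the source file.
o_umlaut = "\u00F6"

byte_values   = o_umlaut.bytes                                   # the two UTF-8 bytes
octal_escapes = byte_values.map { |b| format("\\%o", b) }.join   # Ruby 1.8 style dump
char_count    = o_umlaut.length                                  # characters, not bytes

puts byte_values.inspect
puts octal_escapes
puts char_count
```

A filter that looks at each byte sees two values above 127 and emits two replacements, which matches the "??" result reported above.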

Dave Burt

1/23/2006 1:19:00 AM

ruby talk wrote:
> What is considered an ascii char? i used the
> http://www.lookupt... chart from ! to ~ or 33 .. 127

Your range excludes ASCII character 32, space.
" ".remove_nonascii("?") #=> "?"

You also likely want to include characters like tabs and newlines, which are
in the 0-31 control range.
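
A minimal sketch of a range check along those lines, keeping space, tab, and line breaks while replacing everything else outside the printable range. The helper names keep_ascii_byte? and replace_nonascii_bytes are hypothetical, and the choice of which control characters to keep is an assumption.

```ruby
# Sketch only: decide per byte whether to keep it.
# Kept: tab (9), line feed (10), carriage return (13), and printable ASCII 32..126.
def keep_ascii_byte?(byte)
  byte == 9 || byte == 10 || byte == 13 || (32..126).cover?(byte)
end

# Works on bytes, so a multi-byte UTF-8 character yields one
# replacement per byte.
def replace_nonascii_bytes(str, replacement = "?")
  str.each_byte.map { |b| keep_ascii_byte?(b) ? b.chr : replacement }.join
end

puts replace_nonascii_bytes("a b\tc")
```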

Levin's original version treats the original text as UTF-8. Is that part of
the requirements?

My version might look like this:

NON_ASCII = /[\x80-\xff]/
"Iñtërnâtiônàlizætiøn".gsub(NON_ASCII, "?") #=> "I?t?rn?ti?n?liz?ti?n"


Cheers,
Dave
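
For readers on Ruby 1.9 or later: the byte-range regex above relies on Ruby 1.8's byte-oriented strings. A hedged sketch of two character-aware equivalents follows, assuming UTF-8 input; the method names are hypothetical. The first uses a negated ASCII character class, the second uses String#encode with its replacement options.

```ruby
# Sketch only: replace each non-ASCII character (not byte) with +replacement+.
def ascii_only_by_regex(str, replacement = "?")
  str.gsub(/[^\x00-\x7F]/, replacement)
end

# Same idea via transcoding; characters that cannot be represented in
# US-ASCII (and invalid byte sequences) are replaced.
def ascii_only_by_encode(str, replacement = "?")
  str.encode("US-ASCII", invalid: :replace, undef: :replace, replace: replacement)
end

puts ascii_only_by_regex("M\u00F6torhead", "(removed)")
puts ascii_only_by_encode("M\u00F6torhead", "(removed)")
```

Both variants emit one replacement per character, so "ö" becomes a single "?" rather than two.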


pere.noel

1/23/2006 8:00:00 AM

Dave Burt <dave@burt.id.au> wrote:

> My version might look like this:
>
> NON_ASCII = /[\x80-\xff]/
> "Iñtërnâtiônàlizætiøn".gsub(NON_ASCII, "?") #=> "I?t?rn?ti?n?liz?ti?n"

I'd like not to remove non-ASCII chars but to replace all accented chars
(in a UTF-8 string) with their unaccented counterparts:

è => e
ä => a
ç => c

[...]

what is the best way to do that in Ruby?
--
une bévue

Dave Burt

1/23/2006 9:06:00 AM

"Une bévue" wrote:
> Dave Burt <dave@burt.id.au> wrote:
>
>> My version might look like this:
>>
>> NON_ASCII = /[\x80-\xff]/
>> "Iñtërnâtiônàlizætiøn".gsub(NON_ASCII, "?") #=> "I?t?rn?ti?n?liz?ti?n"
>
> I'd like not to remove non-ASCII chars but to replace all accented chars
> (in a UTF-8 string) with their unaccented counterparts:
>
> è => e
> ä => a
> ç => c
>
> [...]
>
> what is the best way to do that in Ruby?

Try the code below, translated from
http://stuffofinterest.com/misc/utf8-...

There may be a potential problem matching over character boundaries, but I
think UTF-8's unique starting bytes avoid the issue. So this should work.
For long strings, it could be slow. If I wanted speed, I'd probably do the
same thing in C and make it an extension.

Cheers,
Dave

class String
  # Translate accented utf8 characters over to non-accented
  def utf8_trans_unaccent
    tranmap = {
      "\xC3\x80" => "A", "\xC3\x81" => "A", "\xC3\x82" => "A", "\xC3\x83" => "A",
      "\xC3\x84" => "A", "\xC3\x85" => "A", "\xC3\x86" => "AE","\xC3\x87" => "C",
      "\xC3\x88" => "E", "\xC3\x89" => "E", "\xC3\x8A" => "E", "\xC3\x8B" => "E",
      "\xC3\x8C" => "I", "\xC3\x8D" => "I", "\xC3\x8E" => "I", "\xC3\x8F" => "I",
      "\xC3\x90" => "D", "\xC3\x91" => "N", "\xC3\x92" => "O", "\xC3\x93" => "O",
      "\xC3\x94" => "O", "\xC3\x95" => "O", "\xC3\x96" => "O", "\xC3\x98" => "O",
      "\xC3\x99" => "U", "\xC3\x9A" => "U", "\xC3\x9B" => "U", "\xC3\x9C" => "U",
      "\xC3\x9D" => "Y", "\xC3\x9E" => "P", "\xC3\x9F" => "ss",
      "\xC3\xA0" => "a", "\xC3\xA1" => "a", "\xC3\xA2" => "a", "\xC3\xA3" => "a",
      "\xC3\xA4" => "a", "\xC3\xA5" => "a", "\xC3\xA6" => "ae","\xC3\xA7" => "c",
      "\xC3\xA8" => "e", "\xC3\xA9" => "e", "\xC3\xAA" => "e", "\xC3\xAB" => "e",
      "\xC3\xAC" => "i", "\xC3\xAD" => "i", "\xC3\xAE" => "i", "\xC3\xAF" => "i",
      "\xC3\xB0" => "o", "\xC3\xB1" => "n", "\xC3\xB2" => "o", "\xC3\xB3" => "o",
      "\xC3\xB4" => "o", "\xC3\xB5" => "o", "\xC3\xB6" => "o", "\xC3\xB8" => "o",
      "\xC3\xB9" => "u", "\xC3\xBA" => "u", "\xC3\xBB" => "u", "\xC3\xBC" => "u",
      "\xC3\xBD" => "y", "\xC3\xBE" => "p", "\xC3\xBF" => "y",
      "\xC4\x80" => "A", "\xC4\x81" => "a", "\xC4\x82" => "A", "\xC4\x83" => "a",
      "\xC4\x84" => "A", "\xC4\x85" => "a", "\xC4\x86" => "C", "\xC4\x87" => "c",
      "\xC4\x88" => "C", "\xC4\x89" => "c", "\xC4\x8A" => "C", "\xC4\x8B" => "c",
      "\xC4\x8C" => "C", "\xC4\x8D" => "c", "\xC4\x8E" => "D", "\xC4\x8F" => "d",
      "\xC4\x90" => "D", "\xC4\x91" => "d", "\xC4\x92" => "E", "\xC4\x93" => "e",
      "\xC4\x94" => "E", "\xC4\x95" => "e", "\xC4\x96" => "E", "\xC4\x97" => "e",
      "\xC4\x98" => "E", "\xC4\x99" => "e", "\xC4\x9A" => "E", "\xC4\x9B" => "e",
      "\xC4\x9C" => "G", "\xC4\x9D" => "g", "\xC4\x9E" => "G", "\xC4\x9F" => "g",
      "\xC4\xA0" => "G", "\xC4\xA1" => "g", "\xC4\xA2" => "G", "\xC4\xA3" => "g",
      "\xC4\xA4" => "H", "\xC4\xA5" => "h", "\xC4\xA6" => "H", "\xC4\xA7" => "h",
      "\xC4\xA8" => "I", "\xC4\xA9" => "i", "\xC4\xAA" => "I", "\xC4\xAB" => "i",
      "\xC4\xAC" => "I", "\xC4\xAD" => "i", "\xC4\xAE" => "I", "\xC4\xAF" => "i",
      "\xC4\xB0" => "I", "\xC4\xB1" => "i", "\xC4\xB2" => "IJ","\xC4\xB3" => "ij",
      "\xC4\xB4" => "J", "\xC4\xB5" => "j", "\xC4\xB6" => "K", "\xC4\xB7" => "k",
      "\xC4\xB8" => "k", "\xC4\xB9" => "L", "\xC4\xBA" => "l", "\xC4\xBB" => "L",
      "\xC4\xBC" => "l", "\xC4\xBD" => "L", "\xC4\xBE" => "l", "\xC4\xBF" => "L",
      "\xC5\x80" => "l", "\xC5\x81" => "L", "\xC5\x82" => "l", "\xC5\x83" => "N",
      "\xC5\x84" => "n", "\xC5\x85" => "N", "\xC5\x86" => "n", "\xC5\x87" => "N",
      "\xC5\x88" => "n", "\xC5\x89" => "n", "\xC5\x8A" => "N", "\xC5\x8B" => "n",
      "\xC5\x8C" => "O", "\xC5\x8D" => "o", "\xC5\x8E" => "O", "\xC5\x8F" => "o",
      "\xC5\x90" => "O", "\xC5\x91" => "o", "\xC5\x92" => "OE","\xC5\x93" => "oe",
      "\xC5\x94" => "R", "\xC5\x95" => "r", "\xC5\x96" => "R", "\xC5\x97" => "r",
      "\xC5\x98" => "R", "\xC5\x99" => "r", "\xC5\x9A" => "S", "\xC5\x9B" => "s",
      "\xC5\x9C" => "S", "\xC5\x9D" => "s", "\xC5\x9E" => "S", "\xC5\x9F" => "s",
      "\xC5\xA0" => "S", "\xC5\xA1" => "s", "\xC5\xA2" => "T", "\xC5\xA3" => "t",
      "\xC5\xA4" => "T", "\xC5\xA5" => "t", "\xC5\xA6" => "T", "\xC5\xA7" => "t",
      "\xC5\xA8" => "U", "\xC5\xA9" => "u", "\xC5\xAA" => "U", "\xC5\xAB" => "u",
      "\xC5\xAC" => "U", "\xC5\xAD" => "u", "\xC5\xAE" => "U", "\xC5\xAF" => "u",
      "\xC5\xB0" => "U", "\xC5\xB1" => "u", "\xC5\xB2" => "U", "\xC5\xB3" => "u",
      "\xC5\xB4" => "W", "\xC5\xB5" => "w", "\xC5\xB6" => "Y", "\xC5\xB7" => "y",
      "\xC5\xB8" => "Y", "\xC5\xB9" => "Z", "\xC5\xBA" => "z", "\xC5\xBB" => "Z",
      "\xC5\xBC" => "z", "\xC5\xBD" => "Z", "\xC5\xBE" => "z", "\xC6\x8F" => "E",
      "\xC6\xA0" => "O", "\xC6\xA1" => "o", "\xC6\xAF" => "U", "\xC6\xB0" => "u",
      "\xC7\x8D" => "A", "\xC7\x8E" => "a", "\xC7\x8F" => "I",
      "\xC7\x90" => "i", "\xC7\x91" => "O", "\xC7\x92" => "o", "\xC7\x93" => "U",
      "\xC7\x94" => "u", "\xC7\x95" => "U", "\xC7\x96" => "u", "\xC7\x97" => "U",
      "\xC7\x98" => "u", "\xC7\x99" => "U", "\xC7\x9A" => "u", "\xC7\x9B" => "U",
      "\xC7\x9C" => "u",
      "\xC7\xBA" => "A", "\xC7\xBB" => "a", "\xC7\xBC" => "AE","\xC7\xBD" => "ae",
      "\xC7\xBE" => "O", "\xC7\xBF" => "o",
      "\xC9\x99" => "e",

      "\xC2\x82" => ",", # High code comma
      "\xC2\x84" => ",,", # High code double comma
      "\xC2\x85" => "...", # Triple dot
      "\xC2\x88" => "^", # High caret
      "\xC2\x91" => "\x27", # Forward single quote
      "\xC2\x92" => "\x27", # Reverse single quote
      "\xC2\x93" => "\x22", # Forward double quote
      "\xC2\x94" => "\x22", # Reverse double quote
      "\xC2\x96" => "-", # High hyphen
      "\xC2\x97" => "--", # Double hyphen
      "\xC2\xA6" => "|", # Split vertical bar
      "\xC2\xAB" => "<<", # Double less than
      "\xC2\xBB" => ">>", # Double greater than
      "\xC2\xBC" => "1/4", # one quarter
      "\xC2\xBD" => "1/2", # one half
      "\xC2\xBE" => "3/4", # three quarters

      "\xCA\xBF" => "\x27", # c-single quote
      "\xCC\xA8" => "", # modifier - under curve
      "\xCC\xB1" => "" # modifier - under line
    }

    # Apply each substitution in turn, starting from the receiver
    tranmap.inject(self) do |str, (utf8, asc)|
      str.gsub(utf8, asc)
    end
  end
end

"Iñtërnâtiônàlizætiøn".utf8_trans_unaccent #=> "Internationalizaetion"


pere.noel

1/23/2006 10:32:00 AM

Dave Burt <dave@burt.id.au> wrote:

> Try the code below, translated from
> http://stuffofinterest.com/misc/utf8-...
>
> There may be a potential problem matching over character boundaries, but I
> think UTF-8's unique starting bytes avoid the issue. So this should work.
> For long strings, it could be slow. If I wanted speed, I'd probably do the
> same thing in C and make it an extension.

Thanks a lot, this works great, even with ligatures. I don't need speed
because I'll use it only for file names...
--
une bévue
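
For readers on Ruby 2.2 or later, a table-free alternative for the plain accent cases is to decompose each character and drop the combining marks. This is only a sketch, assuming valid UTF-8 input; the method name strip_combining_marks is hypothetical. It does not cover ligatures or letters such as "æ" and "ø", which have no decomposition, so a lookup table like the one above is still needed for those.

```ruby
# Sketch only: NFD splits "è" into "e" plus a combining grave accent;
# \p{Mn} then matches and removes the combining (nonspacing) marks.
def strip_combining_marks(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

puts strip_combining_marks("\u00E8 \u00E4 \u00E7")
```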

dseverin

1/23/2006 11:03:00 AM

Ac??ñts, you say? What about these (incomplete list, and w/o ligatures) :))))

a, 69,AaªÀÁÂÃÄÅàáâãäåAaAaAaAaAa???????????????????????????????????????Å??Aa
b, 15, Bb????????B??Bb
c, 23, CcÇçCcCcCcCc???CC????Cc
d, 29,DdÐðDdÐd???????????????????Dd
e, 62,EeÈÉÊËèéêëEeEeEeEeEe???????????????????????????????????eE???Ee
f, 10, Ff???F??Ff
g, 23, GgGgGgGgGgGg??????g??Gg
h, 30,HhHhHh???????????????HHHh???Hh
i, 46,IiÌÍÎÏìíîïIiIiIiIiIIi???????????????II??????Ii
j, 12, JjJjj?????Jj
k, 19, KkKkKk????????K??Kk
l, 28,LlLlLlLlLl??????????Ll????Ll
m, 17, Mm????????M????Mm
n, 27,NnÑñNnNnNn???????????nN??Nn
o, 83,OoºÒÓÔÕÖØòóôõöøOoOoOoOoOoOoOo?????????????????????????????????????????????????o??Oo
p, 13, Pp??????P??Pp
q, 7, QqQ??Qq
r, 30,RrRrRrRr???????????????RRR??Rr
s, 29,SsSsSsSsŠš?????????????????Ss
t, 23, TtTtTt???????????????Tt
u, 69,UuÙÚÛÜùúûüUuUuUuUuUuUuUuUuUuUuUuUu?????????????????????????????????Uu
v, 14, Vv??????????Vv
w, 21, WwWw???????????????Ww
x, 14, Xx??????????Xx
y, 26,YyÝýÿYyŸ????????????????Yy
z, 21, ZzZzZzŽž???????ZZ??Zz

Martin DeMello

1/23/2006 12:29:00 PM

Dave Burt <dave@burt.id.au> wrote:
>
> tranmap.inject(self) do |str, (utf8, asc)|
> str.gsub(utf8, asc)
> end
> end
> end
>
> "Iñtërnâtiônàlizætiøn".utf8_trans_unaccent #=> "Internationalizaetion"

# one time preprocessing
INTL, ASC = "", ""
tranmap.each {|k,v|
INTL << k
ASC << v
}

# quicker than repeated gsubs:
str.tr(INTL, ASC)

martin

Dave Burt

1/23/2006 2:42:00 PM

Martin DeMello wrote:
> Dave Burt <dave@burt.id.au> wrote:
>>
>> tranmap.inject(self) do |str, (utf8, asc)|
>> str.gsub(utf8, asc)
>> end
>> end
>> end
>>
>> "Iñtërnâtiônàlizætiøn".utf8_trans_unaccent #=> "Internationalizaetion"
>
> # one time preprocessing
> INTL, ASC = "", ""
> tranmap.each {|k,v|
> INTL << k
> ASC << v
> }
>
> # quicker than repeated gsubs:
> str.tr(INTL, ASC)

Except that won't work, because tr only matches bytes, not multi-byte
characters.

(It might work after applying one of the Unicode string extensions that have
been floating around recently. But not in standard Ruby.)
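
For readers on Ruby 1.9 or later, where strings carry an encoding, the situation is different: tr works on characters there. It is still strictly one-to-one, though, so it cannot express entries such as "æ" => "ae". The sketch below shows both that and a single-pass gsub driven by a hash, which avoids one gsub call per table entry. The constants and method names are hypothetical, and the small sample table stands in for the full one.

```ruby
# Sketch only, assuming Ruby 1.9+ and UTF-8 strings.

# tr is character-aware here, but maps one character to one character.
ONE_TO_ONE_FROM = "\u00E8\u00E4\u00E7" # è ä ç
ONE_TO_ONE_TO   = "eac"

def unaccent_tr(str)
  str.tr(ONE_TO_ONE_FROM, ONE_TO_ONE_TO)
end

# A hash-driven gsub handles multi-character replacements in one pass.
SAMPLE_MAP     = { "\u00E6" => "ae", "\u00DF" => "ss", "\u00F1" => "n" }.freeze
SAMPLE_PATTERN = Regexp.union(SAMPLE_MAP.keys)

def unaccent_gsub(str)
  str.gsub(SAMPLE_PATTERN, SAMPLE_MAP) # each match is looked up in the hash
end

puts unaccent_tr("gar\u00E7on")
puts unaccent_gsub("stra\u00DFe")
```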

Cheers,
Dave


Martin DeMello

1/23/2006 5:04:00 PM

Dave Burt <dave@burt.id.au> wrote:
>
> Except that won't work, because tr only matches bytes, not multi-byte
> characters.
>
> (It might work after applying one of the Unicode string extensions that have
> been floating around recently. But not in standard Ruby.)

Oh - didn't know that! Pretty sad. Thanks for the correction.

martin