Asp Forum - Unicode in Regex

Greg Willits

11/30/2007 8:18:00 PM

This is mostly a Ruby thing, and partly a Rails thing.

I'm expecting a validates_format_of with a regex like this

/^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/

to allow many of the normal characters like ö é å to be submitted via
web form.

However, the extended characters are being rejected.

This works just fine though (which is just a-zA-Z)

/^[\x41-\x5A\x61-\x7A\.\'\-\ ]*?$/

It also seems to fail with full \x0000 numbers. Is there a limit at \xFF?

Some plain Ruby tests seem to suggest unicode characters don't work at
all??

p 'abvHgtwHFuG'.scan(/[a-z]/)
p 'abvHgtwHFuG'.scan(/[A-Z]/)
p 'abvHgtwHFuG'.scan(/[\x41-\x5A]/)
p 'abvHgtwHFuG'.scan(/[\x61-\x7A]/)
p 'aébvHögtåwHÅFuG'.scan(/[\xC0-\xD6\xD9-\xF6\xF9-\xFF]/)

["a", "b", "v", "g", "t", "w", "u"]
["H", "H", "F", "G"]
["H", "H", "F", "G"]
["a", "b", "v", "g", "t", "w", "u"]
["\303", "\303", "\303", "\303"]

So, what's the secret to using unicode character ranges in Ruby regex
(or Rails validations)?

--
def gw
acts_as_n00b
writes_at(www.railsdev.ws)
end
--
Posted via http://www.ruby-....

32 Answers

Dale Martenson

11/30/2007 9:05:00 PM

On Nov 30, 2:18 pm, Greg Willits <li...@gregwillits.ws> wrote:

> So, what's the secret to using unicode character ranges in Ruby regex
> (or Rails validations)?

Tim Bray gave a great talk about I18N, M17N and Unicode at the 2006
Ruby Conference. His presentation can be found at:

http://www.tbray.org/talks/rubyco...

He described how many member functions have trouble dealing with these
character sets. He made special reference to regular expressions.

--Dale

Greg Willits

11/30/2007 10:01:00 PM

Dale Martenson wrote:
> On Nov 30, 2:18 pm, Greg Willits <li...@gregwillits.ws> wrote:
>
>> So, what's the secret to using unicode character ranges in Ruby regex
>> (or Rails validations)?
>
> Tim Bray gave a great talk about I18N, M17N and Unicode at the 2006
> Ruby Conference. His presentation can be found at:
>
> http://www.tbray.org/talks/rubyco...
>
> He described how many member functions have trouble dealing with these
> character sets. He made special reference to regular expressions.

That's just beyond sad.

I've been using Lasso for several years now, and since *2003* it has provided
complete support for Unicode. I know there's some esoterics it may not
deal with, but for all practical purposes we can round-trip data in
western and eastern languages with Lasso quite easily.

How can all these other languages be so far behind?

Pretty bad if I can't even allow Mr. Muños or Göran to enter their names
in a web form with proper server side validations. Aargh.

-- gw
--
Posted via http://www.ruby-....

MonkeeSage

12/1/2007 5:25:00 AM

On Nov 30, 4:00 pm, Greg Willits <li...@gregwillits.ws> wrote:
> Dale Martenson wrote:
> > On Nov 30, 2:18 pm, Greg Willits <li...@gregwillits.ws> wrote:
>
> >> So, what's the secret to using unicode character ranges in Ruby regex
> >> (or Rails validations)?
>
> > Tim Bray gave a great talk about I18N, M17N and Unicode at the 2006
> > Ruby Conference. His presentation can be found at:
>
> >http://www.tbray.org/talks/rubyco...
>
> > He described how many member functions have trouble dealing with these
> > character sets. He made special reference to regular expressions.
>
> That's just beyond sad.
>
> I've been using Lasso for several years now, and since *2003* it has provided
> complete support for Unicode. I know there's some esoterics it may not
> deal with, but for all practical purposes we can round-trip data in
> western and eastern languages with Lasso quite easily.
>
> How can all these other languages be so far behind?
>
> Pretty bad if I can't even allow Mr. Muños or Göran to enter their names
> in a web form with proper server side validations. Aargh.
>
> -- gw
> --
> Posted via http://www.ruby-....

Ruby 1.8 doesn't have unicode support (1.9 is starting to get it).
Everything in ruby is a bytestring.

irb(main):001:0> 'aébvHögtåwHÅFuG'.scan(/./)
=> ["a", "\303", "\251", "b", "v", "H", "\303", "\266", "g", "t",
"\303", "\245", "w", "H", "\303", "\205", "F", "u", "G"]

So your character class is matching the first byte of the composite
characters (which is \303 in octal), and skipping the next (since it's
below the range). You probably want something like...

reg = /[\xc0-\xd6\xd9-\xf6\xf9-\xff][\x80-\xbc]/
'aébvHögtåwHÅFuG'.scan(reg)

irb(main):006:0* reg = /[\xc0-\xd6\xd9-\xf6\xf9-\xff][\x80-\xbc]/
=> /[\xc0-\xd6\xd9-\xf6\xf9-\xff][\x80-\xbc]/
irb(main):007:0> 'aébvHögtåwHÅFuG'.scan(reg)
=> ["\303\251", "\303\266", "\303\245", "\303\205"]
irb(main):008:0> "å" == "\303\245"
=> true

Ps. I'm not entirely sure the value of the second character class is
right.

Regards,
Jordan
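
On the P.S. above: in UTF-8, continuation bytes run from \x80 to \xBF, so an upper bound of \xbc on the second class would miss ý, þ and ÿ. A minimal sketch to check this, assuming Ruby 1.9 or later and UTF-8 strings (octal_bytes is a hypothetical helper, not part of Ruby):

```ruby
# Sketch for Ruby 1.9+ (assumption: strings are UTF-8).
# \u escapes are used so the example does not depend on the file's encoding.

# Hypothetical helper: the bytes of a string, written in octal
# the same way irb under Ruby 1.8 prints them.
def octal_bytes(str)
  str.bytes.map { |b| format("\\%o", b) }
end

e_acute = "\u00E9" # é
y_diaer = "\u00FF" # ÿ

p octal_bytes(e_acute) # the two bytes \303 and \251
p octal_bytes(y_diaer) # the two bytes \303 and \277

# 0277 octal is 0xBF, which lies above the \xbc upper bound.
p y_diaer.bytes.last > 0xBC
```

Widening the second class to [\x80-\xbf] would cover every two-byte sequence that starts with \303.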

Jimmy Kofler

12/1/2007 10:17:00 AM

> Unicode in Regex
> Posted by Greg Willits (-gw-) on 30.11.2007 21:18
> This is mostly a Ruby thing, and partly a Rails thing.
>
> I'm expecting a validates_format_of with a regex like this
>
> /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
>
> to allow many of the normal characters like ö é å to be submitted via
> web form.

How about the utf8 validation regex here:
http://snippets.dzone.com/posts... ?
--
Posted via http://www.ruby-....

Greg Willits

12/2/2007 8:36:00 PM

Greg Willits wrote:

> I'm expecting a validates_format_of with a regex like this
> /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
> to allow many of the normal characters like ö é å to be submitted via
> web form. However, the extended characters are being rejected.

So, I've been pounding the web for info on UTF8 in Ruby and Rails the
past couple days to concoct some validations that allow UTF8
characters. I have discovered that I can get a little further by doing
the following:
- declaring $KCODE = 'UTF8'
- adding /u to regex expressions.

The only thing not working now is the ability to define a range of \x
characters in a regex.

So, this /^[a-zA-Z\xE4]*?$/u will validate that a string is allowed
to have an ä in it. Perfect.

But... this fails /^[a-zA-Z\xE4-\xE6]*?$/u

But... this works /^[a-zA-Z\xE4\xE5\xE6]*?$/u

I've boiled the experiments down to realizing I can't define a range
with \x

Is this just one of those things that just doesn't work yet WRT Ruby/
Rails/UTF8, or is there another syntax? I've scoured all the regex
docs I can find, and they seem to indicate a range should work.

For now, I just have all the characters I want included < \xFF listed
individually.

utf_accents = '\xC0\xC1\xC2\.......'

Is_person_name = /^[a-zA-Z#{utf_accents}\.\'\-\ ]*?$/u

But I'd like to solve the range notation if I can.

--
def gw
acts_as_n00b
writes_at(www.railsdev.ws)
end
--
Posted via http://www.ruby-....

Daniel DeLorme

12/3/2007 1:19:00 AM

MonkeeSage wrote:
> Ruby 1.8 doesn't have unicode support (1.9 is starting to get it).

It enrages me to see this kind of FUD. Through regular expressions, ruby
1.8 has 80-90% complete utf8 support. And oniguruma makes utf8 support
well-nigh 100% complete.

>> 'aébvHögtåwHÅFuG'.scan(/./u)
=> ["a", "é", "b", "v", "H", "ö", "g", "t", "å", "w", "H", "Å", "F",
"u", "G"]

>> 'aébvHögtåwHÅFuG'.scan(/[éöåÅ]/u)
=> ["é", "ö", "å", "Å"]

Ok, sometimes you have to take a weird approach because of the missing
10-20%, but it's still workable
>> 'aébvHögtåwHÅFuG'.scan(/(?:\303\251|\303\266|\303\245|\303\205)/u)
=> ["é", "ö", "å", "Å"]

> Everything in ruby is a bytestring.

YES! And that's exactly how it should be. Who is it that spread the
flawed idea that strings are fundamentally made of characters? I'd like
to slap him around a little. Fundamentally, ever since the word "string"
was applied to computing, strings were made of 8-BIT CHARS, not n-bit
characters. If only the creators of C had called that datatype "byte"
instead of "char" it would have saved us so many misunderstandings.

Usually the complaint about the lack of unicode support is that
something like "日本語".length returns 9 instead of 3, or that
"日本語".index("語") returns 6 instead of 2. It's nice that people want to
completely redefine the API to return character positions and all that,
but please don't complain that it's broken just because you happen to be
using it incorrectly. Use the right tool for the job. SQL for database
queries, non-home-brewed crypto libraries for security, regular
expressions for string manipulation.

I'm terribly sorry for the rant but I had to get it off my chest.

Dan
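
For comparison, the 9-versus-3 and 6-versus-2 figures above can be reproduced side by side. This is a minimal sketch that assumes Ruby 2.0 or later (String#b is not available earlier) and a UTF-8 string; counts and offsets are hypothetical helper names:

```ruby
# Sketch for Ruby 2.0+ (assumption: the input is a UTF-8 string).

# Hypothetical helper: three ways of measuring the same string.
def counts(str)
  {
    bytes:       str.bytesize,         # what String#length reported in 1.8
    chars:       str.length,           # character count in 1.9+
    regex_chars: str.scan(/./mu).size  # the regex-based count described above
  }
end

# Hypothetical helper: where needle starts, in characters and in bytes.
def offsets(str, needle)
  { char: str.index(needle), byte: str.b.index(needle.b) }
end

nihongo = "\u65E5\u672C\u8A9E" # 日本語
go      = "\u8A9E"             # 語

p counts(nihongo)       # bytes is 9, chars and regex_chars are both 3
p offsets(nihongo, go)  # char offset is 2, byte offset is 6
```

Both answers are right; they just measure different units, which is the point of the rant above.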

Daniel DeLorme

12/3/2007 1:41:00 AM

Greg Willits wrote:
> Greg Willits wrote:
>
>> I'm expecting a validates_format_of with a regex like this
>> /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
>> to allow many of the normal characters like ö é å to be submitted via
>> web form. However, the extended characters are being rejected.
>
>
> So, I've been pounding the web for info on UTF8 in Ruby and Rails the
> past couple days to concoct some validations that allow UTF8
> characters. I have discovered that I can get a little further by doing
> the following:
> - declaring $KCODE = 'UTF8'
> - adding /u to regex expressions.
>
> The only thing not working now is the ability to define a range of \x
> characters in a regex.
>
> So, this /^[a-zA-Z\xE4]*?$/u will validate that a string is allowed
> to have an ä in it. Perfect.
>
> But... this fails /^[a-zA-Z\xE4-\xE6]*?$/u
>
> But... this works /^[a-zA-Z\xE4\xE5\xE6]*?$/u
>
> I've boiled the experiments down to realizing I can't define a range
> with \x
>
> Is this just one of those things that just doesn't work yet WRT Ruby/
> Rails/UTF8, or is there another syntax? I've scoured all the regex
> docs I can find, and they seem to indicate a range should work.

Let me try to explain that in order to redeem myself from my previous
angry post.

Basically, \xE4 is counted as the byte value 0xE4, not the unicode
character U+00E4. And in a range expression, each escaped value is taken
as one character within the range. Which results in not-immediately
obvious situations:

>> 'aébvHögtåwHÅFuG'.scan(/[\303\251]/u)
=> []
>> 'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
=> ["é"]

What is happening in the first case is that the string does not contain
characters \303 or \251 because those are invalid utf8 sequences. But
when the value "\303\251" is *inlined* into the regex, that is
recognized as the utf8 character "é" and a match is found.

So ranges *do* work in utf8 but you have to be careful:

>> "àâäçèéêîïôü".scan(/[ä-î]/u)
=> ["ä", "ç", "è", "é", "ê", "î"]
>> "àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
=> ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
"\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
"\264", "\303", "\274"]
>> "àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
=> ["ä", "ç", "è", "é", "ê", "î"]

Hope this helps.

Dan
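
The same range can also be written with code points instead of inlined bytes. This is only a sketch and assumes Ruby 1.9 or later, where the \u escape is available in regex literals; it does not apply to the 1.8 setup discussed in this thread. latin_range_matches is a hypothetical helper name:

```ruby
# Sketch for Ruby 1.9+ (assumption: UTF-8 strings; \u escapes need 1.9 or later).

# Hypothetical helper: scan for characters from U+00E4 (ä) to U+00EE (î).
# The range is expressed in code points, so no byte juggling is needed.
def latin_range_matches(text)
  text.scan(/[\u00E4-\u00EE]/)
end

# àâäçèéêîïôü, written with escapes so the file encoding does not matter.
text = "\u00E0\u00E2\u00E4\u00E7\u00E8\u00E9\u00EA\u00EE\u00EF\u00F4\u00FC"

p latin_range_matches(text) # the six characters ä ç è é ê î
```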

MonkeeSage

12/3/2007 1:47:00 AM

On Dec 2, 2:35 pm, Greg Willits <li...@gregwillits.ws> wrote:
> Greg Willits wrote:
> > I'm expecting a validates_format_of with a regex like this
> > /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
> > to allow many of the normal characters like ö é å to be submitted via
> > web form. However, the extended characters are being rejected.
>
> So, I've been pounding the web for info on UTF8 in Ruby and Rails the
> past couple days to concoct some validations that allow UTF8
> characters. I have discovered that I can get a little further by doing
> the following:
> - declaring $KCODE = 'UTF8'
> - adding /u to regex expressions.
>
> The only thing not working now is the ability to define a range of \x
> characters in a regex.
>
> So, this /^[a-zA-Z\xE4]*?$/u will validate that a string is allowed
> to have an ä in it. Perfect.
>
> But... this fails /^[a-zA-Z\xE4-\xE6]*?$/u
>
> But... this works /^[a-zA-Z\xE4\xE5\xE6]*?$/u
>
> I've boiled the experiments down to realizing I can't define a range
> with \x
>
> Is this just one of those things that just doesn't work yet WRT Ruby/
> Rails/UTF8, or is there another syntax? I've scoured all the regex
> docs I can find, and they seem to indicate a range should work.
>
> For now, I just have all the characters I want included < \xFF listed
> individually.
>
> utf_accents = '\xC0\xC1\xC2\.......'
>
> Is_person_name = /^[a-zA-Z#{utf_accents}\.\'\-\ ]*?$/u
>
> But I'd like to solve the range notation if I can.
>
> --
> def gw
> acts_as_n00b
> writes_at(www.railsdev.ws)
> end
> --
> Posted via http://www.ruby-....

This seems to work...

$KCODE = "UTF8"
p /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?/u =~ "Jäsp...it works"
# => 0

However, it looks to me like it would be more robust to use a slightly
modified version of UTF8REGEX (found in the link Jimmy posted
above)...

UTF8REGEX = /\A(?:
[a-zA-Z\.\-\'\ ]
| [\xC2-\xDF][\x80-\xBF]
| \xE0[\xA0-\xBF][\x80-\xBF]
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
| \xED[\x80-\x9F][\x80-\xBF]
| \xF0[\x90-\xBF][\x80-\xBF]{2}
| [\xF1-\xF3][\x80-\xBF]{3}
| \xF4[\x80-\x8F][\x80-\xBF]{2}
)*\z/mnx

p UTF8REGEX =~ "Jäsp...it works here too"
# => 0

Look at the link to see the explanation of the alternations.

Regards,
Jordan
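
On Ruby 1.9 or later the byte-level alternations are not strictly needed, because String#valid_encoding? checks well-formedness and character classes can hold code points. The sketch below is an assumption-laden variant, not a drop-in replacement for UTF8REGEX: it is narrower, accepting only the Latin-1 letter ranges from the original question, and valid_person_name? is a hypothetical name rather than Rails API.

```ruby
# Sketch for Ruby 1.9+ (assumption: input arrives tagged as UTF-8).

# Hypothetical helper, not part of Rails or Ruby.
def valid_person_name?(input)
  # Same ranges as the original question, but as code points:
  # U+00C0-U+00D6, U+00D9-U+00F6, U+00F9-U+00FF, plus . ' - and space.
  pattern = /\A[a-zA-Z\u00C0-\u00D6\u00D9-\u00F6\u00F9-\u00FF.'\- ]*\z/

  # valid_encoding? rejects malformed byte sequences first, which is the
  # job the alternations in UTF8REGEX do by hand.
  input.valid_encoding? && !!(input =~ pattern)
end

p valid_person_name?("G\u00F6ran")      # Göran
p valid_person_name?("Mr. Mu\u00F1os")  # Mr. Muños
p valid_person_name?("A \u00D7 B")      # × (U+00D7) is outside the class
```

In a Rails model the same pattern could be handed to validates_format_of through its :with option.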

Daniel DeLorme

12/3/2007 1:56:00 AM

Greg Willits wrote:
> So, this /^[a-zA-Z\xE4]*?$/u will validate that a string is allowed
> to have an ä in it. Perfect.

If that actually works, it means you are really using ISO-8859-1
strings, not UTF-8.

> utf_accents = '\xC0\xC1\xC2\.......'

Nope, that's not UTF-8. UTF-8 characters ÀÁÂ would look like
utf_accents = "\xC3\x80\xC3\x81\xC3\x82..."

Dan
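
The difference between the two encodings is easy to see by dumping the bytes. A minimal sketch, assuming Ruby 1.9 or later (String#encode does not exist in 1.8); hex_bytes is a hypothetical helper:

```ruby
# Sketch for Ruby 1.9+ (assumption: String#encode is available).

# Hypothetical helper: the bytes of a string in \xHH notation.
def hex_bytes(str)
  str.bytes.map { |b| format("\\x%02X", b) }
end

a_umlaut = "\u00E4" # ä as a UTF-8 string

p hex_bytes(a_umlaut)                       # two bytes: \xC3 \xA4
p hex_bytes(a_umlaut.encode("ISO-8859-1"))  # one byte:  \xE4
```

So a pattern containing a bare \xE4 can only match ä when the data is ISO-8859-1, which is the diagnosis given above.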

Greg Willits

12/3/2007 7:47:00 PM

Daniel DeLorme wrote:
> Greg Willits wrote:

>> But... this fails /^[a-zA-Z\xE4-\xE6]*?$/u
>> But... this works /^[a-zA-Z\xE4\xE5\xE6]*?$/u
>>
>> I've boiled the experiments down to realizing I can't define a range
>> with \x

> Let me try to explain that in order to redeem myself from my previous
> angry post.

:-)

> Basically, \xE4 is counted as the byte value 0xE4, not the unicode
> character U+00E4. And in a range expression, each escaped value is taken
> as one character within the range. Which results in not-immediately
> obvious situations:
>
> >> 'aébvHögtåwHÅFuG'.scan(/[\303\251]/u)
> => []
> >> 'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
> => ["é"]

OK, I see oniguruma docs refer to \x as encoded byte value and \x{} as a
character code point -- which with your explanation I can finally tie
together what that means.

Took me a second to recognize the #{} as Ruby and not some new regex I'd
never seen :-P

And I realize now too I wasn't picking up on the use of octal vs
decimal.

Seems like Ruby doesn't like to use the hex \x{7HHHHHHH} variant?

> What is happening in the first case is that the string does not contain
> characters \303 or \251 because those are invalid utf8 sequences. But
> when the value "\303\251" is *inlined* into the regex, that is
> recognized as the utf8 character "é" and a match is found.
>
> So ranges *do* work in utf8 but you have to be careful:
>
> >> "àâäçèéêîïôü".scan(/[ä-î]/u)
> => ["ä", "ç", "è", "é", "ê", "î"]
> >> "àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
> => ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
> "\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
> "\264", "\303", "\274"]
> >> "àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
> => ["ä", "ç", "è", "é", "ê", "î"]
>
> Hope this helps.

Yes!

-- gw
--
Posted via http://www.ruby-....

comp.lang.ruby
