Daniel DeLorme
12/3/2007 1:41:00 AM
Greg Willits wrote:
> Greg Willits wrote:
>
>> I'm expecting a validate_format_of with a regex like this
>> /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
>> to allow many of the normal characters like ö é å to be submitted via
>> web form. However, the extended characters are being rejected.
>
>
> So, I've been pounding the web for info on UTF8 in Ruby and Rails the
> past couple days to concoct some validations that allow UTF8
> characters. I have discovered that I can get a little further by doing
> the
> following:
> - declaring $KCODE = 'UTF8'
> - adding /u to regex expressions.
>
> The only thing not working now is the ability to define a range of \x
> characters in a regex.
>
> So, this /^[a-zA-Z\xE4]*?&/u will validate that a string is allowed
> to have an ä in it. Perfect.
>
> But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u
>
> But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u
>
> I've boiled the experiments down to realizing I can't define a range
> with \x
>
> Is this just one of those things that just doesn't work yet WRT Ruby/
> Rails/UTF8, or is there another syntax? I've scoured all the regex
> docs I can find, and they seem to indicate a range should work.
Let me try to explain that in order to redeem myself from my previous
angry post.
Basically, \xE4 is treated as the byte value 0xE4, not as the Unicode
character U+00E4. And inside a character class, each escaped value is
taken as a separate one-byte member of the class. This leads to some
not immediately obvious results:
>> 'aébvHögtåwHFuG'.scan(/[\303\251]/u)
=> []
>> 'aébvHögtåwHFuG'.scan(/[#{"\303\251"}]/u)
=> ["é"]
In the first case, the escapes \303 and \251 are taken as two separate
one-byte characters. Neither byte is a valid UTF-8 sequence on its own,
so the string contains no such character and nothing matches. But when
the value "\303\251" is *inlined* (interpolated) into the regex, the two
bytes sit side by side in the pattern, are recognized as the UTF-8
character "é", and a match is found.
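If it helps, here is a small sketch for inspecting what a character
really consists of. The helper names (utf8_bytes, code_points,
octal_escapes) are made up for illustration, and it assumes the script
and its strings are UTF-8. It shows that "é" is one code point, U+00E9,
but two bytes, \303\251, which is why a lone \303 or \251 inside the
class has nothing to match.

```ruby
# Hypothetical helpers for illustration; assumes UTF-8 strings.

# Raw bytes of the string's UTF-8 encoding.
def utf8_bytes(str)
  str.unpack("C*")
end

# Unicode code points of the string.
def code_points(str)
  str.unpack("U*")
end

# The bytes written as the octal escapes used in the examples above.
def octal_escapes(str)
  utf8_bytes(str).map { |b| "\\%o" % b }.join
end

e_acute = [0xE9].pack("U")    # "é", built from its code point

p code_points(e_acute)        # => [233]       (0xE9, one character)
p utf8_bytes(e_acute)         # => [195, 169]  (0xC3 0xA9, two bytes)
puts octal_escapes(e_acute)   # prints \303\251
```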
So ranges *do* work in UTF-8, but you have to be careful:
>> "à âäçèéêîïôü".scan(/[ä-î]/u)
=> ["ä", "ç", "è", "é", "ê", "î"]
>> "à âäçèéêîïôü".scan(/[\303\244-\303\256]/u)
=> ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
"\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
"\264", "\303", "\274"]
>> "à âäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
=> ["ä", "ç", "è", "é", "ê", "î"]
Hope this helps.
Dan