comp.lang.ruby

Unicode and Character Classes -- a bug?

Richard Wiseman

9/19/2006 11:02:00 AM

Hi,

I've found some strange and unexpected behaviour to do with pattern
matching when I use Unicode. My example code follows and contains
comments to suggest what I think should happen:


$KCODE = 'u'
require 'jcode'

text = "\xa3A\nB\n\xa3C\nxD\nE"

# This pattern finds all lines that intuitively should match it.
puts "Pattern includes \"(?:x|\xa3)?\":"
text.scan(/^(?:x|\xa3)?[A-Z]$/).each {|s| puts s }

# This pattern finds all lines except the one containing the C, which is
# contrary to my intuition. I'd expect it to match all lines or, if I were
# really paranoid about Unicode, I *might* expect it to match all but the
# lines containing A and C.
puts "Pattern includes \"[x\xa3]\":"
text.scan(/^[x\xa3]?[A-Z]$/).each {|s| puts s }


The output of this is:


Pattern includes "(?:x|ú)?":
úA
B
úC
xD
E
Pattern includes "[xú]":
úA
B
xD
E


Without the first two (Unicode-specifying) lines, the output is what I
expect:


Pattern includes "(?:x|ú)?":
úA
B
úC
xD
E
Pattern includes "[xú]":
úA
B
úC
xD
E


(Notice the extra line in the second half.) The thing I think is
bizarre is that if Unicode is being used, the ú matches ONLY where it's
the very first thing in the string.

Is there something funny about Unicode characters when using character
classes? Is this a known issue, or is it something weird and/or
ignorant that I'm doing?

Thanks!

Richard

--
Posted via http://www.ruby-....

8 Answers

MonkeeSage

9/19/2006 12:41:00 PM

Hi Richard,

It appears that you were spot-on with your guess about wonky things
happening in character classes. Seemingly hex escape codes aren't
allowed there. You'll have to either use a literal character, or if
that isn't possible, do something ugly like this:

/^[x#{"\xa3"}]?[A-Z]$/ # note the interpolation

There might be another solution, hopefully so, but this should at least
work if nothing else turns up.

Regards,
Jordan
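
For readers on a newer interpreter, the interpolation idea above can be sketched as follows. This assumes Ruby 1.9 or later, where $KCODE no longer has any effect, and it substitutes the escape \u00fa (ú as a code point) for the thread's raw \xa3 byte. The helper name scan_with_class is hypothetical and does not come from the thread.

```ruby
# Sketch only. Assumes Ruby 1.9+; scan_with_class is a made-up helper name.
# The character class is built from an ordinary string and interpolated into
# the pattern, as in the workaround above.
def scan_with_class(text, class_chars)
  # Regexp.escape guards against characters such as "]" or "-" that would
  # otherwise change the meaning of the class.
  pattern = /^[#{Regexp.escape(class_chars)}]?[A-Z]$/
  text.scan(pattern)
end

text = "\u00faA\nB\n\u00faC\nxD\nE"
puts scan_with_class(text, "x\u00fa")
```

Keeping the class members in a plain string means the set of optional prefix characters can be changed without touching the pattern itself.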

Richard Wiseman

9/19/2006 1:55:00 PM

Jordan Callicoat wrote:
> /^[x#{"\xa3"}]?[A-Z]$/ # note the interpolation
>
> There might be another solution, hopefully so, but this should at least
> work if nothing else turns up.

I hadn't thought of that one - thanks for the suggestion! The simplest
(working) alternative I could think of was the parenthesised list of
individual characters as shown in the first half of the example code.


Daniel DeLorme

9/21/2006 12:01:00 AM

Richard Wiseman wrote:
> puts "Pattern includes \"[x\xa3]\":"
> text.scan(/^[x\xa3]?[A-Z]$/).each {|s| puts s }

That is very weird indeed. It's normal that your example doesn't work, because
\xa3 is NOT valid utf8. But I would've expected it to work if you used the
correct utf8 sequence for "ú" ("\xc3\xba"), except it doesn't!

$KCODE='u'
=> "u"
text = "\xc3\xbaA\nB\n\xc3\xbaC\nxD\nE"
=> "úA\nB\núC\nxD\nE"
text.scan(/^[xú]?[A-Z]$/)
=> ["úA", "B", "úC", "xD", "E"]
text.scan(/^[x\xc3\xba]?[A-Z]$/)
=> ["B", "xD", "E"]

WTF? Can anyone explain this?
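
As a point of comparison on a newer interpreter: the sketch below assumes Ruby 1.9 or later (not the 1.8 setup used in this thread), where a character can be written in a pattern as a \u code point escape rather than as separate byte escapes. It checks that "\xc3\xba" is the UTF-8 encoding of ú and that a character class containing \u00fa matches both ú lines. The helper names are hypothetical, and the sketch does not explain the 1.8 result above.

```ruby
# Sketch only. Assumes Ruby 1.9+; both helper names are made up.

# Render a string's bytes as hex escapes, to compare with "\xc3\xba".
def hex_escapes(str)
  str.bytes.map { |b| format('\x%02x', b) }.join
end

# Same pattern as in the post, with the class member written as a code point.
def prefixed_lines(text)
  text.scan(/^[x\u00fa]?[A-Z]$/)
end

puts hex_escapes("\u00fa")
text = "\u00faA\nB\n\u00faC\nxD\nE"
puts prefixed_lines(text).inspect
```

Writing the class member as one code point keeps the pattern independent of how many bytes the character occupies in UTF-8.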

MonkeeSage

9/21/2006 3:11:00 AM

Daniel DeLorme wrote:
> That is very weird indeed. It's normal that your example doesn't work, because
> \xa3 is NOT valid utf8. But I would've expected it to work if you used the
> correct utf8 sequence for "ú" ("\xc3\xba"), except it doesn't!

That shouldn't matter. He was matching the same hex escape he used in
his string (viz., \xa3). It shouldn't matter whether it's unicode or
just random data; the match should go through (or fail) in either case.

> WTF? Can anyone explain this?

Not really, because I don't understand Oniguruma (the regexp engine);
I'm barely smart enough to _use_ regexps. ;) But seemingly, you can't
use hex escapes in character classes, so you have to use the literal or
do other things to work around it (see last two posts above).

Regards,
Jordan

Verno Miller

9/21/2006 8:25:00 AM

>Jordan Callicoat wrote:
> Daniel DeLorme wrote:
>
> ...
>
>> WTF? Can anyone explain this?
>
> Not really, because I don't understand Oniguruma (the regexp engine);
> I'm barely smart enough to _use_ regexps. ;) But seemingly, you can't
> use hex escapes in character classes, so you have to use the literal or
> do other things to work around it (see last two posts above).
>
> Regards,
> Jordan


Just a pointer to some examples how to parse UTF-8 encoded strings in
Ruby:

http://bigbold.com/snippets/posts...



MonkeeSage

9/21/2006 8:51:00 AM

Verno Miller wrote:
> Just a pointer to some examples how to parse UTF-8 encoded strings in
> Ruby:

Hi Verno,

I used to have a class that used that technique to fake UTF-8 support.
I now use Nikolai Weibull's extension
(http://rubyforge.org/projects/char...).

Regards,
Jordan

Verno Miller

9/21/2006 9:38:00 AM

>Jordan Callicoat wrote:
> Verno Miller wrote:
>> Just a pointer to some examples how to parse UTF-8 encoded strings in
>> Ruby:
>
> Hi Verno,
>
> I used to have a class that used that technique to fake UTF-8 support.
> I now use Nikolai Weibull's extension
> (http://rubyforge.org/projects/char...).
>
> Regards,
> Jordan


Thanks for this one, Jordan! I seem to have missed some stuff on
redhanded as of late, esp.

http://redhanded.hobix.com/inspect/nikolaiSUtf8LibIsAll...

For some info on Oniguruma btw I've run across this page:

http://www.geocities.jp/kosako3/oniguruma/...

I played with the u option regex hack quite a while back (it seemed to
work pretty well even with some Japanese chars, if I remember
correctly), so I just thought I'd throw it in as a tip.

Thanks, again, for the update to Nikolai Weibull's extension!

Cheers,
Verno


MonkeeSage

9/21/2006 9:58:00 AM

Verno Miller wrote:
> Thanks for this one, Jordan! I seem to have missed some stuff on
> redhanded as of late, esp.

NP :)

> For some info on Oniguruma btw I've run across this page:
>
> http://www.geocities.jp/kosako3/oniguruma/...

And thank YOU for this, Verno! An Oniguruma cheat sheet. That's sweet!! :D

Regards,
Jordan