comp.lang.ruby

Puzzling regex behaviour

Ian Macdonald

2/13/2007 8:20:00 PM

Hello,

Can anyone explain this to me?

$ echo $LANG
nl_NL
$ irb -f
irb(main):001:0> foo = "préférées"
=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> nil
irb(main):003:0> foo =~ /\W/
=> 2

First question: Why does the final statement return 2 instead of nil?
All characters in foo are alphabetic characters in this locale.

Then:

$ echo $LANG
nl_NL
$ cat ./foo
#!/usr/bin/ruby -w

foo = "préférées"
p foo =~ /[^[:alnum:]]/
p foo =~ /\W/
$ ./foo
2
2

Huh?

Second question: Why does the first regex match now return 2 instead of
nil?

To my way of thinking, both statements should always return nil, whether
or not they are typed into irb or run in a stand-alone script. At the
very least, both statements should return the same answer, regardless of
the context.

What am I missing here?

Ian
--
Ian Macdonald | tachyon emissions overloading the system
ian@caliban.org |
http://www.ca... |
|
|

23 Answers

Robert Klemme

2/13/2007 9:40:00 PM

On 13.02.2007 21:19, Ian Macdonald wrote:
> Hello,
>
> Can anyone explain this to me?
>
> $ echo $LANG
> nl_NL
> $ irb -f
> irb(main):001:0> foo = "préférées"
> => "pr\351f\351r\351es"
> irb(main):002:0> foo =~ /[^[:alnum:]]/
> => nil
> irb(main):003:0> foo =~ /\W/
> => 2
>
> First question: Why does the final statement return 2 instead of nil?
> All characters in foo are alphabetic characters in this locale.
>
> Then:
>
> $ echo $LANG
> nl_NL
> $ cat ./foo
> #!/usr/bin/ruby -w
>
> foo = "préférées"
> p foo =~ /[^[:alnum:]]/
> p foo =~ /\W/
> $ ./foo
> 2
> 2
>
> Huh?
>
> Second question: Why does the first regex match now return 2 instead of
> nil?
>
> To my way of thinking, both statements should always return nil, whether
> or not they are typed into irb or run in a stand-alone script. At the
> very least, both statements should return the same answer, regardless of
> the context.
>
> What am I missing here?

Maybe there is an initialization in .irbrc that leads to a changed
locale inside IRB. Or your IRB belongs to a different Ruby version on
that system.

Other than that, I guess you tripped into the wide and wild country of
i18n - many strange things can be found there. Maybe \w and \W only
treat ASCII [a-z] characters as word characters.
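
The guess about \w and \W can be checked on a current interpreter with a minimal sketch. It assumes Ruby 1.9 or later, where strings carry an encoding, so it is not the 1.8.x setup discussed in this thread and says nothing about the irb/script difference. The variable names are illustrative only.

```ruby
# Assumes Ruby 1.9 or later; Ruby 1.8 regexps behave differently.
word = "pr\u00e9f\u00e9r\u00e9es"          # "préférées" as a UTF-8 string

# \w and \W only consider ASCII letters, digits and underscore.
ascii_only = word =~ /\W/

# POSIX bracket classes are Unicode-aware for UTF-8 strings.
posix_class = word =~ /[^[:alnum:]]/

# A Unicode property class, usable where a Unicode-aware \W is wanted.
unicode_prop = word =~ /[^\p{Word}]/

p ascii_only
p posix_class
p unicode_prop
```

On such an interpreter the first match should report index 2 (the first é), while the other two should find no non-word character and return nil.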

Kind regards

robert

Ian Macdonald

2/13/2007 10:08:00 PM

On Wed 14 Feb 2007 at 06:45:08 +0900, Robert Klemme wrote:

> Maybe there is an initialization in .irbrc that leads to a changed
> locale inside IRB.

Nope; I had hoped it would be that easy, but as you can see from my
snippet of output, I started irb with -f, which bypasses ~/.irbrc.
ENV['LANG'] also prints nl_NL in irb, so that can't be it.
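
One way to compare the two contexts is to print the same settings from irb and from the script. The sketch below is only a suggestion: text_environment is a hypothetical helper, and the Encoding calls assume Ruby 1.9 or later, so on the 1.8.5 build used here only the ENV lookups would apply.

```ruby
# Collect the settings that influence how Ruby (1.9+) interprets text.
# text_environment is a hypothetical helper name, not part of Ruby.
def text_environment
  {
    "LANG"             => ENV["LANG"],
    "LC_ALL"           => ENV["LC_ALL"],
    "LC_CTYPE"         => ENV["LC_CTYPE"],
    "source encoding"  => __ENCODING__,
    "default external" => Encoding.default_external,
    "locale charmap"   => Encoding.locale_charmap
  }
end

text_environment.each { |name, value| puts "#{name}: #{value.inspect}" }
```

Running it once inside irb and once as a script, then comparing the two listings, would show whether the environment really is identical in both cases.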

> Or your IRB belongs to a different Ruby version on that system.

I compiled it myself, so there has been no mix-and-matching.

> Other than that, I guess you tripped into the wide and wild country of
> i18n - many strange things can be found there. Maybe \w and \W only
> treat ASCII [a-z] characters as word characters.

It does seem that way, as Perl also appears to treat them this way.

However, I'm still puzzled why there's a difference between irb and a
stand-alone script.

Ian
--
Ian Macdonald | If you are what you eat, I guess that makes
ian@caliban.org | me a cheese danish. -- Anonymous
http://www.ca... |
|
|

David Balmain

2/13/2007 10:53:00 PM

On 2/14/07, Ian Macdonald <ian@caliban.org> wrote:
> However, I'm still puzzled why there's a difference between irb and a
> stand-alone script.

Maybe your editor saves the script in UTF-8 format. The irb example
clearly encodes the string in ISO-8859-1. That could explain the
difference.

--
Dave Balmain
http://www.daveba...

David Balmain

2/13/2007 11:01:00 PM

On 2/14/07, David Balmain <dbalmain.ml@gmail.com> wrote:
> On 2/14/07, Ian Macdonald <ian@caliban.org> wrote:
> > However, I'm still puzzled why there's a difference between irb and a
> > stand-alone script.
>
> Maybe your editor saves the script in UTF-8 format. The irb example
> clearly encodes the string in ISO-8859-1. That could explain the
> difference.

For example:

~$ echo $LANG
en_US.ISO-8859-1
~$ irb -f
irb(main):001:0> "pr\351f\351r\351es" =~ /[^[:alnum:]]/
=> nil
irb(main):002:0> "pr\303\251f\303\251r\303\251es" =~ /[^[:alnum:]]/
=> 3

Not exactly what you had but it probably has something to do with the
encoding of the é.
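
The two byte sequences in the irb session above can be reproduced with a short sketch. It assumes Ruby 1.9 or later (String#encode and String#force_encoding do not exist in 1.8), and octal_dump is a hypothetical helper written for this illustration. The last step treats the UTF-8 bytes as Latin-1 text, which is one plausible reading of why the second match lands on index 3.

```ruby
# Assumes Ruby 1.9+. octal_dump is a hypothetical helper, not a library call.
def octal_dump(str)
  # Show ASCII bytes as-is and every other byte as a \NNN octal escape.
  str.bytes.map { |b| b < 128 ? b.chr : format("\\%03o", b) }.join
end

utf8   = "pr\u00e9f\u00e9r\u00e9es"     # "préférées" encoded as UTF-8
latin1 = utf8.encode("ISO-8859-1")      # same text, one byte per character

puts octal_dump(latin1)
puts octal_dump(utf8)

# Reinterpret the UTF-8 bytes as Latin-1 text, then convert back to UTF-8
# so the misread characters can be printed and matched.
misread = utf8.dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts misread
p misread =~ /[^[:alnum:]]/
```

Each é is a single byte \351 in Latin-1 and the pair \303\251 in UTF-8. Read as Latin-1, that pair becomes a letter (Ã) followed by the copyright sign, and the copyright sign is the first non-alphanumeric character, at index 3.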

--
Dave Balmain
http://www.daveba...

Ian Macdonald

2/13/2007 11:43:00 PM

On Wed 14 Feb 2007 at 08:01:15 +0900, David Balmain wrote:

> On 2/14/07, David Balmain <dbalmain.ml@gmail.com> wrote:
> >On 2/14/07, Ian Macdonald <ian@caliban.org> wrote:
> >> However, I'm still puzzled why there's a difference between irb and a
> >> stand-alone script.
> >
> >Maybe your editor saves the script in UTF-8 format. The irb example
> >clearly encodes the string in ISO-8859-1. That could explain the
> >difference.
>
> For example;
>
> ~$ echo $LANG
> en_US.ISO-8859-1
> ~$ irb -f
> irb(main):001:0> "pr\351f\351r\351es" =~ /[^[:alnum:]]/
> => nil
> irb(main):002:0> "pr\303\251f\303\251r\303\251es" =~ /[^[:alnum:]]/
> => 3
>
> Not exactly what you had but it probably has something to do with the
> encoding of the é.

My editor is vim and I run it in the nl_NL locale, so it doesn't start
in UTF-8 mode. To double-check:

:set encoding?
encoding=latin1

And if we dump my little script:

$ od -c foo
0000000 # ! / u s r / b i n / r u b y
0000020 - w \n \n f o o = " p r 351 f 351
0000040 r 351 e s " \n p f o o = ~ /
0000060 [ ^ [ : a l n u m : ] ] / \n p
0000100 f o o = ~ / \ W / \n

You can see that it is, indeed, saved as Latin-1, not UTF-8.
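
The same check can be done from Ruby instead of od. This is a minimal sketch assuming Ruby 2.0 or later (for String#b); looks_like_utf8? is a hypothetical helper, and the byte strings are typed in by hand rather than read from the script file.

```ruby
# Assumes Ruby 2.0+. looks_like_utf8? is a hypothetical helper name.
def looks_like_utf8?(bytes)
  # Work on a copy so the caller's string keeps its original encoding.
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

latin1_bytes = "pr\xE9f\xE9r\xE9es".b              # é as the single byte 0xE9 (octal 351)
utf8_bytes   = "pr\xC3\xA9f\xC3\xA9r\xC3\xA9es".b  # é as the byte pair 0xC3 0xA9

p looks_like_utf8?(latin1_bytes)
p looks_like_utf8?(utf8_bytes)
```

The Latin-1 bytes are not a valid UTF-8 sequence, so the first call should print false and the second true. Pure ASCII data passes the check under either encoding, so it only helps when accented characters are present.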

The mystery continues. ;-)

Ian
--
Ian Macdonald | It's not whether you win or lose, it's how
ian@caliban.org | you place the blame.
http://www.ca... |
|
|

Ian Macdonald

2/14/2007

On Wed 14 Feb 2007 at 08:43:06 +0900, Ian Macdonald wrote:

> The mystery continues. ;-)

I should have asked by now, but can anyone else reproduce this with
Ruby 1.8.5?

Ian
--
Ian Macdonald | The Gordian Maxim: If a string has one
ian@caliban.org | end, it has another.
http://www.ca... |
|
|

David Balmain

2/14/2007 12:08:00 AM

On 2/14/07, Ian Macdonald <ian@caliban.org> wrote:
> On Wed 14 Feb 2007 at 08:43:06 +0900, Ian Macdonald wrote:
>
> > The mystery continues. ;-)
>
> I should have asked by now, but can anyone else reproduce this with
> Ruby 1.8.5?

I can reproduce this with 1.8.4.

--
Dave Balmain
http://www.daveba...

Ian Macdonald

2/14/2007 12:13:00 AM

On Wed 14 Feb 2007 at 09:08:17 +0900, David Balmain wrote:

> On 2/14/07, Ian Macdonald <ian@caliban.org> wrote:
> >
> >I should have asked by now, but can anyone else reproduce this with
> >Ruby 1.8.5?
>
> I can reproduce this 1.8.4

Just to be clear, you are confirming that the following code:

foo = "préférées"
p foo =~ /[^[:alnum:]]/

prints nil in irb and 2 in a stand-alone script when in both cases your
locale is preset to nl_NL?

Ian
--
Ian Macdonald | On a clear disk you can seek forever.
ian@caliban.org |
http://www.ca... |
|
|

David Balmain

2/14/2007 12:19:00 AM

On 2/14/07, Ian Macdonald <ian@caliban.org> wrote:
> On Wed 14 Feb 2007 at 09:08:17 +0900, David Balmain wrote:
>
> > On 2/14/07, Ian Macdonald <ian@caliban.org> wrote:
> > >
> > >I should have asked by now, but can anyone else reproduce this with
> > >Ruby 1.8.5?
> >
> > I can reproduce this 1.8.4
>
> Just to be clear, you are confirming that the following code:
>
> foo = "préférées"
> p foo =~ /[^[:alnum:]]/
>
> prints nil in irb and 2 in a stand-alone script when in both cases your
> locale is preset to nl_NL?

Not nl_NL but en_US.ISO-8859-1. I get the same results as you.

--
Dave Balmain
http://www.daveba...

Rob Biedenharn

2/14/2007 1:25:00 AM

On Feb 13, 2007, at 7:13 PM, Ian Macdonald wrote:

> On Wed 14 Feb 2007 at 09:08:17 +0900, David Balmain wrote:
>
>> On 2/14/07, Ian Macdonald <ian@caliban.org> wrote:
>>>
>>> I should have asked by now, but can anyone else reproduce this with
>>> Ruby 1.8.5?
>>
>> I can reproduce this 1.8.4
>
> Just to be clear, you are confirming that the following code:
>
> foo = "préférées"
> p foo =~ /[^[:alnum:]]/
>
> prints nil in irb and 2 in a stand-alone script when in both cases
> your
> locale is preset to nl_NL?
>
> Ian
> --
> Ian Macdonald | On a clear disk you can seek forever.
> ian@caliban.org |
> http://www.ca... |

I'm beginning to wonder if the original question is even accurate.
Doing nothing more than changing the encoding and re-saving the file
(where the value for foo was cut and pasted from the email), there
doesn't seem to be any discrepancy between ruby and irb. (This
output is from ruby 1.8.5, but 1.8.2 gave the same results.)

rab:code/ruby $ file regexp_and_alnum_versus_w.rb
regexp_and_alnum_versus_w.rb: ISO-8859 text
rab:code/ruby $ cat regexp_and_alnum_versus_w.rb
foo = "préférées"
alnum = /[^[:alnum:]]/
dubya = /\W/

puts "foo\n => #{foo.inspect}"
[ alnum, dubya ].each do |re|
puts "foo =~ #{re}\n => #{foo =~ re}"
end
rab:code/ruby $ ruby regexp_and_alnum_versus_w.rb
foo
=> "pr\351f\351r\351es"
foo =~ (?-mix:[^[:alnum:]])
=> 2
foo =~ (?-mix:\W)
=> 2
rab:code/ruby $ irb -r regexp_and_alnum_versus_w.rb
foo
=> "pr\351f\351r\351es"
foo =~ (?-mix:[^[:alnum:]])
=> 2
foo =~ (?-mix:\W)
=> 2
>> eixt
NameError: undefined local variable or method `eixt' for main:Object
from (irb):1
>> exit
rab:code/ruby $ file regexp_and_alnum_versus_w.rb
regexp_and_alnum_versus_w.rb: UTF-8 Unicode text
rab:code/ruby $ cat regexp_and_alnum_versus_w.rb
foo = "préférées"
alnum = /[^[:alnum:]]/
dubya = /\W/

puts "foo\n => #{foo.inspect}"
[ alnum, dubya ].each do |re|
puts "foo =~ #{re}\n => #{foo =~ re}"
end
rab:code/ruby $ ruby regexp_and_alnum_versus_w.rb
foo
=> "pr\303\251f\303\251r\303\251es"
foo =~ (?-mix:[^[:alnum:]])
=> 2
foo =~ (?-mix:\W)
=> 2
rab:code/ruby $ irb -r regexp_and_alnum_versus_w.rb
foo
=> "pr\303\251f\303\251r\303\251es"
foo =~ (?-mix:[^[:alnum:]])
=> 2
foo =~ (?-mix:\W)
=> 2
>> exit


-Rob

Rob Biedenharn http://agileconsult...
Rob@AgileConsultingLLC.com