comp.lang.ruby

Multibyte regexps...

Horacio Sanson

12/21/2005 10:00:00 AM



I am having some issues with regular expressions when working with Japanese
strings.

Using ruby-1.8.3 on Windows XP Home (Japanese version), I have this test:

irb(main):271:0> s = "鞄"
=> "\212\223"
irb(main):272:0> l = "行"
=> "\215s"
irb(main):273:0> l =~ /s/
=> 1
irb(main):274:0> puts "#{$`}<<#{$&}>>#{$'}"
E<s>>
=> nil
irb(main):275:0> "#{$`}<<#{$&}>>#{$'}"
=> "\215<<s>>"
irb(main):276:0> s =~ /l/
=> nil


As you can see, matching against a totally different character (kanji) gives me a
match. Reversing the match gives nil.


How can I get Ruby to match things correctly?

regards,
Horacio




4 Answers

Chintan Trivedi

12/21/2005 11:31:00 AM

l =~ /s/ ??

That looks for the literal character "s" in the string l, not for the value stored in the variable s.
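
To illustrate the difference, here is a minimal sketch. The strings and the helper name literal_pattern are invented for the example and are not from the original post. A bare name inside /.../ is just literal text, while #{...} interpolates a variable, and Regexp.escape keeps metacharacters in that variable from being read as regexp syntax.

```ruby
# Hypothetical helper: build a regexp that matches the *contents* of a string.
def literal_pattern(text)
  Regexp.new(Regexp.escape(text))
end

haystack = "abc a.c"
needle   = "a.c"

p haystack =~ /needle/                 # nil: looks for the letters n-e-e-d-l-e
p haystack =~ /#{needle}/              # 0: "." is a wildcard, so "abc" matches
p haystack =~ literal_pattern(needle)  # 4: only the literal "a.c" matches
```

If the intent in the original test was to match the contents of s against l, the interpolated form is the one to use.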




Yukihiro Matsumoto

12/21/2005 12:48:00 PM

Hi,

In message "Re: Multibyte regexps..."
on Wed, 21 Dec 2005 18:59:59 +0900, Horacio Sanson <hsanson@moegi.waseda.jp> writes:

|I am having some issues with regular expressions when working with japanese
|strings.
|
|Using ruby-1.8.3 on Windows XP home (Japanese version) I have this test:
|
|irb(main):271:0> s = "鞄"
|=> "\212\223"
|irb(main):272:0> l = "行"
|=> "\215s"
|irb(main):273:0> l =~ /s/
|=> 1
|irb(main):274:0> puts "#{$`}<<#{$&}>>#{$'}"
|E<s>>
|=> nil
|irb(main):275:0> "#{$`}<<#{$&}>>#{$'}"
|=> "\215<<s>>"
|irb(main):276:0> s =~ /l/
|=> nil

The encoding seems to be Shift_JIS. You have to specify the encoding
before you do regular expression matching. Put the s option after every
regular expression.

$KCODE="sjis" # to make p work right
p s = "鞄"
p l = "行"
p l =~ /s/s
puts "#{$`}<<#{$&}>>#{$'}"
p "#{$`}<<#{$&}>>#{$'}"
p s =~ /l/s

matz.
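
For readers on Ruby 1.9 or later, where $KCODE no longer has any effect and every string carries its own encoding, the false match can be reproduced directly from the bytes shown in the irb session. This is a minimal sketch and not part of the original reply; the helper names raw_bytes and sjis_string are made up for illustration.

```ruby
# The Shift_JIS bytes of l from the irb session are 0x8D 0x73 ("\215s").
# The second byte, 0x73, is also the ASCII code for "s".
def raw_bytes
  [0x8D, 0x73].pack("C*")   # binary (ASCII-8BIT) string, no character info
end

def sjis_string
  raw_bytes.force_encoding(Encoding::Shift_JIS)  # same bytes, tagged as Shift_JIS
end

p raw_bytes.length     # 2: two separate bytes
p sjis_string.length   # 1: one two-byte character
p raw_bytes =~ /s/     # 1: byte-wise matching finds the trailing byte
p sjis_string =~ /s/   # nil: matching respects character boundaries
```

The byte-wise result is the same false positive reported in the question; tagging the string with its real encoding plays the role that $KCODE and the s option played in Ruby 1.8.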


Horacio Sanson

12/26/2005 1:29:00 AM

Thanks a lot, this seems to work OK.

Where can I find documentation about this $KCODE global variable and the "s"
after each regexp? What exactly does the s mean?

Do I have to put it only on regexps with Japanese characters, or on every regexp? I
tried both and saw no difference.

When using Regexp.new to construct the regular expression, how can I add the s
option?

Sorry for so many questions, but I can't seem to find any docs about these
options.


Horacio
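
On the Regexp.new question: the Ruby 1.8 documentation describes the form Regexp.new(string, options, lang), so Regexp.new("s", nil, "s") should correspond to /s/s there. That third argument is not available in recent Ruby versions, so the sketch below instead assumes a modern Ruby and builds the pattern from a string that is already in Shift_JIS. The helper name sjis_pattern is hypothetical.

```ruby
# Hypothetical helper: build a regexp from UTF-8 text so that it matches
# Shift_JIS encoded strings.
def sjis_pattern(utf8_text)
  Regexp.new(Regexp.escape(utf8_text.encode(Encoding::Shift_JIS)))
end

# U+884C is the kanji whose Shift_JIS bytes are 0x8D 0x73 ("\215s").
line = "\u884C".encode(Encoding::Shift_JIS)

p line.bytes == [0x8D, 0x73]       # true
p line =~ sjis_pattern("\u884C")   # 0: the kanji matches itself
p line =~ sjis_pattern("s")        # nil: "s" is not a character in the string
```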



Horacio Sanson

12/26/2005 2:52:00 AM

I found some documentation about this. Thanks.

Just one question: it seems that I can do two different things to
allow Regexps to handle multibyte Shift_JIS strings. One is to set the
$KCODE global variable to "sjis", and the other is to use the "s" modifier
when constructing the regular expression.

The question is: do I need only one of the two methods, or should I use the "s"
modifier even if I set $KCODE to "sjis"?

My testing tells me that setting the $KCODE global variable alone is enough to get
Shift_JIS strings and Regexps to work correctly, but I just want to make
sure.

thanks,
Horacio
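
As far as I know, a Ruby 1.8 regexp without an encoding flag follows the current $KCODE, while the s flag pins that one regexp to Shift_JIS regardless of $KCODE, which would be consistent with the test result above. The per-regexp nature of the flag can still be observed in current Ruby, where it selects Windows-31J, Ruby's name for the Microsoft variant of Shift_JIS. The following is only a sketch under that assumption; the helper name describe is invented for the example.

```ruby
# Hypothetical helper: report a regexp's encoding and whether it is pinned.
def describe(re)
  [re.encoding.name, re.fixed_encoding?]
end

p describe(/s/)    # no flag: the encoding is not fixed
p describe(/s/s)   # s flag: fixed to Windows-31J

# The bytes 0x8D 0x73 tagged as Windows-31J form a single character.
cp932 = [0x8D, 0x73].pack("C*").force_encoding(Encoding::Windows_31J)
p cp932 =~ /s/s    # nil
p cp932 =~ /s/     # nil: the string's own encoding is enough here
```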
