Asp Forum - perl regexp to ruby one conversion ?

pere.noel

3/23/2006 12:44:00 PM

i've a perl regexp :

$field =~
m/^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x;

which is able to detect whether $field consists of UTF-8 chars or not, and i'd like to
convert it into a ruby regexp.

How to do that ?

--
une bévue

13 Answers

James Gray

3/23/2006 2:00:00 PM

On Mar 23, 2006, at 6:43 AM, Une bévue wrote:

> i've a perl regexp :
>
> $field =~
> m/^(
> [\x09\x0A\x0D\x20-\x7E] # ASCII
> | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
> | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
> | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
> | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
> | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
> | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
> | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
> )*$/x;
>
> able to detect if $field is of UTF-8 chars or not and i'd like to
> convert it into a ruby regexp.
>
> How to do that ?

The expression looks fine to me. Did you try using it?

James Edward Gray II

pere.noel

3/23/2006 2:37:00 PM

James Edward Gray II <james@grayproductions.net> wrote:

>
> The expression looks fine to me. Did you try using it?

yes, without the correct result, here is my code :

field='&é§è!çàîûtybvn€'
utf8rgx=Regexp.new('m/^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x')

the test :

flag=(field === utf8rgx)
p "flag = #{flag}"

the result being :
"flag = false"

i'm sure my encoding is utf-8...

maybe i've misunderstood "===" ?

because when trying :

truc = 'toto'
rgx=Regexp.new('^toto$')
flag=(truc === rgx)
p "flag = #{flag}"

i got :
# => "flag = false" ///seems NOT OK to me

flag=(truc =~ rgx)
p "flag = #{flag}"
# => "flag = 0" ///seems OK to me

--
une bévue

Ross Bamford

3/23/2006 2:51:00 PM

On Thu, 2006-03-23 at 23:38 +0900, Une bévue wrote:
> James Edward Gray II <james@grayproductions.net> wrote:
>
> >
> > The expression looks fine to me. Did you try using it?
>
> yes, without the correct result, here is my code :
>
> field='&é§è!çàîûtybvn€'
> utf8rgx=Regexp.new('m/^(
> [\x09\x0A\x0D\x20-\x7E] # ASCII
> | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
> | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
> | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
> | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
> | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
> | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
> | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
> )*$/x')
>
> the test :
>
> flag=(field === utf8rgx)
> p "flag = #{flag}"
>

You'll need to switch those around, as I showed in my response to your
other thread. flag will then be true, but unfortunately I think too
often:

utf8rgx === "onlyascii"
# => true

I think to do that kind of test you'd have to remove the first line
(matching ASCII chars) and not anchor the regexp with ^ and $.
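
To make the operand order concrete: String#=== is plain equality, so a String is never equal to a Regexp, while Regexp#=== tests for a match. A minimal sketch reusing the 'toto' example from earlier in the thread (the variable names string_first, regexp_first and match_index are only for illustration):

```ruby
truc = 'toto'
rgx  = Regexp.new('^toto$')

# String#=== compares for equality; a String is never equal to a Regexp.
string_first = (truc === rgx)

# Regexp#=== returns true when the pattern matches its argument.
regexp_first = (rgx === truc)

# =~ returns the offset of the first match (or nil if there is none).
match_index = (truc =~ rgx)

p string_first   # false
p regexp_first   # true
p match_index    # 0
```

So the test needs the Regexp on the left of ===, or =~ with a nil check.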

Incidentally, I believe that the regexp above is best translated to Ruby
like this:

utf8rgx = /^(.)*$/u

You should also look into $KCODE (specifically $KCODE = 'u').

(Caveat to the above: I'm not much of an encoding expert at all).
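
For readers on Ruby 1.9 or later (released after this thread), $KCODE no longer has any effect and strings carry their own encoding, so the question can be put to the string directly with String#valid_encoding?. A minimal sketch under that assumption; the helper name looks_like_utf8? and the sample strings are hypothetical. Note that this accepts any well-formed UTF-8, including control characters that the Perl pattern rejects:

```ruby
# Hypothetical helper: true when the bytes are well-formed UTF-8.
def looks_like_utf8?(bytes)
  # Work on a copy so the caller's string keeps its original encoding tag.
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

utf8_sample   = "caf\u00E9 \u20AC"   # "café €" encoded as UTF-8
latin1_sample = "caf\xE9".b          # "café" as ISO-8859-1 bytes

utf8_ok   = looks_like_utf8?(utf8_sample)
latin1_ok = looks_like_utf8?(latin1_sample)
ascii_ok  = looks_like_utf8?("onlyascii")   # plain ASCII is also valid UTF-8

p utf8_ok, latin1_ok, ascii_ok
```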

--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk

James Gray

3/23/2006 2:55:00 PM

On Mar 23, 2006, at 8:38 AM, Une bévue wrote:

> utf8rgx=Regexp.new('m/^(
> [\x09\x0A\x0D\x20-\x7E] # ASCII
> | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
> | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
> | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
> | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
> | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
> | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
> | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
> )*$/x')

Try changing this to:

utf8rgx = / ... /x

Hope that helps.
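
The likely reason this helps: Regexp.new takes only the pattern text, so the Perl delimiters "m/" and "/x" inside the string become literal characters to match rather than syntax. A small sketch with a shortened pattern (the 'toto' pattern is borrowed from earlier in the thread; the variable names are illustrative):

```ruby
# The Perl-style delimiters are matched literally here, so this pattern
# wants the text "m/" before the start-of-line anchor and cannot match 'toto'.
wrong = Regexp.new('m/^ toto $/x')

# Two equivalent ways to get extended (whitespace-insensitive) mode in Ruby.
from_string  = Regexp.new('^ toto $', Regexp::EXTENDED)
from_literal = /^ toto $/x

wrong_result   = (wrong =~ 'toto')
string_result  = (from_string =~ 'toto')
literal_result = (from_literal =~ 'toto')

p wrong_result     # nil
p string_result    # 0
p literal_result   # 0
```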

James Edward Gray II

pere.noel

3/23/2006 3:10:00 PM

James Edward Gray II <james@grayproductions.net> wrote:

> Try changing this to:
>
> utf8rgx = / ... /x
>
> Hope that helps.

ok, thanks, i see what you mean !
--
une bévue

pere.noel

3/23/2006 3:10:00 PM

Ross Bamford <rossrt@roscopeco.co.uk> wrote:

> You'll need to switch those around, as I showed in my response to your
> other thread. flag will then be true, but unfortunately I think too
> often:
>
> utf8rgx === "onlyascii"
> # => true
>
> I think to do that kind of test you'd have to remove the first line
> (matching ASCII chars) and not anchor the regexp with ^ and $.
>
> Incidentally, I believe that the regexp above is best translated to Ruby
> like this:
>
> utf8rgx = /^(.)*$/u
>
> You should also look into $KCODE (specifically $KCODE = 'u').
>
> (Caveat to the above: I'm not much of an encoding expert at all).

ok, thanks for everything. maybe it would be better to strip out all of the
html tags and keep only part of what's in the <body/>...
--
une bévue

pere.noel

3/23/2006 4:36:00 PM

James Edward Gray II <james@grayproductions.net> wrote:

> > utf8rgx=Regexp.new('m/^(
> > [\x09\x0A\x0D\x20-\x7E] # ASCII
> > | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
> > | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
> > | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
> > | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
> > | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
> > | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
> > | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
> > )*$/x')
>
> Try changing this to:
>
> utf8rgx = / ... /x

the above regexp doesn't work as expected with ruby, i've compared the
output for the same files with perl and ruby, ruby says always "yes it
is UTF-8", where perl says NO over an ISO-8859-1 encoded file... (even
after wiping out the first line, the first ^ and the last $)

then, for the time being, i'll use the perl script from ruby in a command
line fashion...
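
One difference that may explain the mismatch (an assumption, since the thread does not show how the files are read): in Perl without /m, ^ and $ anchor at the start and end of the whole string, whereas in Ruby they always anchor at line boundaries, and the whole-string anchors are \A and \z. With ^(...)*$ a single ASCII-only line is then enough for a match, and with the anchors removed (...)* matches the empty string anywhere. A sketch with the pattern shortened to its first two alternatives; the /n flag, the .b call and the sample data are assumptions for running it on Ruby 1.9 or later:

```ruby
# Shortened to the ASCII and 2-byte alternatives of the thread's pattern.
line_anchored   = /^(?:[\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF])*$/n
string_anchored = /\A(?:[\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF])*\z/n
unanchored      = /(?:[\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF])*/n

# Hypothetical ISO-8859-1 content: an ASCII line, then "café" with é as 0xE9.
latin1 = "plain ascii line\ncaf\xE9\n".b

line_result       = line_anchored.match?(latin1)    # the first line alone satisfies ^...$
string_result     = string_anchored.match?(latin1)  # the whole string must be valid
unanchored_result = unanchored.match?(latin1)       # an empty match always succeeds

p line_result         # true
p string_result       # false
p unanchored_result   # true
```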
--
une bévue

ts

3/23/2006 4:48:00 PM

>>>>> "U" == Une bévue <pere.noel@laponie.com.invalid> writes:

U> the above regexp doesn't work as expected with ruby, i've compared the
U> output for the same files with perl and ruby, ruby says always "yes it
U> is UTF-8", where perl says NO over an ISO-8859-1 encoded file... (even
U> after wiping out the first line, the first ^ and the last $)

moulon% cat b.rb
field='&é§è!çàîûtybvn¤'
utf8rgx=Regexp.new('^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$', Regexp::EXTENDED)

p utf8rgx =~ field
moulon%

moulon% file b.rb
b.rb: ISO-8859 text
moulon%

moulon% ruby b.rb
nil
moulon%
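
On Ruby 1.9 and later the same script needs small adjustments, because strings and regexps carry an encoding. A sketch, assuming a modern Ruby: the /n flag makes the pattern byte-oriented, String#b gives a binary copy of the input, and \A and \z replace ^ and $ so that the whole string is checked rather than a single line. The names UTF8_BYTES and utf8_bytes? are hypothetical, and the ISO-8859-15 transcoding is only a way to build a non-UTF-8 sample:

```ruby
UTF8_BYTES = /\A(?:
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*\z/nx

# Hypothetical helper: true when every byte sequence fits the pattern above.
def utf8_bytes?(str)
  # Compare raw bytes so the check does not depend on the string's encoding tag.
  UTF8_BYTES.match?(str.b)
end

utf8_field   = '&é§è!çàîûtybvn€'                          # UTF-8 source file assumed
latin9_field = utf8_field.encode(Encoding::ISO_8859_15)   # same text, one byte per char

p utf8_bytes?(utf8_field)     # true
p utf8_bytes?(latin9_field)   # false
p utf8_bytes?('onlyascii')    # true
```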

Guy Decoux

pere.noel

3/23/2006 5:13:00 PM

ts <decoux@moulon.inra.fr> wrote:

> p utf8rgx =~ field
> moulon%
>
> moulon% file b.rb
> b.rb: ISO-8859 text
> moulon%
>
> moulon% ruby b.rb
> nil
> moulon%

i don't understand your post )))

my rb file is UTF-8 encoded; at best, the answer i get from this
script is the reverse of what is wanted )))

otherwise i always get true...
--
une bévue

ts

3/23/2006 5:21:00 PM

>>>>> "U" == Une bévue <pere.noel@laponie.com.invalid> writes:

U> i don't understand your post )))

U> ts <decoux@moulon.inra.fr> wrote:

>> moulon% file b.rb
>> b.rb: ISO-8859 text
>> moulon%

my file is ISO-8859 encoded

>> moulon% ruby b.rb
>> nil
>> moulon%

and ruby says NO

U> output for the same files with perl and ruby, ruby says always "yes it
                                                                  ^^^^^^^
U> is UTF-8", where perl says NO over an ISO-8859-1 encoded file... (even
   ^^^^^^^^                              ^^^^^^^^^^^^^^^^^^^^^^^

Guy Decoux

comp.lang.ruby
