Ross Bamford
3/23/2006 2:51:00 PM
On Thu, 2006-03-23 at 23:38 +0900, Une bévue wrote:
> James Edward Gray II <james@grayproductions.net> wrote:
>
> >
> > The expression looks fine to me. Did you try using it?
>
> yes, without the correct result, here is my code :
>
> field='&é§è!çàîûtybvn€'
> utf8rgx=Regexp.new('m/^(
> [\x09\x0A\x0D\x20-\x7E] # ASCII
> | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
> | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
> | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
> | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
> | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
> | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
> | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
> )*$/x')
>
> the test :
>
> flag=(field === utf8rgx)
> p "flag = #{flag}"
>
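You'll need to switch those around, as I showed in my response to your
other thread: the Regexp has to be on the left of ===. Note also that
Regexp.new takes the bare pattern source, so the 'm/' and '/x' inside
your string become part of the pattern; write it as a literal /.../x
instead. With both changes flag will be true, but unfortunately, I
think, too often: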
utf8rgx === "onlyascii"
# => true
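To illustrate why the order matters, here is a small sketch (the
pattern and variable names are made up for the example): Regexp#===
attempts a match, while String#=== only compares for equality.

```ruby
# Regexp#=== tries a match; String#=== only tests string equality.
pattern = /\A[a-z]+\z/   # illustrative pattern, not the UTF-8 one
text    = "onlyascii"

regexp_first = (pattern === text)   # Regexp#===: a match attempt
string_first = (text === pattern)   # String#===: equality with a Regexp, never true

p regexp_first
p string_first
```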
To get a test that is false for plain ASCII, I think you'd have to
remove the first alternative (the one matching ASCII chars) and not
anchor the regexp with ^ and $.
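As a rough sketch of that idea, the pattern below keeps only the
multibyte alternatives and drops the anchors. It is written for
Ruby 2.0 or later (an assumption, since String#b and the /n byte
semantics used here are not in 1.8), and the names MULTIBYTE_UTF8 and
contains_multibyte_utf8? are mine, purely for illustration.

```ruby
# Multibyte alternatives only: the ASCII branch is dropped and nothing is anchored.
# The /n flag makes this a byte-oriented (ASCII-8BIT) pattern.
MULTIBYTE_UTF8 = /
    [\xC2-\xDF][\x80-\xBF]               # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]           # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]           # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}        # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}            # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
/nx

# Hypothetical helper: true if the string holds at least one
# well-formed multibyte UTF-8 sequence.
def contains_multibyte_utf8?(text)
  # String#b returns a binary copy, so the byte ranges above apply directly.
  MULTIBYTE_UTF8 === text.b
end

p contains_multibyte_utf8?("onlyascii")
p contains_multibyte_utf8?("caf\u00e9")
```

Note that this only says "there is some valid multibyte sequence
somewhere"; it does not check that the rest of the string is valid.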
Incidentally, I believe that the regexp above is best translated to Ruby
like this:
utf8rgx = /^(.)*$/u
You should also look into $KCODE (specifically $KCODE = 'u').
(Caveat to the above: I'm not much of an encoding expert at all).
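For anyone reading this on Ruby 1.9 or later, where $KCODE no longer
has any effect and strings carry their own encoding, a whole-string
validity check can be sketched without a hand-written regexp. This
assumes the input should be interpreted as UTF-8; valid_utf8? is a
hypothetical helper name, not something from the thread.

```ruby
# Hypothetical helper: does this byte sequence form well-formed UTF-8?
def valid_utf8?(bytes)
  # Work on a copy so the caller's string keeps its own encoding tag.
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

p valid_utf8?("onlyascii")
p valid_utf8?("caf\u00e9")
p valid_utf8?("\xC3".b)   # a lead byte with no continuation byte
```

Like the anchored regexp, this is also true for pure ASCII, since
ASCII is a subset of UTF-8.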
--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk