Asp Forum - regex \w allows non english characters

Ehud Rosenberg

5/10/2007 3:05:00 PM

Hi everyone...
I'm looking for a way to only allow English characters through a
simple regex.
It seems that \w (although the documentation states it is equivalent to
[a-zA-Z0-9]) still allows non-English characters (in my case Hebrew).

Has anyone come up with a solution other than specifying [abcdef...]?

Thanks!
Ehud

7 Answers

Robert Klemme

5/10/2007 3:13:00 PM

On 10.05.2007 17:05, Ehud wrote:
> Hi everyone...
> I'm looking for a way to only allow English characters through a
> simple regex.
> It seems that \w (although the documentation states it is equivalent to
> [a-zA-Z0-9]) still allows non-English characters (in my case Hebrew).
>
> Has anyone come up with a solution other than specifying [abcdef...]?

[a-zA-Z]

robert
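
Robert's character class can be anchored so that the whole string, not just a single character, has to consist of English letters. A minimal sketch, assuming Ruby 1.9 or later; the method name english_letters_only? is hypothetical and does not come from the thread:

```ruby
# Hypothetical helper (not from the thread): true only when every
# character in str is an ASCII letter a-z or A-Z.
def english_letters_only?(str)
  # \A and \z anchor the match to the whole string; ^ and $ would
  # only anchor it to a single line.
  !!(str =~ /\A[a-zA-Z]+\z/)
end

puts english_letters_only?("Ehud")         # ASCII letters only
puts english_letters_only?("Ehud \u05E9")  # contains a space and a Hebrew letter
```

Add 0-9 or other characters to the class if digits or punctuation should also be accepted.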

Kyle Schmitt

5/10/2007 7:26:00 PM

I'm making a guess here, but ruby is probably looking at the Hebrew
characters as a normal range of chars, with a character encoding. Now
what encoding Hebrew uses I'm not sure, but for instance the ascii
code for 'a' is 97. The code for one of the Hebrew characters is
probably 97 also. Since ruby doesn't really do UTF, it just sees two
characters, both with a code of 97, and lets them through.

--Kyle

Nobuyoshi Nakada

5/11/2007 4:46:00 AM

Hi,

At Fri, 11 May 2007 04:25:42 +0900,
Kyle Schmitt wrote in [ruby-talk:251082]:
> I'm making a guess here, but ruby is probably looking at the Hebrew
> characters as a normal range of chars, with a character encoding. Now
> what encoding Hebrew uses I'm not sure, but for instance the ascii
> code for 'a' is 97. The code for one of the Hebrew characters is
> probably 97 also. Since ruby doesn't really do UTF, it just sees two
> characters, both with a code of 97, and lets them through.

/[[:alpha:]]/u

--
Nobu Nakada

Ken Bloom

5/11/2007 5:08:00 PM

On Fri, 11 May 2007 04:25:42 +0900, Kyle Schmitt wrote:

> I'm making a guess here, but ruby is probably looking at the Hebrew
> characters as a normal range of chars, with a character encoding. Now
> what encoding Hebrew uses I'm not sure, but for instance the ascii code
> for 'a' is 97. The code for one of the Hebrew characters is probably 97
> also. Since ruby doesn't really do UTF, it just sees two characters,
> both with a code of 97, and lets them through.

Unless you're using special fonts that do a special mapping (which is
generally no longer done these days), non-English characters in
ASCII-compatible encodings such as ISO-8859-8 or UTF-8 are always
encoded with byte values 128-255. Different encodings are simply
different ways of mapping these values to different languages. 0-127 are
always the same English ASCII characters.
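
The byte ranges described above can be checked directly. A small sketch, assuming Ruby 1.9 or later and UTF-8 strings; the helper name high_bytes_only? is hypothetical:

```ruby
# Three Hebrew letters (shin, bet, tav), written with \u escapes so the
# example does not depend on the encoding of the source file.
hebrew  = "\u05E9\u05D1\u05EA"
english = "abc"

# Hypothetical helper: true when every byte of str is in 128..255.
def high_bytes_only?(str)
  str.bytes.all? { |b| b >= 128 }
end

p hebrew.bytes                # each letter takes two bytes in UTF-8
p english.bytes               # one byte per letter, all below 128
p high_bytes_only?(hebrew)
p high_bytes_only?(english)
```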

שבת שלום
--Ken Bloom

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu...

Kyle Schmitt

5/11/2007 7:28:00 PM

As I said, a guess ;) That's really interesting though. So had it
been for chars outside of English, I would have been on the ball...
Any chance that Ruby's regex will work on UTF-8 (or 16 or 7 or any of
the variants)?

eden li

5/12/2007 6:08:00 AM

The meaning of \w can change if you alter the global $KCODE variable.
It's best to specify exactly what you mean if you know exactly what
you want (eg, follow Robert's advice). Specifying \w says that you
want "wordful," non-breaking characters; this includes non-English
characters, even CJK.

irb(main):001:0> s = "שבת שלום"
=> "\327\251\327\221\327\252 \327\251\327\234\327\225\327\235"
irb(main):002:0> s =~ /\w/ ? "match" : "no match"
=> "no match"
irb(main):003:0> $KCODE = "u"
=> "u"
irb(main):004:0> s =~ /\w/ ? "match" : "no match"
=> "match"
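
The session above is from Ruby 1.8, where $KCODE changes what \w means. As a hedged comparison, on Ruby 1.9 and later (an assumption, since the thread predates those versions) \w matches only ASCII word characters, and the Unicode-aware classes have to be requested explicitly. The helper name describe is hypothetical:

```ruby
# The Hebrew string whose bytes the session prints, written with
# \u escapes so the source file encoding does not matter.
s = "\u05E9\u05D1\u05EA \u05E9\u05DC\u05D5\u05DD"

# Hypothetical helper mirroring the ternary used in the irb session.
def describe(str, pattern)
  str =~ pattern ? "match" : "no match"
end

puts describe(s, /\w/)           # \w is limited to [a-zA-Z0-9_]
puts describe(s, /[[:alpha:]]/)  # POSIX bracket class, Unicode-aware
puts describe(s, /\p{L}/)        # Unicode "Letter" property
puts describe("abc", /\w/)       # plain ASCII letters
```

On those versions an explicit class such as [a-zA-Z] and \w behave the same for Hebrew input, but the explicit class still states the intent more clearly.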

On May 10, 11:10 pm, Ehud <ehud...@gmail.com> wrote:
> Hi everyone...
> I'm looking for a way to only allow English characters through a
> simple regex.
> It seems that \w (although the documentation states it is equivalent to
> [a-zA-Z0-9]) still allows non-English characters (in my case Hebrew).
>
> Has anyone come up with a solution other than specifying [abcdef...]?
>
> Thanks!
> Ehud

eden li

5/14/2007 3:50:00 AM

Depends on what you mean by "work." If you don't set a global $KCODE
and you don't specify a language as part of the regex options, all
regular expressions will work on the byte level. It appears ruby
(1.8.x) only supports utf-8 if you set $KCODE = "u" or pass in a u as
a regex option.

>> "??" =~ /(\w)/u and $1
=> "?"
>> Iconv.iconv("utf-16", "utf-8", "??") =~ /(\w)/u and $1
=> false
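
Iconv was removed from the standard library in Ruby 2.0. A rough equivalent on later versions, offered as a sketch rather than a translation of the session above, is to transcode with String#encode and match against the UTF-8 form. The helper name first_word_char is hypothetical:

```ruby
# Hypothetical helper: transcode to UTF-8 first, because a regex
# written for an ASCII-compatible encoding is not meant to be run
# directly against UTF-16 bytes.
def first_word_char(str)
  utf8 = str.encode("UTF-8")
  utf8[/\p{Word}/]  # Unicode-aware counterpart of \w; nil if nothing matches
end

utf16 = "\u05E9\u05DC\u05D5\u05DD".encode("UTF-16LE")
puts first_word_char(utf16)
```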

On May 12, 3:27 am, "Kyle Schmitt" <kyleaschm...@gmail.com> wrote:
> As I said, a guess ;) That's really interesting though. So had it
> been for chars outside of English, I would have been on the ball...
> Any chance that Ruby's regex will work on UTF-8 (or 16 or 7 or any of
> the variants)?

comp.lang.ruby
