baumanj
4/20/2006 3:52:00 AM
Actually, if you want to deal with multi-byte characters, you have to
make sure to enable that mode. There are three ways to do this
(assuming you want to use UTF-8):
1. Launch ruby (or irb) with -Ku
2. Set the $-K variable to 'u'
3. Add the 'u' option to the end of a regular expression
For example:
>> str = "\350\266\243\345\221\263"
>> str.scan(/./) {|chr| puts "#{chr.inspect} => #{chr}"}
"\350" => ?
"\266" => ?
"\243" => ?
"\345" => ?
"\221" => ?
"\263" => ?
>> str.scan(/./u) {|chr| puts "#{chr.inspect} => #{chr}"}
"\350\266\243" => 趣
"\345\221\263" => 味
So a UTF-8 safe each_char method could be:
class String
  def each_char
    scan(/./u) {|char| yield char }
  end
end
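One caveat with the version above: "." does not match a newline unless the
regexp also has the m option, so scan(/./u) silently skips "\n". A minimal
sketch that keeps newlines follows; each_utf8_char is a hypothetical helper
name of my own (not a String method), used here so nothing built in gets
redefined:

```ruby
# Hypothetical helper (not part of String): yields each UTF-8 character
# of str as a one-character string.
# /u selects UTF-8; /m lets "." match newlines as well.
def each_utf8_char(str)
  str.scan(/./mu) { |char| yield char }
end

str = "\350\266\243\n\345\221\263"   # two 3-byte characters around a newline
chars = []
each_utf8_char(str) { |c| chars << c }
p chars.length   # => 3
```

Without the m option the same call would yield only the two multi-byte
characters and drop the newline.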
Sadly, even when KCODE is set to UTF-8, String#[] still returns
bytes, even though the rdoc claims "If passed a single Fixnum, returns
the code of the character at that position". Is this a known issue? It
seems like there should be a way to access UTF-8 characters without
resorting to regular expressions.
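One regexp-free possibility, offered as a sketch rather than a definitive
answer: String#unpack with the 'U*' template decodes a UTF-8 string into an
array of code points, which can then be indexed by character position, and
Array#pack('U') turns a code point back into a string. The variable names
below are only for illustration:

```ruby
str = "\350\266\243\345\221\263"

# One Integer per UTF-8 character, rather than one per byte.
codepoints = str.unpack('U*')
p codepoints          # => [36259, 21619]

# Index by character position, then re-encode that code point as UTF-8.
second = [codepoints[1]].pack('U')
p second == "\345\221\263"   # => true
```

This decodes the whole string up front, so it is not a drop-in replacement
for String#[], but it does give character-based access.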
Robert Klemme wrote:
> John wrote:
> > I have seen a lot of people having trouble with String not including an
> > iterator for each character, and rather just each byte. I include this
> > snippet in any code that needs to iterate over each character in a
> > string. Simple, elegant, and very very Ruby! Man, I love redefining
> > pre-existing classes.
> >
> >
> > # Now you can use the syntax:
> > # "foobar".each_char do ...
> >
> > class String
> >   def each_char
> >     each_byte { |byte| yield byte.chr }
> >   end
> > end
>
> This method does not yield characters but strings. Also, it won't work
> for multibyte characters. I'm not sure how /./ behaves with multibyte
> chars but I'd say chances are higher that you actually get the proper
> result by doing
>
> str.scan(/./) {|chr| p chr}
>
> Kind regards
>
> robert