Carlos
12/17/2004 6:21:00 PM
[Johan Sörensen <johans@gmail.com>, 2004-12-17 16.42 CET]
> # this in an utf-8 encoded erb template (a rails "view" in my case)
> <% text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och
> hackar cocoa" -%>
> <%= text[0..47] %>
> <br />
> <%= text[0..48] %>
> <br />
> # notice the 'o' in ingenjor instead of 'ö'
> <% othertext = "Eftersom jag jobbar som kontruktör/ingenjor på dagarna
> och hackar cocoa" -%>
> <%= othertext[0..47] %>
>
> #produces this (the last character on the first line will display as
> a "funny character" in browsers)
>
> Eftersom jag jobbar som kontruktör/ingenjör p?
> Eftersom jag jobbar som kontruktör/ingenjör på
> Eftersom jag jobbar som kontruktör/ingenjor på
>
>
> Is this a possible bug in Ruby (1.8.1) or could it be something with
> Rails that gets in the way, I can reproduce this across two servers
> and in webrick.
It is a Ruby feature :). String indices count bytes, not characters, and
in UTF-8 both "ö" and "å" take two bytes, so text[0..47] stops in the
middle of "å". For now you have to write your own indexing routines for
UTF-8 strings (note that String#[/regex/] works, because regexes can be
made UTF-8 aware with the /u flag or $KCODE = "u").
Here is something you can start from:
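
  # Mixin that makes String#[] index by UTF-8 character instead of by byte.
  module UTF8Str
    def [](*params)
      if params.all? { |p| Integer === p } ||
         (params.size == 1 && Range === params[0])
        # Decode to an array of code points, index that, then re-encode.
        res = unpack("U*")[*params]
        return nil if res.nil?   # index out of range, like plain String#[]
        res = [res] unless Array === res
        return res.pack("U*")
      end
      super   # regexes, substrings etc. go to the normal String#[]
    end
  end

  a = "áéióúü"
  a.extend UTF8Str
  puts a[0], a[1], a[2], a[3], a[4], a[1,2], a[1..2], a[-1]
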
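
To see where the "funny character" in your first line comes from, here is a
small sketch using your own string. It assumes the source file is saved as
UTF-8; the variable names (bytes, chars, dangling, rest, by_codepoint,
by_regex) are only for illustration. It also shows the regex route, which
is probably the shortest workaround if you just need the first N characters.

```ruby
text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och hackar cocoa"

bytes = text.unpack("C*")   # one integer per byte
chars = text.unpack("U*")   # one integer per UTF-8 code point

# "å" is two bytes (0xC3 0xA5) sitting at byte offsets 47 and 48,
# so a byte range 0..47 stops after the 0xC3 and leaves half a character.
dangling = bytes[47]
rest     = bytes[48]

# Counting characters, "å" is number 45 (from 0), and slicing the
# code-point array keeps it whole.
by_codepoint = chars[0..45].pack("U*")

# Same cut with a regex: /u makes "." match a whole UTF-8 character.
by_regex = text[/\A.{46}/u]

puts by_codepoint
puts by_regex
```

With "ingenjor" instead of "ingenjör" the string is one byte shorter before
"på", which is why the same 0..47 range happened to land on a character
boundary in your third example.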
Good luck.