Asp Forum - utf-8 & Range under eruby (possibly Rails) problems

Johan Sörensen

12/17/2004 3:43:00 PM

Hi,

I'm having some issues with a range that truncates text. Below is
a (very) simplified version of the truncate method that's used in Rails
(which is where I discovered this):

# this in an utf-8 encoded erb template (a rails "view" in my case)
<% text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och
hackar cocoa" -%>
<%= text[0..47] %>
<br />
<%= text[0..48] %>
<br />
# notice the 'o' in ingenjor instead of 'ö'
<% othertext = "Eftersom jag jobbar som kontruktör/ingenjor på dagarna
och hackar cocoa" -%>
<%= othertext[0..47] %>

#produces this (the last character on the first line will display as
a "funny character" in browsers)

Eftersom jag jobbar som kontruktör/ingenjör p?
Eftersom jag jobbar som kontruktör/ingenjör på
Eftersom jag jobbar som kontruktör/ingenjor på

Is this a possible bug in Ruby (1.8.1), or could it be something in
Rails that gets in the way? I can reproduce this across two servers
and in WEBrick.
I was unable to test this properly in irb, since my terminal (or irb)
would act funny on the öäå's.

--johan

--
Johan Sørensen
Professional Futurist
www.johansorensen.com

4 Answers

Carlos

12/17/2004 6:21:00 PM

[Johan Sörensen <johans@gmail.com>, 2004-12-17 16.42 CET]
> # this in an utf-8 encoded erb template (a rails "view" in my case)
> <% text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och
> hackar cocoa" -%>
> <%= text[0..47] %>
> <br />
> <%= text[0..48] %>
> <br />
> # notice the 'o' in ingenjor instead of 'ö'
> <% othertext = "Eftersom jag jobbar som kontruktör/ingenjor på dagarna
> och hackar cocoa" -%>
> <%= othertext[0..47] %>
>
> #produces this (the last character on the first line will display as
> a "funny character" in browsers)
>
> Eftersom jag jobbar som kontruktör/ingenjör p?
> Eftersom jag jobbar som kontruktör/ingenjör på
> Eftersom jag jobbar som kontruktör/ingenjor på
>
>
> Is this a possible bug in Ruby (1.8.1), or could it be something in
> Rails that gets in the way? I can reproduce this across two servers
> and in WEBrick.

It is a Ruby feature :). Indices in strings are bytes, not chars. For the
moment, you must develop your own indexing routines for UTF-8 strings
(notice that String#[/regex/] works, because regexes are UTF-8 aware).

Here is something you can start from:

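module UTF8Str
  # Index by UTF-8 code point instead of by byte.
  def [](*params)
    if params.all? { |p| Integer === p } ||
       (params.size == 1 && Range === params[0])
      # Decode to an array of code points, index that, then re-encode.
      res = unpack("U*")[*params]
      return nil if res.nil?            # index out of range
      res = [res] unless Array === res  # a single index returns one code point
      return res.pack("U*")
    end
    super  # anything else (regex, substring) goes to String#[]
  end
end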

a="áéióúü"
a.extend UTF8Str

puts a[0], a[1], a[2], a[3], a[4], a[1,2], a[1..2], a[-1]
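
The regex route mentioned above can be sketched as a small truncation
helper. This is only an illustration, not the Rails truncate method: the
name truncate_chars and its parameters are made up here, and the /u flag
is assumed to be what makes the pattern UTF-8 aware (on Ruby 1.8 that
flag, or $KCODE = 'u', is what the regex engine needs).

```ruby
# Hypothetical helper: keep at most max_chars characters of a UTF-8 string.
# The pattern matches whole characters, so the cut never lands inside one.
def truncate_chars(text, max_chars)
  # \A anchors at the start, /m lets "." match newlines as well,
  # /u makes the regex treat the string as UTF-8.
  text[/\A.{0,#{max_chars}}/mu]
end

text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och hackar cocoa"
puts truncate_chars(text, 46)
puts truncate_chars("áéióúü", 3)
```

Counting characters rather than bytes means the limit for the example
sentence is 46, not 48, to end right after the "på".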

Good luck.

--

Johan Sörensen

12/17/2004 6:34:00 PM

On Sat, 18 Dec 2004 03:20:41 +0900, Carlos <angus@quovadis.com.ar> wrote:
> It is a Ruby feature :). Indices in strings are bytes, not chars. For the
> moment, you must develop your own indexing routines for UTF-8 strings
> (notice that String#[/regex/] works, because regexes are UTF-8 aware).

I see.

The thing that has me confused, though, is that it's not consistent,
since it only happens on the first line in the example I gave.
If I expand the range a little, it passes through untouched. If I change
either of the preceding ö's, it passes through untouched.

Is this expected behaviour?

-- johan

Carlos

12/17/2004 6:55:00 PM

[Johan Sörensen <johans@gmail.com>, 2004-12-17 19.34 CET]
> On Sat, 18 Dec 2004 03:20:41 +0900, Carlos <angus@quovadis.com.ar> wrote:
> > It is a Ruby feature :). Indices in strings are bytes, not chars. For the
> > moment, you must develop your own indexing routines for UTF-8 strings
> > (notice that String#[/regex/] works, because regexes are UTF-8 aware).
>
> I see.
>
> The thing that has me confused, though, is that it's not consistent,
> since it only happens on the first line in the example I gave.
> If I expand the range a little, it passes through untouched. If I change
> either of the preceding ö's, it passes through untouched.

Well, because "ö".length == 2 (UTF-8 is a multibyte encoding). Your range's
end was falling between the two bytes of the "ö".
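
The byte arithmetic can be checked with a short sketch. It assumes a
newer Ruby (1.9.3 or later), where String#[] counts characters and
String#byteslice gives the byte-oriented slicing that 1.8's String#[]
performs; byte_cut is a made-up helper name used only for illustration.

```ruby
# Hypothetical helper: slice by byte offsets, as Ruby 1.8's String#[] did.
def byte_cut(text, last_byte)
  text.byteslice(0..last_byte)
end

text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna"

cut   = byte_cut(text, 47)  # stops after the first byte of "å"
whole = byte_cut(text, 48)  # includes both bytes of "å"

puts "å".bytesize           # number of bytes in the UTF-8 encoding of "å"
puts cut.valid_encoding?    # is the slice still well-formed UTF-8?
puts whole.valid_encoding?
```

With two "ö"s before it, the "å" occupies bytes 47 and 48, so a range
ending at 47 keeps only half of it. Replace one "ö" with "o" and
everything after it shifts down a byte, which is why the third line in
the original example came out intact.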

--

Michael DeHaan

12/17/2004 8:06:00 PM

Someone on PerlMonks taught me a neat trick: splitting on an empty
regex returns an array of one-character strings. The same is true in
Ruby, so these indexing routines are really simple.

some_string.split(//).each { |c|
...
}

# or ... some_string.split(//)[5]
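
A minimal sketch of a range lookup built on that split trick. The name
char_slice is hypothetical, and the /u flag is an assumption added here
so that the empty regex splits on UTF-8 characters rather than bytes
(on Ruby 1.8 a plain // does that only when $KCODE is set to 'u').

```ruby
# Hypothetical helper: index a UTF-8 string by character position.
def char_slice(str, range)
  chars = str.split(//u)   # array of one-character strings
  picked = chars[range]    # ordinary Array indexing; nil if out of range
  picked && picked.join    # glue the characters back into one string
end

puts char_slice("áéióúü", 1..2)
puts char_slice("Eftersom jag jobbar som kontruktör/ingenjör på", 0..9)
```

Splitting the whole string just to take a prefix does more work than
needed for long texts, but it keeps the indexing logic down to plain
Array operations.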

Carlos> It is a Ruby feature :). Indices in strings are bytes, not chars. For the
Carlos> moment, you must develop your own indexing routines for UTF-8 strings
Carlos> (notice that String#[/regex/] works, because regexes are UTF-8 aware).

comp.lang.ruby
