Asp Forum - state of unicode support

Chad Perrin

7/28/2006 3:01:00 PM

I've heard rumors that "oniguruma fixes everything", and the like. I'm
sure that's a touch of hyperbole, but in any case:

What's the current state of Unicode support in Ruby? My recollection is
that Unicode support was somewhat lacking.

--
CCD CopyWrite Chad Perrin [ http://ccd.ap... ]
Brian K. Reid: "In computer science, we stand on each other's feet."

20 Answers

why the lucky stiff

7/28/2006 7:13:00 PM

On Sat, Jul 29, 2006 at 01:08:06AM +0900, Charles O Nutter wrote:
> Oh man, I really don't have the energy for this thread again :) Chad: if you
> get a straight answer about this, let me know. Others: Is there a simple,
> straightforward FAQ entry somewhere that says "to use Unicode you have the
> following choices"? This keeps coming up.

This isn't a complete answer, but it's the best I can do to help Chad out.
If you really want to solve the question now, Chad, I'd read Julian Tarkhanov's
UNICODE_PRIMER[1].

First, Oniguruma[2] is a regular expression engine. It supports Unicode regular
expressions under many encodings, and it's very handy. If all you want to do is
search strings for Unicode text, then great, use it.

Ruby's strings are not Unicode-aware. There is a library called 'jcode', which
comes with Ruby and tries to help out, but it's very simple: it is only good for
a few things, like counting characters and iterating through characters. Again,
UTF-8 only.

Ruby itself also understands UTF-8 regular expressions to a degree, using the
'u' modifier. Many Ruby-based UTF-8 hacks are based on the idea of
str.scan(/./u), which returns an array of strings, each string containing a
multibyte character. (Also: str.unpack('U*'), which returns the code points
as integers.)
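
A minimal sketch of those two idioms, assuming a current Ruby interpreter
rather than the 1.8 discussed here (the "\u" escapes and String#bytesize are
not available in 1.8, and the variable names are illustrative only):

```ruby
# Assumption: run on a current Ruby, where string literals default to UTF-8.
str = "caf\u00e9 \u65e5\u672c"   # "cafe" with an acute accent, a space, two kanji

# One-element strings, split between code points.
chars = str.scan(/./u)

# The same code points as Integers.
codepoints = str.unpack('U*')

# Bytes and characters disagree as soon as a character needs more than one byte.
byte_count = str.bytesize
char_count = chars.length

puts "bytes: #{byte_count}, chars: #{char_count}"
puts codepoints.inspect
```

The gap between the two counts is the whole problem in miniature: anything
that indexes or slices by byte can land in the middle of a character.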

If you are using Unicode strings in Rails, check out Julian's unicode_hacks
plugin: <http://julik.textdriven.com/svn/tools/rails_plugins/unicode_...
They have a channel on irc.freenode.net: #multibyte_rails.

The unicode_hacks plugin is interesting in that it tries to load one of several
Ruby unicode extensions before falling back to str.unpack('U*') mode.

Here are the extensions it prefers, in order:

* icu4r: a Ruby extension to IBM's ICU library. Adds UString, URegexp, etc.
classes for containing Unicode stuffs.
(project page[3] and docs[4])
* utf8proc: a small library for iterating through characters and converting
ints to code points. Adds String#utf8map and Integer#utf8, for example.
(download[5])
* unicode: a little extension by Yoshida Masato which adds Unicode class
methods for `strcmp`, `[de]compose`, normalization and case conversion for
utf-8.
(download[6] and readme[7])
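
The load-with-fallback behaviour described above might be sketched like this.
It is not the plugin's actual code: pick_backend is a hypothetical helper, and
the require names in CANDIDATES are assumptions that may not match what each
extension really installs.

```ruby
# Assumed require names, in the order of preference listed above.
CANDIDATES = %w[icu4r utf8proc unicode].freeze

# Hypothetical helper: return the name of the first library that loads,
# or :unpack to signal the pure-Ruby str.unpack('U*') fallback.
def pick_backend(candidates = CANDIDATES)
  candidates.each do |lib|
    begin
      require lib
      return lib.to_sym
    rescue LoadError
      next # not installed, try the next candidate
    end
  end
  :unpack
end

puts "Unicode backend: #{pick_backend}"
```

On a machine with none of the extensions installed this simply reports the
:unpack fallback.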

So, many options, some massive, but most only partial and in their infancy.

The most recent entrant into this race, though, is Nikolai Weibull's
ruby-character-encoding library, which aims to get complete multibyte support
into Ruby 1.8's string class. If you use it, it will probably break a lot of
libraries which are used to strings acting the way they do now.
He is trying to emulate the Ruby 2.0 Unicode plans outlined by Matz.[8]

Nevertheless, it is a very promising library and Nikolai is working at
break-neck pace to appease the nations, all tongues and peoples.[9] And
discussion is here[10] with links to the mailing list and all that.

This might be a landslide of information, but it's better than spending all day
Googling and extracting tarballs and poring over READMEs just to get a
picture of what's happening these days.

Signed in elaborate calligraphy with a picture of grapes at the end,

_why

[1] http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/UNIC...
[2] http://www.geocities.jp/kosako3/...
[3] http://rubyforge.org/proje...
[4] http://icu4r.ruby...
[5] http://www.flexiguided.de/publications.utf8pr...
[6] http://www.yoshidam.net...
[7] http://www.yoshidam.net/u...
[8] http://redhanded.hobix.com/inspect/futurismUnicodeI...
[9] http://git.bitwi.se/?p=ruby-character-encodings.git...
[10] http://redhanded.hobix.com/inspect/nikolaiSUtf8LibIsAll...

Chad Perrin

7/28/2006 7:23:00 PM

On Sat, Jul 29, 2006 at 04:13:04AM +0900, why the lucky stiff wrote:
>
> This might be a landslide of information, but it's better than spending all day
> Googling and extracting tarballs and poring over READMEs just to get a
> picture of what's happening these days.

That was most excellent. Thank you for your kind assistance: it answers
my question quite well, and I appreciate your effort.

>
> Signed in elaborate calligraphy with a picture of grapes at the end,

... and as always, you manage to entertain in the process.

--
CCD CopyWrite Chad Perrin [ http://ccd.ap... ]
"The first rule of magic is simple. Don't waste your time waving your
hands and hopping when a rock or a club will do." - McCloctnick the Lucid

Matt Todd

7/28/2006 7:35:00 PM

So, the problem with Unicode support in Ruby is that the code
currently assumes that each letter is one byte, instead of multiple?
Presumably this includes search algorithms (for regexes, et al), then?

Or is my understanding warped and wrong?

_Why, et al, if you could break down the actual difficulties with
implementing Unicode support into Ruby 1.8, I think that might clear
up the questions we have as to whether a library eradicates all
problems (obviously, some problems can't be fixed, but merely hacked
or worked around).

Cheers, folks; remember to be nice. We're on the same team.

M.T.

Eric Armstrong

7/28/2006 9:00:00 PM

Spectacular summary. As a lurker on this thread,
I greatly appreciate it.

Tim Bray

7/29/2006 8:34:00 AM

On Jul 28, 2006, at 12:13 PM, why the lucky stiff wrote:

> First, Oniguruma[2] is a regular expression engine. It supports Unicode
> regular expressions under many encodings, it's very handy. If all you want
> to do is search strings for Unicode text, then great, use it.

Er uh well it doesn't do unicode properties so you can't use things
like \p{L} which, once you've found them, quickly come to feel
essential. Anytime you write [a-zA-Z] in a regex, you've probably
just uttered a bug. So I would say that Oniguruma has holes.

Otherwise, a very useful landslide indeed. -Tim

Michal Suchanek

7/31/2006 10:29:00 AM

On 7/28/06, Matt Todd <chiology@gmail.com> wrote:
> So, the problem with Unicode support in Ruby is that the code
> currently assumes that each letter is one byte, instead of multiple?
> This includes presumably search algorithms (for Regexs, et al), then?
>
> Or is my understanding warped and wrong?

Regexes in 1.8 can do utf-8.

>
> _Why, et al, if you could break down the actual difficulties with
> implementing Unicode support into Ruby 1.8, I think that might clear
> up the questions we have as to whether a library eradicates all
> problems (obviously, some problems can't be fixed, but merely hacked
> or worked around).

The problem is with compatibility. In 1.8 it is expected that strings
are arrays of bytes. You can split them to characters with a regex or
convert into a sequence of codepoints. But no standard library or
function would understand that (except the single one that is there
for undoing the transformation).

So you have the choice to work with utf-8 strings and regexes, and,
whenever you want characters, convert the strings so that you get at
the characters.
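
As a rough illustration of that first option, assuming a current Ruby
(force_encoding and valid_encoding? do not exist in 1.8): reversing a string
byte by byte corrupts multibyte characters, while going through code points
with unpack('U*') and back with pack('U*') does not.

```ruby
str = "caf\u00e9"   # the last character takes two bytes in UTF-8

# Treating the string as an array of bytes: the two bytes of the accented
# letter end up in the wrong order, so the result is not valid UTF-8.
reversed_bytes = str.bytes.reverse.pack('C*').force_encoding('UTF-8')

# Converting to code points, transforming, and undoing the conversion.
reversed_chars = str.unpack('U*').reverse.pack('U*')

puts reversed_bytes.valid_encoding?
puts reversed_chars
```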

Or you can use a special unicode string class (such as from icu4r)
that no standard functions understand. Some may be able to do to_s but
you get a normal string then.

Or you can change the strings to handle utf-8 (or any other multibyte)
characters, and probably break most of the standard functions.

None of these is completely satisfactory because it is far from
_transparent_ unicode support in the standard string class. That is
planned for 2.0.

Thanks

Michal

Alex Young

7/31/2006 2:52:00 PM

Tim Bray wrote:
> On Jul 28, 2006, at 12:13 PM, why the lucky stiff wrote:
>
>> First, Oniguruma[2] is a regular expression engine. It supports Unicode
>> regular expressions under many encodings, it's very handy. If all you want
>> to do is search strings for Unicode text, then great, use it.
>
>
> Er uh well it doesn't do unicode properties so you can't use things
> like \p{L}

Off topic, what does/would that do? Match a lower-case symbol?

--
Alex

Tim Bray

7/31/2006 3:11:00 PM

On Jul 31, 2006, at 7:52 AM, Alex Young wrote:

>>> First, Oniguruma[2] is a regular expression engine. It supports Unicode
>>> regular expressions under many encodings, it's very handy. If all you
>>> want to do is search strings for Unicode text, then great, use it.
>> Er uh well it doesn't do unicode properties so you can't use
>> things like \p{L}
>
> Off topic, what does/would that do? Match a lower-case symbol?

Unicode characters have named properties. "L" means it's a letter.
There are sub-properties like Lu and Ll for upper and lower case.
There are lots more properties for things like being numbers, being
white-space, combining forms and particular properties of Asian
characters and so on. Tremendously useful in regexes, particularly
for those of us round-eye gringos who are prone to write [a-zA-Z] and
think we're matching letters, which we're not. If you don't support
properties, you don't support Unicode. -Tim
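
A small sketch of the difference, assuming a current Ruby whose regex engine
accepts Unicode property classes (which, per the above, was not the case for
the engine under discussion in this thread):

```ruby
text = "na\u00efve caf\u00e9"   # two words, each with one non-ASCII letter

# The ASCII range stops at every accented letter and splits the words.
ascii_words = text.scan(/[a-zA-Z]+/)

# The L property matches any letter, so the words stay whole.
letter_words = text.scan(/\p{L}+/)

# Sub-properties work the same way: Lu is "uppercase letter".
upper = "\u00c9cole".scan(/\p{Lu}/)

puts ascii_words.inspect
puts letter_words.inspect
puts upper.inspect
```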

Julian 'Julik' Tarkhanov

7/31/2006 3:25:00 PM

On 28-jul-2006, at 21:13, why the lucky stiff wrote:

> Ruby itself also understands UTF-8 regular expressions to a degree. Using
> the 'u' modifier. Many Ruby-based UTF-8 hacks are based on the idea of:
> str.scan(/./u), which returns an array of strings, each string containing
> a multibyte character. (Also: str.unpack('U*').)

Which is actually useless because this breaks your string between
codepoints, not between characters. ICU4R currently resolves this, as
does a library posted on ruby-talk a while ago (with proper text
boundary handling).
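
To illustrate the distinction, a sketch assuming a current Ruby whose regex
engine supports \X (extended grapheme cluster). This is a stand-in for proper
text boundary handling, not the ICU4R API:

```ruby
# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points that a
# reader sees as a single character.
str = "e\u0301"

# Splitting on code points separates the accent from its base letter.
codepoint_pieces = str.scan(/./u)

# Splitting on grapheme clusters keeps them together.
cluster_pieces = str.scan(/\X/)

puts codepoint_pieces.length
puts cluster_pieces.length
```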
--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

Alex Young

7/31/2006 3:25:00 PM

Tim Bray wrote:
> On Jul 31, 2006, at 7:52 AM, Alex Young wrote:
>
>>>> First, Oniguruma[2] is a regular expression engine. It supports Unicode
>>>> regular expressions under many encodings, it's very handy. If all you
>>>> want to do is search strings for Unicode text, then great, use it.
>>>
>>> Er uh well it doesn't do unicode properties so you can't use things
>>> like \p{L}
>>
>>
>> Off topic, what does/would that do? Match a lower-case symbol?
>
>
> Unicode characters have named properties. "L" means it's a letter.
> There are sub-properties like Lu and Ll for upper and lower case.
> There are lots more properties for things like being numbers, being
> white-space, combining forms and particular properties of Asian
> characters and so on. Tremendously useful in regexes, particularly for
> those of us round-eye gringos who are prone to write [a-zA-Z] and think
> we're matching letters, which we're not. If you don't support
> properties, you don't support Unicode. -Tim
>
>
Gotcha. Thanks for that.

--
Alex
