Asp Forum - Unicode/multibyte string support in Ruby1.9/Ruby summary?

David Garamond

1/15/2005 2:14:00 PM

If someone could summarize the recent Unicode/multibyte string
discussion on a wiki, that would be nice (and _very_ useful). It will
help programmers prepare their code for Unicode support and backward
compatibility in the future. Topics should include:

- how will strings be stored in memory (which will probably differ
between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc.);

- how to check a string's charset, encoding;

- how to do various operations on the new multibyte string, especially
those which will be done differently compared to the classic string;

- what will happen to the classic string (e.g. will it perhaps be
renamed to ByteArray or something);

- comparison rules for cross-encoding and cross-charset strings;

- regexes;

- how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte
string support (especially since Ruby is a relative latecomer to the
Unicode scene).

Regards,
dave

8 Answers

Florian Gross

1/15/2005 5:45:00 PM

David Garamond wrote:

> If someone could summarize the recent Unicode/multibyte string
> discussion on a wiki, that would be nice (and _very_ useful). It will
> help programmers prepare their code for Unicode support and backward
> compatibility in the future. Topics should include:

Note that lots of this was recently discussed in [ruby-core:04146]. I'll
try to answer the questions as accurately as possible.

> - how will strings be stored in memory (which will probably differ
> between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc.);

AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
bytes for one character.) Note that the RString record of Ruby will get
a new field for the encoding.
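
A minimal sketch of the bytes-versus-characters distinction, assuming the String API as it later shipped in Ruby 1.9 and newer (the details were still open at the time of this thread). The helper name `byte_and_char_counts` is made up for illustration:

```ruby
# Report how many characters and how many raw bytes a string holds.
# The "\u" escape makes the literal UTF-8 regardless of the source encoding.
def byte_and_char_counts(str)
  { chars: str.length, bytes: str.bytesize, raw: str.bytes }
end

info = byte_and_char_counts("\u00e9a") # "é" followed by "a"
puts info[:chars]       # 2 characters
puts info[:bytes]       # 3 bytes: "é" takes two bytes in UTF-8
puts info[:raw].inspect # [195, 169, 97]
```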

> - how to check a string's charset, encoding;

String#encoding. It will return a String.
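
For illustration only: in the Ruby 1.9+ releases that eventually shipped, `String#encoding` returns an `Encoding` object rather than a String, and the name is available through `Encoding#name`. A small sketch under that assumption (`encoding_name` is a hypothetical helper):

```ruby
# Return the encoding label of a string as a plain String.
def encoding_name(str)
  str.encoding.name
end

utf8   = "\u00e9"                  # UTF-8 because of the \u escape
latin1 = utf8.encode("ISO-8859-1") # transcoded copy

puts utf8.encoding.class   # Encoding
puts encoding_name(utf8)   # UTF-8
puts encoding_name(latin1) # ISO-8859-1
```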

> - how to do various operations on the new multibyte string, especially
> those which will be done differently compared to the classic string;

Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.
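
A rough sketch of what "just like before" can look like on a multibyte string, assuming a modern Ruby (character-wise `reverse` and `gsub` since 1.9; case mapping for non-ASCII letters only arrived in Ruby 2.4, which is an assumption beyond this thread). `strip_diaeresis` is a hypothetical helper:

```ruby
# Replace "ï" (U+00EF) with a plain "i"; gsub works on characters, not bytes.
def strip_diaeresis(str)
  str.gsub("\u00ef", "i")
end

word = "na\u00efve" # "naïve": 5 characters, 6 bytes in UTF-8

puts word.length           # 5
puts word.reverse          # "evïan", the two-byte character stays intact
puts strip_diaeresis(word) # "naive"
puts word.upcase           # "NAÏVE" on Ruby 2.4 and newer
```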

> - what will happen to the classic string (e.g. will it perhaps be
> renamed to ByteArray or something);

The String interface will remain the same. Strings will just gain the
encoding facilities, but will remain largely backwards compatible AFAIK.

> - comparison rules for cross-encoding and cross-charset strings;

Strings that have the same encoding and the same bytes are equivalent.
Strings that have different but ASCII-compatible encodings and contain
only ASCII characters are equivalent.
Everything else is different.

I think there will be ways for converting from one encoding to another
one, but I don't know the details.

> - regexes;

Regexp#encoding is introduced, matching uses similar rules as String
comparison.
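
As a hedged illustration of how this turned out in Ruby 1.9+ (an assumption relative to this thread): a regexp carries an encoding, ASCII-only patterns match strings in any ASCII-compatible encoding, and a non-ASCII pattern raises `Encoding::CompatibilityError` against a non-ASCII string in a different encoding. `match_status` is a hypothetical helper:

```ruby
ASCII_RE = /abc/     # ASCII-only literal
UTF8_RE  = /\u00e9/  # the \u escape fixes the pattern to UTF-8

# Classify the outcome of matching re against str.
def match_status(re, str)
  (re =~ str) ? :match : :no_match
rescue Encoding::CompatibilityError
  :incompatible
end

latin1 = "caf\u00e9".encode("ISO-8859-1")

puts ASCII_RE.encoding                                   # US-ASCII
puts UTF8_RE.encoding                                    # UTF-8
puts match_status(UTF8_RE, "caf\u00e9")                  # match
puts match_status(ASCII_RE, "abc".encode("ISO-8859-1"))  # match
puts match_status(UTF8_RE, latin1)                       # incompatible
```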

> - how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte
> string support (especially since Ruby is a pretty latecomer in the
> Unicode scene);

I can't really do an in-depth comparison here, because I don't know the
other languages.

Note that str[0] will return a one-character String and that ?x will do
the same. There will be a new method like String#codepoint for getting
the underlying raw bytes. I think the one-character Strings can later
still be optimized fairly easily so that they can be immediate Objects.
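
A small sketch of these points, assuming the Ruby 1.9+ API as it eventually shipped. There the accessor ended up as `String#ord` (with `String#codepoints` for the whole string), returning code points, while `String#bytes` gives the raw bytes; those names are not from this thread. `first_char_info` is a hypothetical helper:

```ruby
# Describe the first character of a string.
def first_char_info(str)
  first = str[0] # a one-character String, not an Integer
  { char: first, codepoint: first.ord, bytes: first.bytes }
end

str = "\u00e9t\u00e9" # "été"

puts str[0].class               # String
puts ?x.class                   # String: ?x is the one-character string "x"
puts first_char_info(str)[:codepoint] # 233 (U+00E9)
puts first_char_info(str)[:bytes].inspect # [195, 169]
puts str.codepoints.inspect     # [233, 116, 233]
```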

ts

1/15/2005 5:58:00 PM

>>>>> "F" == Florian Gross <flgr@ccan.de> writes:

F> AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
F> bytes for one character.) Note that the RString record of Ruby will get
F> a new field for the encoding.

Are you sure? Or have I not understood what you are trying to say?

Guy Decoux

Yukihiro Matsumoto

1/15/2005 6:37:00 PM

Hi,

In message "Re: Unicode/multibyte string support in Ruby1.9/Ruby summary?"
on Sun, 16 Jan 2005 02:58:20 +0900, ts <decoux@moulon.inra.fr> writes:

|F> AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
|F> bytes for one character.) Note that the RString record of Ruby will get
|F> a new field for the encoding.
|
| Are you sure? Or have I not understood what you are trying to say?

He's right, except that the encoding will be stored using the FL_USER
flags or an instance variable of the string.

matz.

David Garamond

1/15/2005 8:01:00 PM

Florian Gross wrote:
> David Garamond wrote:
>
>> If someone could summarize the recent Unicode/multibyte string
>> discussion on a wiki, that would be nice (and _very_ useful). It will
>> help programmers prepare their code for Unicode support and backward
>> compatibility in the future. Topics should include:
>
> Note that lots of this was recently discussed in [ruby-core:04146]. I'll
> try to answer the questions as accurately as possible.

Thanks for the answers, Florian. Yes I was following the thread on
ruby-core too, but forgot that this is ruby-talk.

I have created the first draft in RubyGarden:

http://www.rubygarden.org/ruby?Unic...

It's very raw and bare-bones (plus I'm an ASCII guy and totally clueless
regarding multibyte/Unicode). I invite people to improve on it.

Thanks.

Regards,
dave

ts

1/16/2005 11:00:00 AM

>>>>> "Y" == Yukihiro Matsumoto <matz@ruby-lang.org> writes:

Y> He's right, except that the encoding will be stored using the FL_USER
Y> flags or an instance variable of the string.

My question was precisely about

"RString record of Ruby will get a new field"

i.e. I've read ruby_m17n :-)

Guy Decoux

gabriele renzi

1/16/2005 11:37:00 AM

Florian Gross ha scritto:
> David Garamond wrote:
>
>> If someone could summarize the recent Unicode/multibyte string
>> discussion on a wiki, that would be nice (and _very_ useful). It will
>> help programmers prepare their code for Unicode support and backward
>> compatibility in the future. Topics should include:
>
>
> Note that lots of this was recently discussed in [ruby-core:04146]. I'll
> try to answer the questions as accurately as possible.
>
>> - how will strings be stored in memory (which will probably differ
>> between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc.);
>
>
> AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
> bytes for one character.) Note that the RString record of Ruby will get
> a new field for the encoding.
>
>> - how to check a string's charset, encoding;
>
>
> String#encoding. It will return a String.
>
>> - how to do various operations on the new multibyte string, especially
>> those which will be done differently compared to the classic string;
>
>
> Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.
>
>> - what will happen to the classic string (e.g. will it perhaps be
>> renamed to ByteArray or something);
>
>
> The String interface will remain the same. Strings will just gain the
> encoding facilities, but will remain largely backwards compatible
> AFAIK.
>
>> - comparison rules for cross-encoding and cross-charset strings;
>
>
> Strings that have the same encoding and the same bytes are equivalent.
> Strings that have different but ASCII-compatible encodings and contain
> only ASCII characters are equivalent.
> Everything else is different.
>
> I think there will be ways for converting from one encoding to another
> one, but I don't know the details.
>
>> - regexes;
>
>
> Regexp#encoding is introduced, matching uses similar rules as String
> comparison.
>
>> - how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte
>> string support (especially since Ruby is a pretty latecomer in the
>> Unicode scene);
>
>
> I can't really do an in-depth comparison here, because I don't know the
> other languages.
>
> Note that str[0] will return a one-character String and that ?x will do
> the same. There will be a new method like String#codepoint for getting
> the underlying raw bytes. I think the one-character Strings can later
> still be optimized fairly easily so that they can be immediate Objects.

An addition and two questions: the encoding of the source file will be
indicated with the same approach as Python:
#!/usr/bin/ruby
# -*- coding: <encoding name> -*-

or by a command-line option (maybe -K) or at compile-time configuration.
But I wonder: why can't we keep using $KCODE for this, instead of having
to use that ugly magic string?

Also, not that I am an expert, but is localization supposed to work?
I.e., are accented letters, which are common in European languages,
supposed to be capitalizable and such?
Isn't this related to a charset property of the string, distinct from
its encoding?
IIRC in Parrot-land a string is a <stream of
bytes>+<encoding>+<charset>+<language>; how come we only care
about one of these things?

Also, given that this seems like a huge amount of work: will it be spun
off into a proper independent libm17n library? :)

Yukihiro Matsumoto

1/16/2005 2:01:00 PM

Hi,

In message "Re: Unicode/multibyte string support in Ruby1.9/Ruby summary?"
on Sun, 16 Jan 2005 19:59:30 +0900, ts <decoux@moulon.inra.fr> writes:

|Y> He's right, except that the encoding will be stored using the FL_USER
|Y> flags or an instance variable of the string.
|
| My question was precisely about
|
| "RString record of Ruby will get a new field"
|
| i.e. I've read ruby_m17n :-)

I know that you know. It's just for the rest of us.

matz.

nobu.nokada

1/16/2005 2:12:00 PM

Hi,

At Sun, 16 Jan 2005 20:41:08 +0900,
gabriele renzi wrote in [ruby-talk:126677]:
> An addition and two questions: the encoding of the source file will be
> indicated with the same approach as Python:
> #!/usr/bin/ruby
> # -*- coding: <encoding name> -*-
>
> or by a command-line option (maybe -K) or at compile-time configuration.
> But I wonder: why can't we keep using $KCODE for this, instead of having
> to use that ugly magic string?

Since encodings may vary per file, -K would not be enough.
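
The per-file point can be sketched like this, assuming the magic-comment handling and the `__ENCODING__` keyword as they later shipped in Ruby 1.9+ (neither is confirmed in this thread). The file names and the helper `recorded_source_encodings` are made up for the example:

```ruby
require "tmpdir"

# Write two tiny source files, each with its own coding comment,
# load them, and collect the source encoding each one reports.
def recorded_source_encodings
  $source_encodings = []
  Dir.mktmpdir do |dir|
    { "latin1.rb" => "iso-8859-1", "utf8.rb" => "utf-8" }.each do |name, coding|
      path = File.join(dir, name)
      File.write(path, "# -*- coding: #{coding} -*-\n$source_encodings << __ENCODING__\n")
      load path
    end
  end
  $source_encodings.map(&:name)
end

puts recorded_source_encodings.inspect # ["ISO-8859-1", "UTF-8"]
```

A single process-wide switch could not describe both files at once, which is why the setting lives in each file.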

--
Nobu Nakada

comp.lang.ruby
