gabriele renzi
1/16/2005 11:37:00 AM
Florian Gross wrote:
> David Garamond wrote:
>
>> If someone could summarize the recent Unicode/multibyte string
>> discussion on a wiki, that would be nice (and _very_ useful). It will
>> help programmers prepare their code for Unicode support and backward
>> compatibility in the future. Topics should include:
>
>
> Note that lots of this was recently discussed in [ruby-core:04146]. I'll
> try to answer the questions as accurately as possible.
>
>> - how will strings be stored in memory (which will probably differ
>> between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);
>
>
> AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
> bytes for one character.) Note that the RString record of Ruby will get
> a new field for the encoding.
>
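A minimal sketch of the bytes-versus-characters distinction described above. It assumes a Ruby with encoding-aware strings (1.9 or later), not the interpreter current at the time of this thread; the variable names are illustrative.

```ruby
# Assumes Ruby 1.9+; names are illustrative only.
word = "caf\u00e9"            # "café", with é written as U+00E9 (stored as UTF-8)

char_count = word.length      # counts characters
byte_count = word.bytesize    # counts the raw bytes held in memory

puts "#{char_count} characters, #{byte_count} bytes"
p word.bytes                  # the raw bytes; é occupies two of them
```

The string holds four characters in five bytes, because é is encoded as two bytes in UTF-8.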
>> - how to check a string's charset, encoding;
>
>
> String#encoding. It will return a String.
>
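In the Ruby that eventually shipped (1.9 and later), String#encoding returns an Encoding object rather than a String; the name is available through Encoding#name. A short sketch under that assumption, with illustrative variable names:

```ruby
# Assumes Ruby 1.9+, where String#encoding returns an Encoding object.
text = "caf\u00e9"        # the \u escape makes this a UTF-8 string

enc = text.encoding
puts enc.class            # the class of the returned object
puts enc.name             # the encoding's name as a String
puts enc == Encoding::UTF_8
```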
>> - how to do various operations in the new multibyte string, especially
>> those which will be done differently compared to the classic string;
>
>
> Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.
>
>> - what will happen to the classic string (e.g. will it perhaps be
>> renamed to ByteArray or something);
>
>
> The String interface will remain the same. Strings will just gain
> the encoding facilities, but will remain largely backwards compatible
> AFAIK.
>
>> - comparison rules for cross-encoding and cross-charset strings;
>
>
> Strings that have the same encoding and the same bytes are equivalent.
> Strings that have ASCII compatible, but different encodings and only
> ASCII characters are equivalent.
> Everything else is different.
>
> I think there will be ways for converting from one encoding to another
> one, but I don't know the details.
>
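The comparison rules above can be sketched as follows. This assumes the behaviour of Ruby 1.9 and later, where String#encode converts between encodings and String#force_encoding relabels bytes without converting them; the variable names are illustrative.

```ruby
# Assumes Ruby 1.9+; names are illustrative only.
utf8       = "caf\u00e9"
latin1     = utf8.encode("ISO-8859-1")                 # same characters, different bytes
ascii_a    = "cafe".encode("UTF-8")
ascii_b    = "cafe".encode("ISO-8859-1")
same_bytes = latin1.dup.force_encoding("ISO-8859-15")  # same bytes, different label

puts ascii_a == ascii_b               # ASCII-only, ASCII-compatible encodings
puts utf8 == latin1                   # non-ASCII content, different encodings
puts latin1 == same_bytes             # identical bytes, but encodings differ
puts utf8 == latin1.encode("UTF-8")   # equal again after converting back
```

Only the first and last comparisons are true: ASCII-only strings compare equal across ASCII-compatible encodings, and anything else must share both encoding and bytes.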
>> - regexes;
>
>
> Regexp#encoding is introduced; matching uses rules similar to those for
> String comparison.
>
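A hedged sketch of how this looks in Ruby 1.9 and later: an ASCII-only pattern matches a string in any ASCII-compatible encoding, while a pattern with non-ASCII content refuses to match a non-ASCII string in a different encoding. Variable names are illustrative.

```ruby
# Assumes Ruby 1.9+; names are illustrative only.
latin1   = "caf\u00e9".encode("ISO-8859-1")
ascii_re = /caf/          # ASCII-only pattern
utf8_re  = /\u00e9/       # the \u escape fixes this pattern to UTF-8

puts ascii_re.encoding
puts utf8_re.encoding

ascii_hit = ascii_re =~ latin1    # index of the match

begin
  utf8_re =~ latin1
  mismatch = nil
rescue Encoding::CompatibilityError => e
  mismatch = e.class              # incompatible regexp and string encodings
end
puts mismatch.inspect
```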
>> - how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte
>> string support (especially since Ruby is a pretty latecomer in the
>> Unicode scene);
>
>
> I can't really do an in-depth comparison here, because I don't know the
> other languages.
>
> Note that str[0] will return a one-character String and that ?x will do
> the same. There will be a new method like String#codepoint for getting
> the underlying raw bytes. I think the one-character Strings can later
> still be optimized fairly easily so that they can be immediate Objects.
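For illustration, this is how indexing and character literals behave in Ruby 1.9 and later. The integer value of a character is exposed there through String#ord and String#codepoints, and the raw bytes through String#bytes; whether any of these corresponds to the method planned above is an assumption.

```ruby
# Assumes Ruby 1.9+; names are illustrative only.
word  = "\u00e9t\u00e9"    # "été"
first = word[0]            # a one-character String, not an Integer

puts first.class
puts first == "\u00e9"
puts ?x.class              # the ?x literal is a String as well
puts first.ord             # code point of the first character
p word.codepoints          # code points of every character
p first.bytes              # raw bytes of the first character
```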
An addition and two questions: the encoding of the source file will be
indicated using the same approach as Python:
#!/usr/bin/ruby
# -*- coding: <encoding name> -*-
or by a command-line option (maybe -K), or by compile-time configuration.
But I wonder: why can't we keep using $KCODE for this, instead of having
to use that ugly magic string?
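A small sketch of the magic comment as it works in Ruby 1.9 and later, where it must sit on the first line of the file (or the second, after a shebang) and `__ENCODING__` reports the resulting source encoding. The variable names are illustrative, and utf-8 is just an example value.

```ruby
# -*- coding: utf-8 -*-
# Assumes Ruby 1.9+; names are illustrative only.
source_encoding  = __ENCODING__        # encoding the parser used for this file
literal_encoding = "abc".encoding      # plain literals inherit the source encoding

puts source_encoding
puts literal_encoding
```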
Also, not that I am an expert, but is localization supposed to work?
That is, are accented letters, which are common in European languages,
supposed to be capitalizable and so on?
Isn't this related to a charset property of the string, distinct from
its encoding?
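For reference, full Unicode case mapping only arrived in Ruby 2.4, long after this thread. There, language-sensitive behaviour is selected by an option passed to the method rather than by a property of the string. A sketch assuming Ruby 2.4 or later, with illustrative names:

```ruby
# Assumes Ruby 2.4+ (Unicode-aware case mapping); names are illustrative only.
school = "\u00e9cole"      # "école"
street = "stra\u00dfe"     # "straße"

puts school.capitalize             # accented first letter is upcased
puts street.upcase                 # ß expands to SS
puts "i".upcase(:turkic)           # Turkic rule: dotted capital I (U+0130)
```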
IIRC in Parrot-land a string is a <stream of
bytes>+<encoding>+<charset>+<language>; how come we care about just
one of these things?
Also, given that this seems like a huge amount of work... will it be
spun off into a proper independent libm17n library? :)