Asp Forum - sorting Array of accentuated Strings

unbewusst.sein

12/6/2007 7:37:00 PM

I've written a "self <=> anotherString" comparison which works on its own:

class String
  def <=>( aString )
    [blahblahblah]
  end
end

However, if I want to sort an Array of Strings, the Array is sorted as
usual in Ruby (accented characters are put at the end):

a = [ "Être", "Fenêtre", "Etre" ]

b = a.sort { | i, j | i <=> j }
puts "b = [ " + b.join(", ") + " ]\n"
# => b = [ Etre, Être, Fenêtre ]

c = a.sort #<=>( aString ) NOT CALLED...
puts "c = [ " + c.join(", ") + " ]\n"
# => c = [ Etre, Fenêtre, Être ]

Why, in the case of "c = a.sort", isn't #<=>( aString ) called?
Is it because the comparison within an Array could involve different
kinds of objects, like:

a = [ 0, "a", 9, Time.new ] ???

--
Une Bévue

8 Answers

Dan Yoder

12/6/2007 10:27:00 PM

On Dec 6, 11:36 am, unbewusst.s...@weltanschauung.com.invalid (Une
Bévue) wrote:
> I've done a "self <=> anotherString" comparaison which works by itself :
>
> class String
> def <=>( aString )
> [blahblahblah]
> end
> end
>
> however if i want to sort an Array of Strings, the Array is sorted as
> usual within Ruby (put accentuated characters at the end) :
>
> a = [ "Être", "Fenêtre", "Etre" ]
>
> b = a.sort { | i, j | i <=> j }
> puts "b = [ " + b.join(", ") + " ]\n"
> # => b = [ Etre, Être, Fenêtre ]
>
> c = a.sort #<=>( aString ) NOT CALLED...
> puts "c = [ " + c.join(", ") + " ]\n"
> # => c = [ Etre, Fenêtre, Être ]
>
> Why, in the case of "c = a.sort", #<=>( aString ) isn't called ?
> the comparaison within an Array could compare different kind of objects
> like :
>
> a = [ 0, "a", 9, Time.new ] ???
>
> --
> Une Bévue

I believe that the C implementation of Array#sort checks whether the
elements are Strings and, in that case, calls the C string comparison
function directly instead of dispatching to String#<=>. See:

http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/array.c?v...

under the code for sort_2.
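
A minimal sketch of a workaround that does not depend on how a bare Array#sort dispatches: pass the comparison explicitly as a block, as in the "b = a.sort { ... }" case from the original post. The fold table and the names accent_cmp and calls are hypothetical and cover only the letters used in this example.

```ruby
# Hypothetical fold table: only the accented forms of "e" used in this example.
accented = "ÊêÉéÈèËë"
plain    = "EeEeEeEe"

calls = 0

# Explicit comparator: compare the accent-folded forms first, then fall back
# to the original strings so that "Etre" sorts before "Être".
accent_cmp = lambda do |x, y|
  calls += 1
  [x.tr(accented, plain), x] <=> [y.tr(accented, plain), y]
end

a = ["Être", "Fenêtre", "Etre"]
b = a.sort(&accent_cmp)

puts "b = [ #{b.join(', ')} ]"
puts "comparator called #{calls} time(s)"
```

Because the block is handed to sort directly, it is called for every comparison, whatever shortcut the C code takes when no block is given.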

You might be interested in using some of the emerging Unicode support
in Ruby. Ruby 2 will have it built-in and there are several libraries
out there, although I don't have any experience using them.

Regards,
Dan Yoder
http://dev.ze...

unbewusst.sein

12/8/2007 7:09:00 AM

cruiserdan <dan@zeraweb.com> wrote:

>
> You might be interested in using some of the emerging Unicode support
> in Ruby. Ruby 2 will have it built-in and there are several libraries
> out there, although I don't have any experience using them.

Right, thanks. I've only written a workaround until I get Ruby 2...
--
Une Bévue

Rishabh Shrivastava

12/8/2007 7:31:00 AM

Une Bévue wrote:
> cruiserdan <dan@zeraweb.com> wrote:
>
>>
>> You might be interested in using some of the emerging Unicode support
>> in Ruby. Ruby 2 will have it built-in and there are several libraries
>> out there, although I don't have any experience using them.
>
> right, thanks, i've only wrote a workaround before getting Ruby 2...

Yeah, actually I want to create a function for a VPN connection for Ruby
in Watir.
Please provide one if it is possible.

--
Posted via http://www.ruby-....

unbewusst.sein

12/8/2007 10:22:00 AM

cruiserdan <dan@zeraweb.com> wrote:

> You might be interested in using some of the emerging Unicode support
> in Ruby.

Unfortunately I can't download:
<ftp://ftp.mars.org/pub/ruby/Unicode.t...
Maybe the server is down?

--
Une Bévue

MonkeeSage

12/8/2007 1:12:00 PM

On Dec 8, 1:09 am, unbewusst.s...@weltanschauung.com.invalid (Une
Bévue) wrote:
> cruiserdan <d...@zeraweb.com> wrote:
>
> > You might be interested in using some of the emerging Unicode support
> > in Ruby. Ruby 2 will have it built-in and there are several libraries
> > out there, although I don't have any experience using them.
>
> right, thanks, i've only wrote a workaround before getting Ruby 2...
> --
> Une Bévue

Hmmm. Maybe I'm mistaken, but this seems to have nothing to do with
Unicode. An ASCII char is always going to compare less than a non-ASCII
UTF-8 char: ASCII bytes stop at \x7f, while every byte of a multibyte
UTF-8 sequence is \x80 or above.

Fenêtre <=> Être ->

F (\x46) <=> Ê (\xc3\x8a) ->

-1

To get the right behavior I think you have to translate the utf-8
characters to ascii. You can try something like:

require 'iconv'
class String
  def translit
    Iconv.iconv('ascii//translit', 'utf-8', self)[0]
  end
end
a.sort { | i, j | i.translit <=> j.translit }

But some people have had strange effects from #iconv (e.g., a recent
thread [1]).

Regards,
Jordan

[1] http://groups.google.com/group/comp.lang.ruby/browse_thread/thread/9fbb85fa49dd700f/26311c1a3844267d#26311c...
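
For readers on a current Ruby, where Iconv is no longer part of the standard library (it was removed in Ruby 2.0), here is a rough sketch of the same "reduce to unaccented letters, then compare" idea. It assumes Ruby 2.2 or later for String#unicode_normalize; the helper name strip_accents is hypothetical and not from the thread. It only handles letters that decompose into a base letter plus combining marks, so it is not a full transliteration.

```ruby
# Hypothetical helper: decompose each character into base letter + combining
# marks (NFD), then drop the combining marks (Unicode category Mn).
def strip_accents(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

a = ["Être", "Fenêtre", "Etre"]

# Sort on the stripped form; the original string is the tie-breaker so that
# "Etre" and "Être" come out in a stable, predictable order.
c = a.sort_by { |s| [strip_accents(s), s] }

puts "c = [ #{c.join(', ')} ]"
```

Using sort_by computes the key once per element rather than once per comparison, which matters when the key is relatively expensive to build.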

Axel Etzold

12/8/2007 2:29:00 PM

-------- Original Message --------
> Date: Sat, 8 Dec 2007 22:15:00 +0900
> From: MonkeeSage <MonkeeSage@gmail.com>
> To: ruby-talk@ruby-lang.org
> Subject: Re: sorting Array of accentuated Strings

> On Dec 8, 1:09 am, unbewusst.s...@weltanschauung.com.invalid (Une
> Bévue) wrote:
> > cruiserdan <d...@zeraweb.com> wrote:
> >
> > > You might be interested in using some of the emerging Unicode support
> > > in Ruby. Ruby 2 will have it built-in and there are several libraries
> > > out there, although I don't have any experience using them.
> >
> > right, thanks, i've only wrote a workaround before getting Ruby 2...
> > --
> > Une Bévue
>
> Hmmm. Maybe I'm mistaken, but this seems to have nothing to do with
> unicode. An ascii char is always going to be less than a utf-8 char,
> since utf-8 is a superset of ascii.
>
> Fenêtre <=> Être ->
>
> F (\x46) <=> Ê (\xc3\x8a) ->
>
> -1
>
> To get the right behavior I think you have to translate the utf-8
> characters to ascii. You can try something like:
>
> require 'iconv'
> class String
> def translit
> Iconv.iconv('ascii//translit', 'utf-8', self)[0]
> end
> end
> a.sort { | i, j | i.translit <=> j.translit }
>
> But some people have had strange effects from #iconv (e.g., a recent
> thread [1]).
>
> Regards,
> Jordan

Besides that, the problem of sorting accented strings seems to be
somewhat unsolvable, as different natural languages using the
same accents have different conventions.
I'd claim the highest degree of inconsistency in this issue
for the German language (other proposals invited):

- German phone books sort words containing Ä, Ö, Ü as if they were
spelled with "AE", "OE", "UE",
- otherwise, the diacritics are quite often just ignored,
- in Austria, including in phone books, letters with diacritics sort after "z" (just as in Swedish, where Ä and Ö are also used, but consistently),
- French and Spanish use a diaeresis on some letters to mark that
they have to be pronounced separately (Citroën, Camagüey).

(see: http://en.wikipedia.org/wiki...)

How can one establish a single standard, for all (natural) languages
with such a confusion ?

I'd recommend using a couple of gsub calls, much like Xavier Noria
proposed in his post

http://groups.google.de/group/comp.lang.ruby/browse_thread/thread/9fbb85fa49dd700f/eed035...

and to adapt them to the situation at hand to pre-process the strings
to sort.
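
As an illustration of this kind of language-specific pre-processing, here is a small sketch of the German phone-book rule mentioned above (umlauts sorted as "ae", "oe", "ue"). The substitution table, the name phonebook_key and the sample names are assumptions made for the example; this is not the code from the linked post.

```ruby
# Assumed rule set: only the three umlaut substitutions described above.
umlauts = {
  "ä" => "ae", "ö" => "oe", "ü" => "ue",
  "Ä" => "Ae", "Ö" => "Oe", "Ü" => "Ue"
}

# Build a sort key by expanding umlauts, then ignoring case.
phonebook_key = ->(s) { s.gsub(/[äöüÄÖÜ]/, umlauts).downcase }

names = ["Muller", "Müller", "Mudra"]

plain     = names.sort                                        # byte-wise order
phonebook = names.sort_by { |s| [phonebook_key.call(s), s] }  # "Müller" as "Mueller"

puts "plain     = #{plain.inspect}"
puts "phonebook = #{phonebook.inspect}"
```

A different convention (for example, umlauts after "z") would only need a different key function; the sort_by call stays the same.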

Best regards,

Axel


MonkeeSage

12/8/2007 3:09:00 PM

On Dec 8, 8:29 am, Axel Etzold <AEtz...@gmx.de> wrote:
> -------- Original Message --------
>
> > Date: Sat, 8 Dec 2007 22:15:00 +0900
> > From: MonkeeSage <MonkeeS...@gmail.com>
> > To: ruby-t...@ruby-lang.org
> > Subject: Re: sorting Array of accentuated Strings
> > On Dec 8, 1:09 am, unbewusst.s...@weltanschauung.com.invalid (Une
> > Bévue) wrote:
> > > cruiserdan <d...@zeraweb.com> wrote:
>
> > > > You might be interested in using some of the emerging Unicode support
> > > > in Ruby. Ruby 2 will have it built-in and there are several libraries
> > > > out there, although I don't have any experience using them.
>
> > > right, thanks, i've only wrote a workaround before getting Ruby 2...
> > > --
> > > Une Bévue
>
> > Hmmm. Maybe I'm mistaken, but this seems to have nothing to do with
> > unicode. An ascii char is always going to be less than a utf-8 char,
> > since utf-8 is a superset of ascii.
>
> > Fenêtre <=> Être ->
>
> > F (\x46) <=> Ê (\xc3\x8a) ->
>
> > -1
>
> > To get the right behavior I think you have to translate the utf-8
> > characters to ascii. You can try something like:
>
> > require 'iconv'
> > class String
> > def translit
> > Iconv.iconv('ascii//translit', 'utf-8', self)[0]
> > end
> > end
> > a.sort { | i, j | i.translit <=> j.translit }
>
> > But some people have had strange effects from #iconv (e.g., a recent
> > thread [1]).
>
> > Regards,
> > Jordan
>
> Besides that, the problem of sorting accented strings seems to be
> somewhat unsolvable, as different natural languages using the
> same accents have different conventions.
> I'd claim the highest degree of inconsistency in this issue
> for the German language (other proposals invited):
>
> - German phone books sort words containing <A-DIAERESIS>,<O-DIAERESIS>,
> <U-DIAERESIS>, as if they were spelled with "AE","OE","UE" instead of <A-DIAERESIS> etc.,
> - otherwise, the diacritics are quite often just ignored,
> - in Austria, including in phone books, diacritics come behind "z" .... (just like in Swedish, where <A-DIAERESIS>,<O-DIAERESIS> are also used (but consistently),
> - French and Spanish use diaeresis on some letters to mark that
> they have to be pronounced separately (Citro{"e}n,Camag{"u}ey).
>
> (see:http://en.wikipedia.org/wiki...)
>
> How can one establish a single standard, for all (natural) languages
> with such a confusion ?

Just to emphasize the point... Greek η (eta) can be transliterated as
e, ē (yet another level of indirection!), h or i. ;)

> I'd recommend to use a couple of gsub calls, much like Xavier Noria
> proposed in his post
>
> http://groups.google.de/group/comp.lang.ruby/browse_thread/......
>
> and to adapt them to the situation at hand to pre-process the strings
> to sort.
>
> Best regards,
>
> Axel

Regards,
Jordan

unbewusst.sein

12/8/2007 7:13:00 PM

MonkeeSage <MonkeeSage@gmail.com> wrote:

> An ascii char is always going to be less than a utf-8 char,
> since utf-8 is a superset of ascii.
>
> Fenêtre <=> Être ->
>
> F (\x46) <=> Ê (\xc3\x8a) ->
>
> -1

Yes, for sure, but in order to compare F to Ê I have to know that F uses
only one byte and Ê two bytes; in other words, decompose a string into an
array of characters and compare afterwards...

Obviously, for the question raised by Axel, that is to say the ordering
between:

èéêë

(the various diacritics), I leave the ordering as given by the Unicode
code point (UTF-8 in my case).

Quite frankly, I don't know what the French convention is for that; I just
want all the accented forms of e to sort between e and f...
--
Une Bévue
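
A short sketch of the two points in this last post, assuming a Ruby whose strings are encoding-aware (1.9 or later) and UTF-8 source text: how the bytes and characters differ, and how to get the accented forms of "e" between "e" and "f" while keeping code point order among them. The fold lambda is a hypothetical helper covering only these four letters.

```ruby
# One byte for "F", two bytes for "Ê" in UTF-8.
puts "F".bytes.inspect
puts "Ê".bytes.inspect

# Splitting into characters rather than bytes.
puts "Fenêtre".chars.inspect

letters = %w[ë è f ê é e]

# Default String comparison: accented letters end up after "f".
by_codepoint = letters.sort

# Hypothetical fold covering only these four letters; the original letter is
# the tie-breaker, so è é ê ë keep their code point order after plain "e".
fold = ->(s) { s.tr("èéêë", "eeee") }
between_e_and_f = letters.sort_by { |s| [fold.call(s), s] }

puts by_codepoint.join(" ")
puts between_e_and_f.join(" ")
```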
