Ben Bacarisse
5/20/2011 10:17:00 PM
Keith Thompson <kst-u@mib.org> writes:
> "christian.bau" <christian.bau@cbau.wanadoo.co.uk> writes:
<snip>
>> The main problem is that with Unicode, just comparing code points
>> isn't very meaningful. You'd have to put the code points into a
>> canonical order at least to get any meaningful result. And when you do
>> that, using strcmp is quite pointless.
>
> I *think* that strcmp() returns correctly ordered results for UTF-8
> strings. UTF-8 was carefully designed to make this work.
It all depends on "correctly ordered" of course. A byte-by-byte compare
of correctly encoded UTF-8 strings preserves the ordering on the code
points the strings represent. To put it another way, converting to
wide strings and using wcscmp will give a result with the same sign as
strcmp gives when passed the originals (provided wchar_t is wide enough
to hold any code point). The encoded strings must not contain any
over-long representations (nor any other forbidden bytes or byte
combinations) but I think the OP has covered that since they talked
about checking the strings first.
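
To make the claim concrete, here is a minimal sketch that decodes
already-validated UTF-8 and compares the code points directly, so the
sign can be checked against strcmp. The names next_code_point,
compare_code_points, sign_of and check_same_order are mine, invented
for illustration, and the decoder assumes its input has already been
checked (it does no validation of its own).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode one code point from valid UTF-8 at *s and advance *s.
   Assumption: the input was validated beforehand (no over-long
   forms, no truncated sequences); nothing is checked here. */
uint32_t next_code_point(const unsigned char **s)
{
    const unsigned char *p = *s;
    uint32_t cp;
    int extra;

    if (p[0] < 0x80)      { cp = p[0];        extra = 0; }
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; extra = 1; }
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; extra = 2; }
    else                  { cp = p[0] & 0x07; extra = 3; }

    p++;
    while (extra-- > 0)
        cp = (cp << 6) | (uint32_t)(*p++ & 0x3F);

    *s = p;
    return cp;
}

/* Compare two valid UTF-8 strings by code point; returns -1, 0 or 1. */
int compare_code_points(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    for (;;) {
        if (*pa == 0 || *pb == 0)
            return (*pa != 0) - (*pb != 0);   /* shorter string sorts first */
        uint32_t ca = next_code_point(&pa);
        uint32_t cb = next_code_point(&pb);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
}

/* Reduce any int to -1, 0 or 1 so strcmp results can be compared. */
int sign_of(int x)
{
    return (x > 0) - (x < 0);
}

/* Assert that strcmp and the code point compare agree on the order. */
void check_same_order(const char *a, const char *b)
{
    assert(sign_of(strcmp(a, b)) == compare_code_points(a, b));
}
```

Calling check_same_order("\xF0\x9F\x98\x80", "\xEF\xBF\xBD") compares
U+1F600 with U+FFFD; both methods put U+1F600 last. Only the sign of
the strcmp result is meaningful, which is why it goes through sign_of
before being compared.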
However, because Unicode says so much about the characters, one could
argue that a truly correct ordering should be rather more than this.
For example, "fine" with an fi ligature should compare equal to "fine"
without one and so on. If that seems too much like a detail, in some
scripts the code points are not in the correct collating sequence for
even the most basic ordering. That's what Christian Bau is saying, I
think.
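
A small sketch of the ligature case, for what it's worth. The
identifiers are hypothetical, and the "Z" versus "a" check is my own
illustration of code point order differing from what a reader might
expect, not an example taken from the thread. U+FB01 is LATIN SMALL
LIGATURE FI, encoded in UTF-8 as EF AC 81.

```c
#include <assert.h>
#include <string.h>

/* "fine" spelled with U+FB01 LATIN SMALL LIGATURE FI (UTF-8: EF AC 81).
   The literal is split so the hex escape cannot absorb later characters. */
const char fine_ligature[] = "\xEF\xAC\x81" "ne";

/* "fine" spelled with four plain ASCII letters. */
const char fine_plain[] = "fine";

/* Returns 1 when a byte-wise compare treats the two spellings as
   different strings, which it does: strcmp knows nothing about
   Unicode equivalences. */
int spellings_differ(void)
{
    return strcmp(fine_ligature, fine_plain) != 0;
}

/* Two things a plain byte compare gets "wrong" from a reader's point
   of view: the ligature spelling is not equal to the plain one, and
   upper-case "Z" sorts before lower-case "a". */
void check_byte_order_surprises(void)
{
    assert(strcmp(fine_ligature, fine_plain) != 0);
    assert(strcmp("Z", "a") < 0);
}
```

Getting the two spellings to compare equal needs normalisation or a
locale-aware collation step before or instead of strcmp; standard C
offers strcoll for the latter, but what it does with these strings
depends entirely on the locale in effect.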
--
Ben.