Asp Forum - strncmp and unsigned char

me

5/19/2011 9:04:00 PM

Hi guys,

I'm using an utf8 state-machine I made to check and handle unicode
strings, and was wondering if strncmp could be used for comparing the
after check or if I should roll my own?

Its prototype accepts const char * and (on Linux at least) internally
uses unsigned char.

What should I do?

Regards,

6 Answers

Shao Miller

5/19/2011 9:12:00 PM

On 5/19/2011 4:04 PM, me wrote:
> I'm using an utf8 state-machine I made to check and handle unicode
> strings, and was wondering if strncmp could be used for comparing the
> after check or if I should roll my own?
>
> Its prototype accepts const char * and (on Linux at least) internally
> uses unsigned char.
>
> What should I do?

Might you be interested in 'wcsncmp()'? :)

Ben Bacarisse

5/19/2011 10:13:00 PM

me <ecosta.tmp@gmail.com> writes:

> I'm using an utf8 state-machine I made to check and handle unicode
> strings, and was wondering if strncmp could be used for comparing the
> after check or if I should roll my own?

This confused me until I decided that a "strings" was missing:

| if strncmp could be used for comparing the [strings] after check[ing]

Is that what you meant? If so, you certainly could use strncmp, but the
result would be much less useful than a proper Unicode compare. As has
been suggested, you could convert to a wide string and use wcsncmp (or
wcscmp).

However, if all you want is a rather arbitrary ordering (say for a
binary search) then the byte comparison of the UTF8 encoded strings
would do.

> Its prototype accepts const char * and (on Linux at least) internally
> uses unsigned char.

That's not an issue. All of C's compare functions treat the bytes as if
they were unsigned char, despite the prototypes. If you don't like the
look of the prototype, memcmp uses void *.
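
A minimal sketch of the point above, assuming a conforming C library. The
helper names (cmp_sign, utf8_bytes_ncmp, plain_char_cmp) are made up for
illustration and are not part of any library. The first helper relies on
strncmp, which compares as unsigned char; the second compares through plain
char, which is what the original poster was worried about.

```c
#include <string.h>

/* Reduce a comparison result to -1, 0 or 1 so results are easy to compare. */
int cmp_sign(int r)
{
    return (r > 0) - (r < 0);
}

/* Hypothetical helper: sign of strncmp() over at most n bytes.
   strncmp interprets each byte as unsigned char, so a UTF-8 lead byte
   such as 0xC3 always sorts after any ASCII byte. */
int utf8_bytes_ncmp(const char *a, const char *b, size_t n)
{
    return cmp_sign(strncmp(a, b, n));
}

/* Hypothetical hand-rolled compare through plain char, for contrast.
   Its sign for bytes >= 0x80 depends on whether plain char is signed,
   so it is NOT a safe replacement for strcmp on UTF-8 data. */
int plain_char_cmp(const char *a, const char *b)
{
    while (*a && *a == *b) {
        a++;
        b++;
    }
    return cmp_sign(*a - *b);
}
```

With "z" (0x7A) and "\xC3\xA9" (U+00E9 in UTF-8), utf8_bytes_ncmp should
report "z" first on any conforming implementation, whereas plain_char_cmp
may report either order depending on the platform.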

--
Ben.

Angel

5/19/2011 10:25:00 PM

On 2011-05-19, Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:
>
>> Its prototype accepts const char * and (on Linux at least) internally
>> uses unsigned char.
>
> That's not an issue. All of C's compare functions treat the bytes as if
> they were unsigned char, despite the prototypes. If you don't like the
> look of the prototype, memcmp uses void *.

Unlike the str*cmp() functions, memcmp() doesn't check for null bytes so
if you do that you might end up comparing garbage data if the strings
are shorter than the given size.
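
One way to avoid that, sketched under the assumption that the caller already
knows both string lengths (for example from the validation pass). The name
bounded_memcmp is hypothetical. It never reads past either string and should
give the same sign as strncmp for strings without embedded null bytes.

```c
#include <stddef.h>
#include <string.h>

/* Compare at most n bytes of a and b, where alen and blen are the string
   lengths (excluding the terminating null byte). memcmp is only ever asked
   to read bytes that both strings actually contain. */
int bounded_memcmp(const char *a, size_t alen,
                   const char *b, size_t blen, size_t n)
{
    size_t la = alen < n ? alen : n;      /* bytes of a that take part */
    size_t lb = blen < n ? blen : n;      /* bytes of b that take part */
    size_t common = la < lb ? la : lb;    /* bytes present in both */
    int r = memcmp(a, b, common);

    if (r != 0)
        return r;
    /* Common prefix is equal: the shorter string sorts first. */
    return (la > lb) - (la < lb);
}
```

For example, bounded_memcmp("abc", 3, "abcd", 4, 10) is negative, while a
bare memcmp(a, b, 10) on those two strings would read past the end of both.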

--
"C provides a programmer with more than enough rope to hang himself.
C++ provides a firing squad, blindfold and last cigarette."
- seen in comp.lang.c

christian.bau

5/20/2011 6:03:00 PM

> I'm using an utf8 state-machine I made to check and handle unicode
> strings, and was wondering if strncmp could be used for comparing the
> after check or if I should roll my own?

strcmp will compare strings and return a result assuming that the data
is signed char.
UTF-8 assumes that the string consists of bytes with values from 1 to
255.

What will happen is that strcmp will correctly return 0 if and only if
all Unicode code points are equal. If you check whether the sign is <
0 or > 0, it depends on whether plain char is signed: If it is
unsigned, then the result is the correct result for Unicode code
points as well. If it is signed two's complement, then it will put
Unicode code points >= U+0080 before all Unicode code points < U+0080.
And since strcmp doesn't tell you where in the strings the difference
was, you can't fix that.

The main problem is that with Unicode, just comparing code points
isn't very meaningful. You'd have to put the code points into a
canonical order at least to get any meaningful result. And when you do
that, using strcmp is quite pointless.

Keith Thompson

5/20/2011 6:47:00 PM

"christian.bau" <christian.bau@cbau.wanadoo.co.uk> writes:
>> I'm using an utf8 state-machine I made to check and handle unicode
>> strings, and was wondering if strncmp could be used for comparing the
>> after check or if I should roll my own?
>
> strcmp will compare strings and return a result assuming that the data
> is signed char.

No, it won't.

strcmp's arguments are of type const char*; plain char may be either
signed or unsigned. But even if plain char is signed, 7.21.4p1 says:

The sign of a nonzero value returned by the comparison functions
memcmp, strcmp, and strncmp is determined by the sign of the
difference between the values of the first pair of characters
(both interpreted as unsigned char) that differ in the objects
being compared.

[...]

> The main problem is that with Unicode, just comparing code points
> isn't very meaningful. You'd have to put the code points into a
> canonical order at least to get any meaningful result. And when you do
> that, using strcmp is quite pointless.

I *think* that strcmp() returns correctly ordered results for UTF-8
strings. UTF-8 was carefully designed to make this work.

--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.ne...
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Ben Bacarisse

5/20/2011 10:17:00 PM

Keith Thompson <kst-u@mib.org> writes:

> "christian.bau" <christian.bau@cbau.wanadoo.co.uk> writes:
<snip>
>> The main problem is that with Unicode, just comparing code points
>> isn't very meaningful. You'd have to put the code points into a
>> canonical order at least to get any meaningful result. And when you do
>> that, using strcmp is quite pointless.
>
> I *think* that strcmp() returns correctly ordered results for UTF-8
> strings. UTF-8 was carefully designed to make this work.

It all depends on "correctly ordered" of course. A byte-by-byte compare
of correctly encoded UTF-8 strings preserves the ordering on the
code points the strings represent. To put it another way, converting to
wide strings and using wcscmp will give the same result as strcmp will
when passed the originals. The encoded strings must not contain any
over-long representations (nor any other forbidden bytes or byte
combinations) but I think the OP has covered that since they talked
about checking the strings first.
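
The order-preservation claim can be checked with a small sketch. The decoder
below is a hypothetical helper that assumes its input has already been
validated as well-formed UTF-8 (it does no error checking of its own);
codepoint_cmp then compares two strings code point by code point, and its
sign should agree with that of strcmp on the same strings.

```c
#include <string.h>

/* Decode one code point from validated UTF-8 and advance *p past it.
   Assumes well-formed input: no truncated or over-long sequences. */
static unsigned long utf8_next(const unsigned char **p)
{
    const unsigned char *s = *p;
    unsigned long cp;
    int extra;                       /* number of continuation bytes */

    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }
    s++;
    while (extra-- > 0)
        cp = (cp << 6) | (*s++ & 0x3F);
    *p = s;
    return cp;
}

/* Compare two validated UTF-8 strings by code point value.
   Returns -1, 0 or 1. */
int codepoint_cmp(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    for (;;) {
        unsigned long ca = utf8_next(&pa);
        unsigned long cb = utf8_next(&pb);
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)                 /* both strings ended together */
            return 0;
    }
}
```

For instance, "\xC3\xA9" (U+00E9) sorts before "\xE2\x82\xAC" (U+20AC) under
both codepoint_cmp and strcmp.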

However, because Unicode says so much about the characters, one could
argue that a truly correct ordering should be rather more than this.
For example, "fine" with an fi ligature should compare equal to "fine"
without one and so on. If that seems too much like a detail, in some
scripts the code points are not in the correct collating sequence for
even the most basic ordering. That's what Christian Bau is saying, I
think.
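
A small illustration of the ligature case, as a sketch only. The names
fine_ligature, fine_plain and bytewise_equal are invented for this example.
The first string spells "fine" with the single code point U+FB01 (LATIN
SMALL LIGATURE FI), encoded in UTF-8 as EF AC 81.

```c
#include <string.h>

/* "fine" written with the fi ligature (U+FB01) followed by "ne". */
static const char fine_ligature[] = "\xEF\xAC\x81ne";

/* "fine" written with the plain letters f and i. */
static const char fine_plain[] = "fine";

/* Byte-wise equality, which is all strcmp can offer. */
int bytewise_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

bytewise_equal(fine_ligature, fine_plain) is 0: strcmp sees different bytes
and knows nothing about the two spellings being equivalent. Treating them as
equal would need Unicode normalization or collation, which the standard
string functions do not provide.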

--
Ben.

comp.lang.c
