comp.lang.c

strcmp implementation with unsigned conversion

jianhua

5/23/2011 7:00:00 PM

1. Why does the standard[*] require that the values of the pair of
characters be "both interpreted as unsigned char"?

2. Can they both be interpreted as other, larger types,
e.g. int, unsigned int, long, unsigned long?

3. Does "both interpreted as unsigned char" mean that

this is wrong:

int strcmp(const char *cs, const char *ct)
{
    while (1) {
        if (*cs != *ct)
            return *cs < *ct ? -1 : 1;
        if (!*cs)
            break;
        cs++, ct++;
    }
    return 0;
}

but this is right:

/*
Copyright (C) 1991, 1992 Linus Torvalds
*/
int strcmp(const char *cs, const char *ct)
{
    unsigned char c1, c2;

    while (1) {
        c1 = *cs++; /*gcc warning: -Wconversion*/
        c2 = *ct++; /*gcc warning: -Wconversion*/
        if (c1 != c2)
            return c1 < c2 ? -1 : 1;
        if (!c1)
            break;
    }
    return 0;
}


[*] 7.23.4 Comparison functions
1 The sign of a nonzero value returned by the comparison functions
memcmp, strcmp, and strncmp is determined by the sign of the
difference between the values of the first pair of characters (both
interpreted as unsigned char) that differ in the objects being
compared.

3 Answers

Keith Thompson

5/23/2011 7:39:00 PM


jianhua <jhlicc@gmail.com> writes:
> 1. Why does the standard[*] require that the values of the pair of
> characters be "both interpreted as unsigned char"?

So that it gives consistent results for characters outside the range
0..SCHAR_MAX (commonly 0..127).

ASCII, for example, is a strictly 7-bit character set, so any values
outside the range 0..127 are not valid characters. But most other
character sets, including EBCDIC and modern ASCII-based sets such as
Latin-1 and the various Unicode representations, do have meaningful
character values above 127.

For example, the copyright sign has the code 169 (0xa9) in Latin-1. If
plain char is signed, storing the value 169 in a char object will
probably cause it to be stored as -87; if plain char is unsigned, it's
just stored as 169. By interpreting the stored value *as if* it were an
unsigned char, strcmp() consistently treats the copyright sign as being
greater than, for example, the letter 'c'. Without this requirement,
collation sequences could differ depending on whether the compiler
chooses to make plain char signed or unsigned.
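
A minimal sketch of that effect, not taken from any real library. It
assumes 8-bit bytes, two's complement and an ASCII-compatible execution
character set. The helper names are hypothetical, and signed char and
unsigned char are used explicitly to stand in for the two choices a
compiler can make for plain char. The byte 0xE9 is e-acute in Latin-1.

```c
#include <assert.h>

/* Hypothetical helpers: read one byte through each character type. */
static int byte_as_signed(const void *p)
{
    return *(const signed char *)p;
}

static int byte_as_unsigned(const void *p)
{
    return *(const unsigned char *)p;
}

/* Each returns 1 if byte a sorts before byte b under that view, else 0. */
static int sorts_before_signed(const void *a, const void *b)
{
    return byte_as_signed(a) < byte_as_signed(b);
}

static int sorts_before_unsigned(const void *a, const void *b)
{
    return byte_as_unsigned(a) < byte_as_unsigned(b);
}

/* Usage sketch: the same byte lands on opposite sides of 'c'. */
void demo_collation(void)
{
    const unsigned char e_acute = 0xE9;  /* e-acute in Latin-1 */
    const unsigned char letter_c = 'c';  /* 99 in ASCII */

    assert(sorts_before_signed(&e_acute, &letter_c));    /* -23 < 99 */
    assert(!sorts_before_unsigned(&e_acute, &letter_c)); /* 233 > 99 */
}
```

Under the signed view the accented letter sorts before every ASCII
letter; under the unsigned view it sorts after them, which is the order
the standard's wording pins down.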

In principle, I'm not sure that the semantics are entirely well defined.
In practice, it works.

> 2. Can they both be interpreted as other, larger types,
> e.g. int, unsigned int, long, unsigned long?

I'm not even sure what that means. The phrase "interpreted as" means, I
think, that the representation of the char object is treated as if it
were an unsigned char object. I don't think it makes sense to treat a
char object as something bigger than one byte.

> 3. Does "both interpreted as unsigned char" mean that
>
> this is wrong:
>
> int strcmp(const char *cs, const char *ct)
> {
>     while (1) {
>         if (*cs != *ct)
>             return *cs < *ct ? -1 : 1;
>         if (!*cs)
>             break;
>         cs++, ct++;
>     }
>     return 0;
> }
>
> but this is right:
>
> /*
> Copyright (C) 1991, 1992 Linus Torvalds
> */
> int strcmp(const char *cs, const char *ct)
> {
>     unsigned char c1, c2;
>
>     while (1) {
>         c1 = *cs++; /*gcc warning: -Wconversion*/
>         c2 = *ct++; /*gcc warning: -Wconversion*/
>         if (c1 != c2)
>             return c1 < c2 ? -1 : 1;
>         if (!c1)
>             break;
>     }
>     return 0;
> }

Yes. More precisely, the first is not portable; it works just fine if
plain char is unsigned, but it gives incorrect results if plain char is
signed and some of the values being compared exceed SCHAR_MAX.
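
A rough way to see the disagreement on any machine, sketched under the
same assumptions as the thread (8-bit bytes, two's complement). The
function cmp_signed_view is a hypothetical stand-in for the first
version as it would behave where plain char is signed; it reads through
signed char so the result does not depend on the compiler's choice.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the first version on a signed-char
   platform: every byte is compared as a signed char. */
static int cmp_signed_view(const char *cs, const char *ct)
{
    const signed char *a = (const signed char *)cs;
    const signed char *b = (const signed char *)ct;

    for (;; a++, b++) {
        if (*a != *b)
            return *a < *b ? -1 : 1;
        if (*a == 0)
            return 0;
    }
}

/* Sign of the library strcmp result, normalised to -1, 0 or 1. */
static int sign_of_strcmp(const char *cs, const char *ct)
{
    int r = strcmp(cs, ct);
    return (r > 0) - (r < 0);
}

/* Usage sketch: the two orderings differ only for bytes above SCHAR_MAX. */
void demo_disagreement(void)
{
    const char *accented = "caf\xE9"; /* ends in e-acute (Latin-1) */
    const char *plain = "cafe";

    assert(sign_of_strcmp(accented, plain) == 1);   /* 233 > 'e' */
    assert(cmp_signed_view(accented, plain) == -1); /* -23 < 'e' */
    assert(cmp_signed_view("abc", "abd") == sign_of_strcmp("abc", "abd"));
}
```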

>
> [*] 7.23.4 Comparison functions
> 1 The sign of a nonzero value returned by the comparison functions
> memcmp, strcmp, and strncmp is determined by the sign of the
> difference between the values of the first pair of characters (both
> interpreted as unsigned char) that differ in the objects being
> compared.

--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.ne...
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Eric Sosman

5/23/2011 10:23:00 PM


On 5/23/2011 3:38 PM, Keith Thompson wrote:
> jianhua<jhlicc@gmail.com> writes:
>> [...]
>> but this is right:
>>
>> /*
>> Copyright (C) 1991, 1992 Linus Torvalds
>> */
>> int strcmp(const char *cs, const char *ct)
>> {
>>     unsigned char c1, c2;
>>
>>     while (1) {
>>         c1 = *cs++; /*gcc warning: -Wconversion*/
>>         c2 = *ct++; /*gcc warning: -Wconversion*/
>>         if (c1 != c2)
>>             return c1 < c2 ? -1 : 1;
>>         if (!c1)
>>             break;
>>     }
>>     return 0;
>> }
>
> Yes. More precisely, the first is not portable; it works just fine if
> plain char is unsigned, but it gives incorrect results if plain char is
> signed and some of the values being compared exceed SCHAR_MAX.

I don't think the latter is perfectly portable, either (though
it's portable to all the machines Torvalds was concerned with). On
systems with signed char using ones' complement or signed magnitude
representation, both plain zero and minus zero would convert to zero
as unsigned char (if the latter conversion didn't trap), and would
then be indistinguishable. It's my belief that strcmp() et al.
should treat minus zero as greater than plain zero, because the
former has a 1-bit while the latter does not.

In short, I don't think the Standard's "interpreted as" can be
taken to have the same meaning as "converted to."
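
The distinction can be sketched as two ways of reading the same byte.
The function names are hypothetical. On a two's complement
implementation with 8-bit bytes the two readings should agree for every
bit pattern, so the check below can only show a difference on the
ones' complement or sign-and-magnitude systems described above, where
reading the minus-zero pattern through plain char might also trap.

```c
#include <limits.h>

/* "Converted to": the char value is read first and then converted,
   which reduces it modulo UCHAR_MAX + 1. */
static unsigned char converted(const char *p)
{
    return (unsigned char)*p;
}

/* "Interpreted as": the same storage is read through an unsigned char
   lvalue, so only the raw bit pattern matters. */
static unsigned char interpreted(const char *p)
{
    return *(const unsigned char *)p;
}

/* Returns 1 if both readings agree for every byte pattern on this
   implementation, 0 otherwise. Assumes UCHAR_MAX < UINT_MAX. */
int readings_agree(void)
{
    unsigned int bits;

    for (bits = 0; bits <= UCHAR_MAX; bits++) {
        unsigned char raw = (unsigned char)bits;
        const char *p = (const char *)&raw;

        if (converted(p) != interpreted(p))
            return 0;
    }
    return 1;
}
```

If readings_agree() returns 1, the conversion-based version and a
pointer-cast version are interchangeable on that implementation.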

--
Eric Sosman
esosman@ieee-dot-org.invalid

pete

5/24/2011 3:28:00 AM


jianhua wrote:
>
> 1. Why does the standard[*] require that the values
> of the pair of characters be
> "both interpreted as unsigned char"?
>
> 2. Can they both be interpreted as other, larger types,
> e.g. int, unsigned int, long, unsigned long?
>
> 3. Does "both interpreted as unsigned char" mean that
>
> this is wrong:
>
> int strcmp(const char *cs, const char *ct)
> {
>     while (1) {
>         if (*cs != *ct)
>             return *cs < *ct ? -1 : 1;
>         if (!*cs)
>             break;
>         cs++, ct++;
>     }
>     return 0;
> }
>
> but this is right:
>
> /*
> Copyright (C) 1991, 1992 Linus Torvalds
> */
> int strcmp(const char *cs, const char *ct)
> {
>     unsigned char c1, c2;
>
>     while (1) {
>         c1 = *cs++; /*gcc warning: -Wconversion*/
>         c2 = *ct++; /*gcc warning: -Wconversion*/

I agree with Eric Sosman that a conversion to unsigned char
is different from being interpreted as unsigned char.

It would be OK this way:


        c1 = *(const unsigned char *)cs++;
        c2 = *(const unsigned char *)ct++;

>         if (c1 != c2)
>             return c1 < c2 ? -1 : 1;
>         if (!c1)
>             break;
>     }
>     return 0;
> }
>
> [*] 7.23.4 Comparison functions
> 1 The sign of a nonzero value returned by the comparison functions
> memcmp, strcmp,
> and strncmp is determined by the sign of the difference between the
> values of the first
> pair of characters (both interpreted as unsigned char) that differ in
> the objects being
> compared.

--
pete
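
For reference, the pointer-cast suggestion can be put together into a
complete function. This is only a sketch: the name strcmp_bytes is
hypothetical, chosen to avoid clashing with the library strcmp, and the
casts are done once up front rather than on every read.

```c
#include <assert.h>

/* Compares two strings byte by byte, reading each byte through an
   unsigned char lvalue instead of converting a char value. */
int strcmp_bytes(const char *cs, const char *ct)
{
    const unsigned char *a = (const unsigned char *)cs;
    const unsigned char *b = (const unsigned char *)ct;
    unsigned char c1, c2;

    for (;;) {
        c1 = *a++;
        c2 = *b++;
        if (c1 != c2)
            return c1 < c2 ? -1 : 1;
        if (c1 == 0)
            return 0;
    }
}

/* Usage sketch: bytes above SCHAR_MAX sort after ASCII letters. */
void demo_strcmp_bytes(void)
{
    assert(strcmp_bytes("abc", "abc") == 0);
    assert(strcmp_bytes("abc", "abd") == -1);
    assert(strcmp_bytes("caf\xE9", "cafe") == 1);
}
```

Because both operands are already unsigned char, no conversion takes
place on the reads, so the -Wconversion warnings from the original
should not apply here.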