
comp.lang.c

Is char obsolete?

Lauri Alanko

4/8/2011 1:06:00 PM

I'm beginning to wonder if I should use the char type at all any
more.

An abstract textual character is nowadays a very complex
concept. Perhaps it is best represented as a Unicode code point,
perhaps as something else, but in any case a sensible
representation of an abstract encoding-independent character
cannot fit into a char (which is almost always eight bits wide),
but needs something else: wchar_t, uint32_t, a struct, or
something.

On the other hand, if we are dealing with an encoding-specific
representation, e.g. an ASCII string or UTF-8 string or whatever,
then we'd better deal with it as pure binary data, and that is
more natural to represent as a sequence of unsigned char or
uint8_t.
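
For instance, something as basic as counting code points is a pure
byte-level operation, and reads most naturally against uint8_t (a
sketch, assuming well-formed UTF-8; the function name is just for
illustration):

#include <stddef.h>
#include <stdint.h>

/* Count the code points in a well-formed UTF-8 buffer by skipping
   continuation bytes (those of the form 10xxxxxx). Doing the bit
   tests on uint8_t keeps them independent of char's signedness. */
size_t utf8_codepoints(const uint8_t *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)   /* not a continuation byte */
            n++;
    return n;
}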

Perhaps in the olden days it was at least conceptually (if not
practically) useful to have a type char for characters, which
was distinct from signed char and unsigned char which were
for small integers. This made sense in a world where there were
several encodings but all of them were single-byte. The distinct char
type signalled: "this is meant to be a character, not just any number,
don't depend on the character's integer value if you want to be
portable".

But nowadays Unicode is everywhere, and the de facto standard
encoding is UTF-8. The char type won't cut it for characters any
more. And in those rare situations where one can still assume
that all the world is ASCII (or Latin-1, or even EBCDIC), there
is still no benefit to using char over unsigned char. Apart from
legacy library APIs, of course.

So is there any situation where a modern C programmer, without
the baggage of legacy interfaces, should still use the char
type?


Lauri
23 Answers

Chris H

4/8/2011 1:56:00 PM


In message <inn180$1e3$1@oravannahka.helsinki.fi>, Lauri Alanko
<la@iki.fi> writes
>I'm beginning to wonder if I should use the char type at all any
>more.

Yes. There are many 8-bit MCUs still in widespread use, and also very many
systems that use ASCII for characters.

>Perhaps in the olden days it was at least conceptually

The olden days are still here.

>But nowadays Unicode is everywhere,

Not everywhere.

>The char type won't cut it for characters any

But it is still needed. The Hayes AT command set still uses ASCII.

>So is there any situation where a modern C programmer, without
>the baggage of legacy interfaces, should still use the char
>type?

Depends what they are doing

--
\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
\/\/\/\/\ Chris Hills Staffs England /\/\/\/\/
\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/



Thomas Richter

4/8/2011 3:04:00 PM


On 08.04.2011 15:06, Lauri Alanko wrote:
> I'm beginning to wonder if I should use the char type at all any
> more.
>
> An abstract textual character is nowadays a very complex
> concept.

However, a "char" is not an "abstract textual character". A char is the
smallest addressable memory unit of the system you compile to.
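
For example, that is what lets you look at the representation of any
object, byte by byte (a sketch; the helper name is mine):

#include <stdio.h>
#include <stddef.h>

/* Print the bytes of any object. sizeof(char) is 1 by definition,
   and every object is some whole number of chars long. */
static void dump_bytes(const void *p, size_t n)
{
    const unsigned char *b = p;
    for (size_t i = 0; i < n; i++)
        printf("%02x ", (unsigned)b[i]);
    putchar('\n');
}

int main(void)
{
    int x = 42;
    dump_bytes(&x, sizeof x);   /* the representation of x, one byte at a time */
    return 0;
}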

> So is there any situation where a modern C programmer, without
> the baggage of legacy interfaces, should still use the char
> type?

Yes, of course. But not necessarily for text strings. It's the smallest
available integer type.

Greetings,
Thomas


Ben Bacarisse

4/8/2011 3:13:00 PM


Lauri Alanko <la@iki.fi> writes:

> I'm beginning to wonder if I should use the char type at all any
> more.
<snip>
> [...] if we are dealing with an encoding-specific
> representation, e.g. an ASCII string or UTF-8 string or whatever,
> then we'd better deal with it as pure binary data, and that is
> more natural to represent as a sequence of unsigned char or
> uint8_t.

For UTF-8, that is only true for code that pokes about in the
representation. Most code will function perfectly well treating UTF-8
encoded strings as char arrays.
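
For example (a sketch; the UTF-8 bytes are spelled out as hex escapes
only so the snippet does not depend on the source encoding):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* UTF-8 text held in an ordinary char array. strlen gives the
       byte length, strstr searches byte-wise, strcat concatenates;
       none of them need to understand the encoding. */
    char greeting[64] = "hyv\xc3\xa4\xc3\xa4 p\xc3\xa4iv\xc3\xa4\xc3\xa4";  /* "hyvää päivää" */
    printf("%zu bytes\n", strlen(greeting));
    if (strstr(greeting, "p\xc3\xa4iv"))
        puts("substring found");
    strcat(greeting, "!");
    puts(greeting);
    return 0;
}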

<snip>
> But nowadays Unicode is everywhere, and the de facto standard
> encoding is UTF-8. The char type won't cut it for characters any
> more.

I feel this is a generalisation from a specific issue -- that of
manipulating the representation. Can you say why, in general, char
won't cut it for UTF-8 encoded strings?

<snip>
> So is there any situation where a modern C programmer, without
> the baggage of legacy interfaces, should still use the char
> type?

How can one avoid this baggage? char and char arrays are used for
multi-byte encoded strings throughout the standard library.

--
Ben.

Chris H

4/8/2011 3:33:00 PM


In message <inn85n$qgo$1@news.belwue.de>, Thomas Richter <thor@math.tu-
berlin.de> writes
>On 08.04.2011 15:06, Lauri Alanko wrote:
>> I'm beginning to wonder if I should use the char type at all any
>> more.
>>
>> An abstract textual character is nowadays a very complex
>> concept.
>
>However, a "char" is not an "abstract textual character". A char is the
>smallest addressable memory unit of the system you compile to.

Mostly... I think you are confusing it with a byte (which does not
necessarily mean 8 bits, hence the term octet).

char is the smallest unit used to hold a character, which may not be the
same thing.

>> So is there any situation where a modern C programmer, without
>> the baggage of legacy interfaces, should still use the char
>> type?
>
>Yes, of course. But not necessarily for text strings. It's the smallest
>available integer type.

That is not correct. There are THREE char types.

Signed char
unsigned char
[plain] char

Signed and unsigned char are integer types.
[Plain] char is a character type, NOT an integer type. Whether it is mapped
to signed or unsigned is at the whim of the compiler implementor. Some
compilers give you the option.

--
\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
\/\/\/\/\ Chris Hills Staffs England /\/\/\/\/
\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/



Ben Bacarisse

4/8/2011 4:02:00 PM


Chris H <chris@phaedsys.org> writes:

> In message <inn85n$qgo$1@news.belwue.de>, Thomas Richter <thor@math.tu-
> berlin.de> writes
<snip>
>>[speaking of char] It's the smallest available integer type.
>
> That is not correct. There are THREE char types.
>
> Signed char
> unsigned char
> [plain] char
>
> Signed and unsigned char are integer types.
> [Plain] char is a character type NOT an integer type.

6.2.5 p17:

"The type char, the signed and unsigned integer types, and the
enumerated types are collectively called integer types. [...]"

You may be remembering the rather less helpful term "standard
integer types" defined by paragraphs 4, 6 and 7 of the same section.
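
Both facts are easy to check mechanically (a sketch using the
then-upcoming C1X _Generic feature; the macro name is just for
illustration):

#include <stdio.h>

/* plain char, signed char and unsigned char are three distinct types
   (even though plain char has the same representation as one of the
   other two), and all three are integer types, so ordinary arithmetic
   on them is well defined. */
#define TYPE_NAME(x) _Generic((x),            \
    char:          "char",                    \
    signed char:   "signed char",             \
    unsigned char: "unsigned char",           \
    default:       "something else")

int main(void)
{
    char c = 'A';
    signed char sc = 'A';
    unsigned char uc = 'A';
    printf("%s, %s, %s\n", TYPE_NAME(c), TYPE_NAME(sc), TYPE_NAME(uc));
    printf("c + 1 == %d\n", c + 1);   /* integer arithmetic on a char */
    return 0;
}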

<snip>
--
Ben.

ram

4/8/2011 4:11:00 PM


Lauri Alanko <la@iki.fi> writes:
>I'm beginning to wonder if I should use the char type at all any
>more.

>An abstract textual character is nowadays a very complex
>concept.

If someone believes that he should use »char« for
characters, then does he also believe that he should use
»float« for floating point numbers?

For characters, you use whatever representation is
appropriate for the project.

>So is there any situation where a modern C programmer, without
>the baggage of legacy interfaces, should still use the char
>type?

A char object is used to store a member of the basic
execution character set.

It can also be used for objects that should have the
size 1 or - via arrays of char - the size n.
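
A sketch of that second use: a char array of size n as raw storage
for another object's bytes.

#include <stdio.h>
#include <string.h>

int main(void)
{
    double d = 3.25;
    char raw[sizeof d];          /* an object of exactly size n */
    memcpy(raw, &d, sizeof d);   /* the char array holds the bytes of d */

    double d2;
    memcpy(&d2, raw, sizeof d2);
    printf("%g\n", d2);          /* prints 3.25 */
    return 0;
}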

Lauri Alanko

4/8/2011 5:04:00 PM


Some clarifications.

Firstly, I'm talking specifically about the type "char". The
types "signed char" and "unsigned char" are perfectly
useful (though idiosyncratically named) integer types for
operating on the smallest addressable memory
units (a.k.a. bytes).

The type "char" is distinct from these, and it is strictly less
useful as an integer (due to its implementation-specific
signedness). So the only justification for it that I can see is
that it serves as a semantic annotation: a char is a byte that is
intended to be interpreted as a character in the basic execution
character set.

But I'm saying that nowadays the basic execution character set no
longer suffices for general-purpose text manipulation. So
wherever you need to manipulate an individual character as a
character, you'd better use wchar_t or similar.
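
For instance, a per-character operation like upcasing goes through a
wide representation (a sketch, assuming a UTF-8 locale and a UTF-8
source encoding; on systems with a 16-bit wchar_t this only covers the
BMP):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
    setlocale(LC_ALL, "");            /* assume a UTF-8 locale */
    const char *mb = "hyvää päivää";  /* multibyte (UTF-8) text */
    wchar_t wide[64];
    size_t n = mbstowcs(wide, mb, 64);
    if (n == (size_t)-1)
        return 1;                     /* invalid multibyte sequence */
    for (size_t i = 0; i < n; i++)
        wide[i] = towupper(wide[i]);  /* per-character operation */
    printf("%ls\n", wide);
    return 0;
}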

The standard library uses the type "char *" for its
representation of strings: a zero-terminated sequence of bytes
where one or more adjacent bytes represent a single
character. This is fine (although the choice of representation is
questionable), but the type name is confusing: a "char *" is a
pointer to a variable-length string structure, not to
a "character" as such. If we have "char* s;" then the only thing
we know about the meaning of "s[i]" is that it is a byte that is
part of the encoding of a string. This particular use of bytes
hardly seems to be worth a distinct primitive type.

In any case, we are not forced to use the standard library as
such. Yes, it is "baggage", but it is easy to throw away: the
library is finite and it is simple to rewrite it or wrap it to a
different API. Perhaps one where we just have a "struct string"
abstract type and string operations take a "struct string*" as an
argument. We can even support string literals:

#include <stddef.h>   /* for size_t */

struct string {
    size_t len;
    unsigned char* data;
};

#define string_lit(s) (&(struct string) { \
    .len = sizeof(s) - 1, \
    .data = (unsigned char[sizeof(s) - 1]){ s } \
})
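
Usage then looks like this, given the definitions above (a sketch;
string_print is a made-up helper, not part of the proposal):

#include <stdio.h>

static void string_print(const struct string *s)
{
    fwrite(s->data, 1, s->len, stdout);   /* exactly len bytes, no terminator */
}

int main(void)
{
    struct string *greeting = string_lit("hyvää päivää");  /* len counts UTF-8 bytes */
    string_print(greeting);
    putchar('\n');
    return 0;
}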

So, if we program in a modern style and don't use the standard library
string operations directly, I'd say we no longer have a very good
reason to use "char" anywhere in the application code.


Lauri

John Doe

4/8/2011 7:11:00 PM


On Fri, 08 Apr 2011 17:04:14 +0000, Lauri Alanko wrote:

> Firstly, I'm talking specifically about the type "char". The
> types "signed char" and "unsigned char" are perfectly
> useful (though idiosyncratically named) integer types for
> operating on the smallest addressable memory
> units (a.k.a. bytes).
>
> The type "char" is distinct from these, and it is strictly less
> useful as an integer (due to its implementation-specific
> signedness). So the only justification for it that I can see is
> that it serves as a semantic annotation: a char is a byte that is
> intended to be interpreted as a character in the basic execution
> character set.

It has another justification: efficiency. Most of the time, you don't
actually care whether char is signed or unsigned, or even integral for
that matter.

> The standard library uses the type "char *" for its
> representation of strings:

> In any case, we are not forced to use the standard library as
> such.

You are if you want to interface to the OS. Aside from the ANSI/ISO C
functions, the Unix API uses char* extensively (Windows, OTOH, uses wide
characters).
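
For example, a UTF-8 file name goes straight through as char* on POSIX
(a sketch, with a made-up file name):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* POSIX takes path names as char*; these UTF-8 bytes ("päiväkirja.txt")
       reach the kernel unmodified. */
    int fd = open("p\xc3\xa4iv\xc3\xa4kirja.txt", O_RDONLY);
    if (fd != -1)
        close(fd);
    return 0;
}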

Keith Thompson

4/8/2011 7:56:00 PM


Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> Lauri Alanko <la@iki.fi> writes:
>> I'm beginning to wonder if I should use the char type at all any
>> more.
> <snip>
>> [...] if we are dealing with an encoding-specific
>> representation, e.g. an ASCII string or UTF-8 string or whatever,
>> then we'd better deal with it as pure binary data, and that is
>> more natural to represent as a sequence of unsigned char or
>> uint8_t.
>
> For UTF-8, that is only true for code that pokes about in the
> representation. Most code will function perfectly well treating UTF-8
> encoded strings as char arrays.
>
> <snip>
>> But nowadays Unicode is everywhere, and the de facto standard
>> encoding is UTF-8. The char type won't cut it for characters any
>> more.
>
> I feel this is a generalisation from a specific issue -- that of
> manipulating the representation. Can you say why, in general, char
> won't cut it for UTF-8 encoded strings?

In principle, if plain char is signed (and let's assume CHAR_BIT==8),
then converting an octet with a value exceeding 127 to plain char yields
an implementation-defined result, and interpreting such an octet as a
plain char may not give you the value you expect. It's even conceivable
that round-trip conversions might lose information (if plain char is
signed and has distinct representations for +0 and -0).
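
Concretely (a sketch; on a typical implementation where plain char is
signed and two's complement, c comes out as -61, and converting back
through unsigned char recovers 0xC3):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    unsigned char octet = 0xC3;   /* first byte of UTF-8 "ä", value 195 */
    char c = octet;               /* implementation-defined result if char is
                                     signed, since 195 > CHAR_MAX */
    printf("CHAR_MIN=%d c=%d back=%d\n",
           CHAR_MIN, (int)c, (int)(unsigned char)c);
    return 0;
}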

In practice, it all Just Works on any system you're likely to
encounter; the only plausible exceptions are embedded systems
that aren't likely to be dealing with this kind of data anyway.
If plain char is represented in 2's-complement, and if conversions
between corresponding signed and unsigned types simply reinterpret
the representation, then things work as expected. And any vendor
who introduced a system that violates these assumptions (without
some overwhelming reason for doing so) will probably go out of
business while citing the section of the Standard that says their
implementation is conforming.

[...]

If I were designing C from scratch today, I'd probably at least require
plain char to be unsigned (and INT_MAX > UCHAR_MAX) just to avoid these
potential issues.

--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.ne...
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Bartc

4/8/2011 8:05:00 PM




"Lauri Alanko" <la@iki.fi> wrote in message
news:innf6e$5v5$1@oravannahka.helsinki.fi...
> Some clarifications.
>
> Firstly, I'm talking specifically about the type "char". The
> types "signed char" and "unsigned char" are perfectly
> useful (though idiosyncratically named) integer types for
> operating on the smallest addressable memory
> units (a.k.a. bytes).

If you mean don't use a 'char' type where its signedness is unknown,
then I'd agree. Too many things will not work properly when the signedness
of char is opposite to what is assumed.

However, I don't believe the 'char' type in C is anything specifically to do
with characters; it should really have been signed and unsigned byte, i.e.
just a small integer (and the smallest addressable unit).

To represent characters, an unsigned byte is most appropriate, and it can
even work well for a lot of Unicode stuff, either by using UTF-8, or just
sticking to the first 256 codes.
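
For example, sticking to the first 256 codes (Latin-1), the byte value
simply is the Unicode code point (a sketch):

#include <stdio.h>

int main(void)
{
    /* "hyvää" encoded in Latin-1: the unsigned byte is the code point. */
    const unsigned char latin1[] = { 'h', 'y', 'v', 0xE4, 0xE4 };
    for (size_t i = 0; i < sizeof latin1; i++)
        printf("U+%04X ", (unsigned)latin1[i]);   /* U+0068 U+0079 U+0076 U+00E4 U+00E4 */
    putchar('\n');
    return 0;
}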

> But I'm saying that nowadays the basic execution character set no
> longer suffices for general-purpose text manipulation. So
> wherever you need to manipulate an individual character as a
> character, you'd better use wchar_t or similar.

To do unicode properly, I don't think it's just a question of using a
slightly wider type; you'll probably be using extra libraries anyway.

But I would guess that a huge amount of code in C works quite happily using
8-bit characters.

--
Bartc