Asp Forum - String is ASCII or UTF-8?

Christopher Benson-Manica

3/9/2010 4:55:00 PM

Hours of Googling has not helped me resolve a seemingly simple
question - Given a string s, how can I tell whether it's ascii (and
thus 1 byte per character) or UTF-8 (and two bytes per character)?
This is python 2.4.3, so I don't have getsizeof available to me.

11 Answers

Alf P. Steinbach

3/9/2010 5:02:00 PM

* C. Benson Manica:
> Hours of Googling has not helped me resolve a seemingly simple
> question - Given a string s, how can I tell whether it's ascii (and
> thus 1 byte per character) or UTF-8 (and two bytes per character)?
> This is python 2.4.3, so I don't have getsizeof available to me.

Generally, if you need 100% certainty then you can't tell the encoding from a
sequence of byte values.

However, if you know that it's EITHER ascii or utf-8 then the presence of any
value above 127 (or, for signed byte values, any negative values), tells you
that it can't be ascii, hence, must be utf-8. And since utf-8 is an extension of
ascii nothing is lost by assuming ascii in the other case. So, problem solved.

If the string represents the contents of a file then you may also look for a
UTF-8 representation of the Unicode BOM (Byte Order Mark) at the beginning. If
found then it almost surely indicates utf-8, and more expensive searching can
be avoided. It's just three bytes to check.
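
A minimal sketch of the two checks described above, written for Python 3, where raw data is `bytes` (under Python 2.4 a plain `str` plays that role). The function names `has_high_bytes` and `starts_with_utf8_bom` are made up for illustration; `codecs.BOM_UTF8` is the standard-library constant holding the three BOM bytes.

```python
import codecs

def has_high_bytes(data):
    """Return True if any byte value is above 127, i.e. the data is not pure ASCII."""
    return any(b > 127 for b in bytearray(data))

def starts_with_utf8_bom(data):
    """Return True if the data begins with the 3-byte UTF-8 encoding of U+FEFF."""
    return data.startswith(codecs.BOM_UTF8)

plain = b"hello"                          # ASCII only
accented = u"caf\u00e9".encode("utf-8")   # contains a 2-byte UTF-8 sequence
with_bom = codecs.BOM_UTF8 + plain        # as some editors write files
```

Note that the high-byte test only separates ASCII from UTF-8 if you already know the data is one of the two; it says nothing about other 8-bit encodings.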

Cheers & hth.,

- Alf

Tim Golden

3/9/2010 5:08:00 PM

On 09/03/2010 16:54, C. Benson Manica wrote:
> Hours of Googling has not helped me resolve a seemingly simple
> question - Given a string s, how can I tell whether it's ascii (and
> thus 1 byte per character) or UTF-8 (and two bytes per character)?
> This is python 2.4.3, so I don't have getsizeof available to me.

You can't. You can apply one or more heuristics, depending on exactly
what your requirement is. But any valid ASCII text is also valid
UTF8-encoded text since UTF-8 isn't "two bytes per char" but a variable
number of bytes per char.

Obviously, you can test whether all the bytes are less than 128 which
suggests that the text is legal ASCII. But then it's also legal UTF8.
Or you can just attempt to decode and catch the exception:

    try:
        unicode(text, "ascii")
    except UnicodeDecodeError:
        print "Not ASCII"
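
The same decode-and-catch idea can be extended to try UTF-8 as a second step. This is only a sketch, written for Python 3 (`bytes.decode`); the function name `classify` and its three labels are assumptions for illustration. A result of "utf-8" means the bytes are valid UTF-8 and not pure ASCII, not proof of what the sender intended.

```python
def classify(data):
    """Return 'ascii', 'utf-8', or 'other' for a byte string.

    Tries the stricter encoding first, since every ASCII string
    is also valid UTF-8.
    """
    for encoding in ("ascii", "utf-8"):
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return "other"
```

For example, `classify(b"abc")` gives `'ascii'`, while a lone byte `b"\xe9"` is neither ASCII nor valid UTF-8 and gives `'other'`.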

TJG

Stef Mientki

3/9/2010 5:13:00 PM

On 09-03-2010 18:02, Alf P. Steinbach wrote:
> * C. Benson Manica:
>> Hours of Googling has not helped me resolve a seemingly simple
>> question - Given a string s, how can I tell whether it's ascii (and
>> thus 1 byte per character) or UTF-8 (and two bytes per character)?
>> This is python 2.4.3, so I don't have getsizeof available to me.
>
> Generally, if you need 100% certainty then you can't tell the encoding
> from a sequence of byte values.
>
> However, if you know that it's EITHER ascii or utf-8 then the presence
> of any value above 127 (or, for signed byte values, any negative
> values), tells you that it can't be ascii,
AFAIK it's completely impossible.
UTF-8 characters have 1 to 4 bytes / character.
I can create ASCII strings containing byte values between 127 and 255.

cheers,
Stef

> hence, must be utf-8. And since utf-8 is an extension of ascii nothing
> is lost by assuming ascii in the other case. So, problem solved.
>
> If the string represents the contents of a file then you may also look
> for a UTF-8 representation of the Unicode BOM (Byte Order Mark) at the
> beginning. If found then it almost surely indicates utf-8, and more
> expensive searching can be avoided. It's just three bytes to check.
>
>
> Cheers & hth.,
>
> - Alf

Christopher Benson-Manica

3/9/2010 5:18:00 PM

On Mar 9, 12:07 pm, Tim Golden <m...@timgolden.me.uk> wrote:

> You can't. You can apply one or more heuristics, depending on exactly
> what your requirement is. But any valid ASCII text is also valid
> UTF8-encoded text since UTF-8 isn't "two bytes per char" but a variable
> number of bytes per char.

Hm, well that's very unfortunate. I'm using a database library which
seems to assume that all strings passed to it are ASCII, and I'm
attempting to use it on two different systems - one where all strings
are ASCII, and one where they seem to be UTF-8. The strings come from
the same place, i.e. they're exclusively normal ASCII characters.
What I would want is to check once whether the strings passed to
function foo() are ASCII or UTF-8, and if they are UTF-8, to assume that
all strings need to be decoded. So that's not possible?

Richard Brodie

3/9/2010 5:25:00 PM

"C. Benson Manica" <cbmanica@gmail.com> wrote in message
news:98375575-1071-46af-8ebc-f3c817b47e1d@q23g2000yqd.googlegroups.com...

>The strings come from the same place, i.e. they're exclusively
> normal ASCII characters.

In this case then converting them to/from UTF-8 is a no-op, so
it makes no difference at all.
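
A small illustration of that point, written for Python 3; the variable names are arbitrary. For text made only of ASCII characters, encoding as ASCII and encoding as UTF-8 should produce identical bytes, one byte per character.

```python
text = u"normal ASCII text"

as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")

# Identical byte sequences: UTF-8 encodes code points 0..127 as single bytes.
same_bytes = (as_ascii == as_utf8)

# Decoding the ASCII bytes as UTF-8 gives the original text back.
round_trip = as_ascii.decode("utf-8")
```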

Christopher Benson-Manica

3/9/2010 5:32:00 PM

On Mar 9, 12:24 pm, "Richard Brodie" <R.Bro...@rl.ac.uk> wrote:
> "C. Benson Manica" <cbman...@gmail.com> wrote in message
> news:98375575-1071-46af-8ebc-f3c817b47e1d@q23g2000yqd.googlegroups.com...
>
> >The strings come from the same place, i.e. they're exclusively
> > normal ASCII characters.
>
> In this case then converting them to/from UTF-8 is a no-op, so
> it makes no difference at all.

Except to the database library, which seems perfectly happy to send an
8-character UTF-8 string to the database as 16 raw characters...

Robert Kern

3/9/2010 5:37:00 PM

On 2010-03-09 11:12 AM, Stef Mientki wrote:
> On 09-03-2010 18:02, Alf P. Steinbach wrote:
>> * C. Benson Manica:
>>> Hours of Googling has not helped me resolve a seemingly simple
>>> question - Given a string s, how can I tell whether it's ascii (and
>>> thus 1 byte per character) or UTF-8 (and two bytes per character)?
>>> This is python 2.4.3, so I don't have getsizeof available to me.
>>
>> Generally, if you need 100% certainty then you can't tell the encoding
>> from a sequence of byte values.
>>
>> However, if you know that it's EITHER ascii or utf-8 then the presence
>> of any value above 127 (or, for signed byte values, any negative
>> values), tells you that it can't be ascii,
> AFAIK it's completely impossible.
> UTF-8 characters have 1 to 4 bytes / character.
> I can create ASCII strings containing byte values between 127 and 255.

No, you can't. ASCII strings only have characters in the range 0..127. You could
create Latin-1 (or any number of the 8-bit encodings out there) strings with
characters 0..255, yes, but not ASCII.
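
To make the distinction concrete, here is a sketch in Python 3 (the helper `decodes_as` is a made-up name). A Latin-1 byte string containing the value 233 is rejected by the ASCII codec, and in this case by the UTF-8 codec as well, but decodes fine as Latin-1.

```python
latin1_bytes = u"caf\u00e9".encode("latin-1")  # last byte is 0xE9 (233)

def decodes_as(data, encoding):
    """Return True if the byte string decodes cleanly with the given codec."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

highest = max(bytearray(latin1_bytes))  # largest byte value in the string
```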

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Terry Reedy

3/9/2010 5:59:00 PM

On 3/9/2010 11:54 AM, C. Benson Manica wrote:
> Hours of Googling has not helped me resolve a seemingly simple
> question - Given a string s, how can I tell whether it's ascii (and
> thus 1 byte per character) or UTF-8 (and two bytes per character)?

Utf-8 is an encoding that uses 1 to 4 bytes per character.
So it is not clear what you are asking. Alf answered one of the possible
questions.
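
The variable width is easy to see by encoding one character from each size class. This is a Python 3 sketch; the sample characters are my own choice, not from the thread.

```python
# One character per UTF-8 length class:
# U+0041 'A', U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji)
samples = [u"A", u"\u00e9", u"\u20ac", u"\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
byte_counts = [len(ch.encode("utf-8")) for ch in samples]
# byte_counts == [1, 2, 3, 4]
```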

> This is python 2.4.3, so I don't have getsizeof available to me.

Roel Schroeven

3/9/2010 6:13:00 PM

Op 2010-03-09 18:31, C. Benson Manica schreef:
> On Mar 9, 12:24 pm, "Richard Brodie" <R.Bro...@rl.ac.uk> wrote:
>> "C. Benson Manica" <cbman...@gmail.com> wrote in message
>> news:98375575-1071-46af-8ebc-f3c817b47e1d@q23g2000yqd.googlegroups.com...
>>
>>> The strings come from the same place, i.e. they're exclusively
>>> normal ASCII characters.
>>
>> In this case then converting them to/from UTF-8 is a no-op, so
>> it makes no difference at all.
>
> Except to the database library, which seems perfectly happy to send an
> 8-character UTF-8 string to the database as 16 raw characters...

In that case I think you mean UTF-16 or UCS-2 instead of UTF-8. UTF-16
uses 2 or more bytes per character, UCS-2 always uses 2 bytes per
character. UTF-8 uses 1 or more bytes per character.

If your texts are in a Western language, the second byte will be zero in
most characters; you could check for that (but note that the second byte
might be the first one in the byte stream, depending on the byte ordering).
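
A rough sketch of that zero-byte check in Python 3. The function name `looks_like_utf16_ascii` is hypothetical, and this is only a heuristic: it assumes the text is all ASCII-range characters and that any leading BOM has already been stripped.

```python
def looks_like_utf16_ascii(data):
    """Heuristic: even length and every other byte zero, in either byte order."""
    raw = bytearray(data)
    if len(raw) == 0 or len(raw) % 2 != 0:
        return False
    evens_zero = all(b == 0 for b in raw[0::2])  # big-endian: zero byte comes first
    odds_zero = all(b == 0 for b in raw[1::2])   # little-endian: zero byte comes second
    return evens_zero or odds_zero

text = u"abcdefgh"                    # 8 characters
utf8 = text.encode("utf-8")           # 8 bytes
utf16_le = text.encode("utf-16-le")   # 16 bytes, no BOM
utf16_be = text.encode("utf-16-be")   # 16 bytes, no BOM
```

This matches the symptom described: an 8-character string that arrives as 16 bytes is consistent with UTF-16 or UCS-2, not UTF-8.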

HTH,
Roel

--
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
-- Isaac Asimov

Martin v. Loewis

3/9/2010 8:45:00 PM

> I can create ASCII strings containing byte values between 127 and 255.

No, you can't - or what you create wouldn't be an ASCII string, by
definition of ASCII.

Regards,
Martin

comp.lang.python