Asp Forum - How does unicode() work?

Robert Latest

1/9/2008 12:34:00 PM

Here's a test snippet...

import sys
for k in sys.stdin:
    print '%s -> %s' % (k, k.decode('iso-8859-1'))

...but it barfs when actually fed with iso8859-1 characters. How is this
done right?

robert

7 Answers

Robert Latest

1/9/2008 12:42:00 PM

Robert Latest wrote:
> ...but it barfs when actually fed with iso8859-1 characters.

Specifically, it says:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 0:
ordinal not in range(128)

which doesn't make sense to me, because I specifically asked for the
iso8859-1 decoder, not the 'ascii' one.

robert

Fredrik Lundh

1/9/2008 12:45:00 PM

Robert Latest wrote:

> Here's a test snippet...
>
> import sys
> for k in sys.stdin:
>     print '%s -> %s' % (k, k.decode('iso-8859-1'))
>
> ...but it barfs when actually fed with iso8859-1 characters. How is this
> done right?

it's '%s -> %s' % (byte string, unicode string) that barfs. try doing

import sys
for k in sys.stdin:
    print '%s -> %s' % (repr(k), k.decode('iso-8859-1'))

instead, to see what's going on.
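
For readers running a current interpreter, here is a minimal sketch of the same loop in Python 3, where byte strings and text are separate types and nothing is converted implicitly. The name fake_stdin and the sample words are assumptions made for illustration; a real script would read bytes from sys.stdin.buffer instead.

```python
import io

# Simulated input: two lines encoded as ISO-8859-1 (a stand-in for stdin).
fake_stdin = io.BytesIO("sch\xf6n\ngr\xfc\xdf\n".encode("iso-8859-1"))

lines = []
for k in fake_stdin:
    # repr() of the raw bytes is pure ASCII, so combining it with the
    # decoded text cannot trigger any hidden conversion.
    decoded = k.decode("iso-8859-1").rstrip("\n")
    lines.append("%s -> %s" % (repr(k), decoded))

# Show the results in an ASCII-safe form regardless of terminal encoding.
for line in lines:
    print(ascii(line))
```

Each entry pairs the escaped raw bytes with the decoded text, which makes it easy to see what was actually read.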

</F>

Carsten Haese

1/9/2008 2:15:00 PM

On Wed, 2008-01-09 at 13:44 +0100, Fredrik Lundh wrote:
> Robert Latest wrote:
>
> > Here's a test snippet...
> >
> > import sys
> > for k in sys.stdin:
> >     print '%s -> %s' % (k, k.decode('iso-8859-1'))
> >
> > ...but it barfs when actually fed with iso8859-1 characters. How is this
> > done right?
>
> it's '%s -> %s' % (byte string, unicode string) that barfs. try doing
>
> import sys
> for k in sys.stdin:
>     print '%s -> %s' % (repr(k), k.decode('iso-8859-1'))
>
> instead, to see what's going on.

If that really is the line that barfs, wouldn't it make more sense to
repr() the unicode object in the second position?

import sys
for k in sys.stdin:
    print '%s -> %s' % (k, repr(k.decode('iso-8859-1')))

Also, I'm not sure if the OP has told us the truth about his code and/or
his error message. The implicit str() call done by formatting a unicode
object with %s would raise a UnicodeEncodeError, not the
UnicodeDecodeError that the OP is reporting. So either I need more
coffee or there is something else going on here that hasn't come to
light yet.
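
The two exception types are easy to tell apart in isolation. A minimal sketch, written for Python 3 (where each conversion has to be spelled out, unlike the Python 2 code in this thread); the variable names are illustrative only:

```python
# Python 3 sketch: which direction of conversion raises which error.
text = "\xf6"   # the character o-umlaut as text (a unicode object in Python 2)
raw = b"\xf6"   # the same character as a single ISO-8859-1 byte

try:
    text.encode("ascii")        # text -> bytes
except UnicodeEncodeError as exc:
    encode_error = exc

try:
    raw.decode("ascii")         # bytes -> text
except UnicodeDecodeError as exc:
    decode_error = exc

print(type(encode_error).__name__)
print(type(decode_error).__name__)
```

So a UnicodeDecodeError in the original report suggests that something was converting a byte string to Unicode, not the other way round.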

--
Carsten Haese
http://informixdb.sourc...

Fredrik Lundh

1/9/2008 2:34:00 PM

Carsten Haese wrote:

> If that really is the line that barfs, wouldn't it make more sense to
> repr() the unicode object in the second position?
>
> import sys
> for k in sys.stdin:
>     print '%s -> %s' % (k, repr(k.decode('iso-8859-1')))
>
> Also, I'm not sure if the OP has told us the truth about his code and/or
> his error message. The implicit str() call done by formatting a unicode
> object with %s would raise a UnicodeEncodeError, not the
> UnicodeDecodeError that the OP is reporting. So either I need more
> coffee or there is something else going on here that hasn't come to
> light yet.

When mixing Unicode with byte strings, Python attempts to decode the
byte string, not encode the Unicode string.

In this case, Python first inserts the non-ASCII byte string in "%s ->
%s" and gets a byte string. It then attempts to insert the non-ASCII
Unicode string, and realizes that it has to convert the (partially
built) target string to Unicode for that to work. Which results in a
*UnicodeDecodeError*.

>>> "%s -> %s" % ("åäö", "åäö")
'\x86\x84\x94 -> \x86\x84\x94'

>>> "%s -> %s" % (u"åäö", u"åäö")
u'\xe5\xe4\xf6 -> \xe5\xe4\xf6'

>>> "%s -> %s" % ("åäö", u"åäö")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x86 ...

(the actual implementation differs a bit from the description above, but
the behaviour is identical).
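
The promotion step can be imitated by hand. This is only a sketch under stated assumptions: it is written for Python 3, which never promotes implicitly, so the helper promote() is a hypothetical stand-in for what Python 2 does behind the scenes with its default 'ascii' codec.

```python
# Python 3 sketch of the implicit promotion described above.
def promote(byte_string):
    # Hypothetical helper: Python 2 performs this step silently,
    # using the default 'ascii' codec.
    return byte_string.decode("ascii")

partial = b"\xf6 -> "    # target string after the byte string was inserted
unicode_part = "\xf6"    # the decoded value still waiting to be inserted

try:
    result = promote(partial) + unicode_part
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)

# Decoding with the real source encoding first avoids the ASCII step.
fixed = partial.decode("iso-8859-1") + unicode_part
```

The message names the 'ascii' codec even though the caller only ever asked for iso-8859-1, which is the confusing part of the original report.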

</F>

Carsten Haese

1/9/2008 2:56:00 PM

On Wed, 2008-01-09 at 15:33 +0100, Fredrik Lundh wrote:
> When mixing Unicode with byte strings, Python attempts to decode the
> byte string, not encode the Unicode string.

Ah, I did not realize that. I never mix Unicode and byte strings in the
first place, and now I know why. Thanks for clearing that up.

--
Carsten Haese
http://informixdb.sourc...

John Machin

1/9/2008 6:25:00 PM

On Jan 10, 1:55 am, Carsten Haese <cars...@uniqsys.com> wrote:
> On Wed, 2008-01-09 at 15:33 +0100, Fredrik Lundh wrote:
> > When mixing Unicode with byte strings, Python attempts to decode the
> > byte string, not encode the Unicode string.
>
> Ah, I did not realize that. I never mix Unicode and byte strings in the
> first place, and now I know why. Thanks for clearing that up.
>

When mixing unicode strings with byte strings, Python attempts to
decode the str object to unicode, not encode the unicode object to
str. This is fine, especially when compared with the alternative, so
long as the str object is (loosely) ASCII. If the str object contains
a byte such that ord(byte) > 127, an exception will be raised.

When mixing floats with ints, Python attempts to decode the int to
float, not encode the float to int. This is fine, especially when
compared with the alternative, so long as the int is not humungous. If
the int is huge, you will lose precision without any warning or any
exception being raised.

Do you avoid mixing ints and floats?
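
A small sketch of the silent precision loss, assuming the usual 64-bit IEEE 754 floats that CPython uses on common platforms; the variable names are illustrative:

```python
# An int just above 2**53 cannot be represented exactly as a 64-bit float.
big = 2**53 + 1            # 9007199254740993

mixed = big + 0.0          # the int operand is converted to float first
round_trip = int(mixed)    # the lowest bit has been rounded away

print(big, round_trip, big == round_trip)
```

No warning or exception is raised; the two integers simply differ by one.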

Robert Latest

1/9/2008 8:02:00 PM

John Machin wrote:

> When mixing unicode strings with byte strings, Python attempts to
> decode the str object to unicode, not encode the unicode object to
> str.

Thanks for the explanation. Of course I didn't want to mix Unicode and Latin
in one string, my snippet just tried to illustrate the point. I'm new to
Python -- I came from C, and C gives a rat's ass about encoding. It just
dumps bytes and that's that.

robert

comp.lang.python
