different encodings for unicode() and u''.encode(), bug?

mario

1/2/2008 8:25:00 AM

Hello!

I stumbled on this situation: if I decode some string (below, just
the empty string) using the mcbs encoding, it succeeds, but if I
try to encode it back with the same encoding, it surprisingly fails
with a LookupError. This seems like something to be corrected?

$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = ''
>>> unicode(s, 'mcbs')
u''
>>> unicode(s, 'mcbs').encode('mcbs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mcbs

Best wishes to everyone for 2008!

mario

13 Answers

Martin v. Loewis

1/2/2008 8:30:00 AM

> I stumbled on this situation: if I decode some string (below, just
> the empty string) using the mcbs encoding, it succeeds, but if I
> try to encode it back with the same encoding, it surprisingly fails
> with a LookupError. This seems like something to be corrected?

Indeed - in your code. It's not the same encoding.

> >>> unicode(s, 'mcbs')
> u''
> >>> unicode(s, 'mcbs').encode('mcbs')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> LookupError: unknown encoding: mcbs

Use "mbcs" in the second call, not "mcbs".

HTH,
Martin

mario

1/2/2008 8:45:00 AM

On Jan 2, 9:30 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:

> Use "mbcs" in the second call, not "mcbs".

Oops, sorry about that; when I switched to test it in the interpreter
I mistyped "mbcs" as "mcbs". But note that I did it consistently ;-)
I.e. it was still the same encoding, even if maybe non-existent...?

If I try again using "mbcs" consistently, I still get the same error:

$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('', 'mbcs')
u''
>>> unicode('', 'mbcs').encode('mbcs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mbcs
>>>

mario

John Machin

1/2/2008 9:45:00 AM

On Jan 2, 7:45 pm, mario <ma...@ruggier.org> wrote:
> On Jan 2, 9:30 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
>
> > Use "mbcs" in the second call, not "mcbs".
>
> Oops, sorry about that; when I switched to test it in the interpreter
> I mistyped "mbcs" as "mcbs". But note that I did it consistently ;-)
> I.e. it was still the same encoding, even if maybe non-existent...?
>
> If I try again using "mbcs" consistently, I still get the same error:
>
> $ python
> Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
> [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> unicode('', 'mbcs')
> u''
> >>> unicode('', 'mbcs').encode('mbcs')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> LookupError: unknown encoding: mbcs

Two things for you to do:

(1) Try these at the Python interactive prompt:

unicode('', 'latin1')
unicode('', 'mbcs')
unicode('', 'raboof')
unicode('abc', 'latin1')
unicode('abc', 'mbcs')
unicode('abc', 'raboof')

(2) Read what the manual (Library Reference -> codecs module ->
standard encodings) has to say about mbcs.

John Machin

1/2/2008 10:47:00 AM

On Jan 2, 8:44 pm, John Machin <sjmac...@lexicon.net> wrote:

> (1) Try these at the Python interactive prompt:
>
> unicode('', 'latin1')

Also use those 6 cases to check out the difference in behaviour
between unicode(x, y) and x.decode(y)

mario

1/2/2008 10:57:00 AM

On Jan 2, 10:44 am, John Machin <sjmac...@lexicon.net> wrote:
>
> Two things for you to do:
>
> (1) Try these at the Python interactive prompt:
>
> unicode('', 'latin1')
> unicode('', 'mbcs')
> unicode('', 'raboof')
> unicode('abc', 'latin1')
> unicode('abc', 'mbcs')
> unicode('abc', 'raboof')

$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('', 'mbcs')
u''
>>> unicode('abc', 'mbcs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mbcs
>>>

Hmmn, strange. Same behaviour for "raboof".

> (2) Read what the manual (Library Reference -> codecs module ->
> standard encodings) has to say about mbcs.

Page at http://docs.python.org/lib/standard-enco... says that
mbcs "purpose":
Windows only: Encode operand according to the ANSI codepage (CP_ACP)

Do not know what the implications of encoding according to "ANSI
codepage (CP_ACP)" are. Windows only seems clear, but why does it only
complain when decoding a non-empty string (or when encoding the empty
unicode string) ?

mario

John Machin

1/2/2008 11:29:00 AM

On Jan 2, 9:57 pm, mario <ma...@ruggier.org> wrote:
> On Jan 2, 10:44 am, John Machin <sjmac...@lexicon.net> wrote:
>
> > Two things for you to do:
>
> > (1) Try these at the Python interactive prompt:
>
> > unicode('', 'latin1')
> > unicode('', 'mbcs')
> > unicode('', 'raboof')
> > unicode('abc', 'latin1')
> > unicode('abc', 'mbcs')
> > unicode('abc', 'raboof')
>
> $ python
> Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
> [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> unicode('', 'mbcs')
> u''
> >>> unicode('abc', 'mbcs')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> LookupError: unknown encoding: mbcs
>
> Hmmn, strange. Same behaviour for "raboof".
>
> > (2) Read what the manual (Library Reference -> codecs module ->
> > standard encodings) has to say about mbcs.
>
> Page at http://docs.python.org/lib/standard-encoding... says that
> mbcs "purpose":
> Windows only: Encode operand according to the ANSI codepage (CP_ACP)
>
> Do not know what the implications of encoding according to "ANSI
> codepage (CP_ACP)" are.

Neither do I. YAGNI (especially on darwin) so don't lose any sleep
over it.

> Windows only seems clear, but why does it only
> complain when decoding a non-empty string (or when encoding the empty
> unicode string) ?

My presumption: because it doesn't need a codec to decode '' into u'';
no failed codec look-up, so no complaint. Any realistic app will try
to decode a non-empty string sooner or later.
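
One way to avoid being misled by the empty-string case is to ask the codec registry directly whether a name is known, before decoding anything. The following is only a minimal sketch: it assumes Python 3 rather than the 2.5 interpreter shown in this thread, and the helper name `encoding_is_known` is hypothetical.

```python
import codecs


def encoding_is_known(name):
    """Return True if the codec registry can resolve `name` on this platform."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# "latin1" is a standard codec; "mcbs" and "raboof" are made-up names.
for name in ("latin1", "mcbs", "raboof"):
    print(name, encoding_is_known(name))
```

Because the check never touches the data, it gives the same answer for '' as for 'abc'. Whether "mbcs" itself is reported as known depends on the platform.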

mario

1/2/2008 12:16:00 PM

On Jan 2, 12:28 pm, John Machin <sjmac...@lexicon.net> wrote:
> On Jan 2, 9:57 pm, mario <ma...@ruggier.org> wrote:
>
> > Do not know what the implications of encoding according to "ANSI
> > codepage (CP_ACP)" are.
>
> Neither do I. YAGNI (especially on darwin) so don't lose any sleep
> over it.
>
> > Windows only seems clear, but why does it only
> > complain when decoding a non-empty string (or when encoding the empty
> > unicode string) ?
>
> My presumption: because it doesn't need a codec to decode '' into u'';
> no failed codec look-up, so no complaint. Any realistic app will try
> to decode a non-empty string sooner or later.

Yes, I suspect I will never need it ;)

Incidentally, the situation is that in a script that tries to guess a
file's encoding, it bombed on the file ".svn/empty-file" -- but why it
was going so far with an empty string was really due to a bug
elsewhere in the script, trivially fixed. Still, I was curious about
this non-symmetric behaviour for the empty string by some encodings.

Anyhow, thanks a lot to both of you for the great feedback!

mario

Piet van Oostrum

1/2/2008 1:26:00 PM

>>>>> mario <mario@ruggier.org> (M) wrote:

>M> $ python
>M> Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
>M> [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
>M> Type "help", "copyright", "credits" or "license" for more information.
>M> >>> unicode('', 'mbcs')
>M> u''
>M> >>> unicode('abc', 'mbcs')
>M> Traceback (most recent call last):
>M>   File "<stdin>", line 1, in <module>
>M> LookupError: unknown encoding: mbcs
>M> >>>

>M> Hmmn, strange. Same behaviour for "raboof".

Apparently for the empty string the encoding is irrelevant as it will not
be used. I guess there is an early check for this special case in the code.
--
Piet van Oostrum <piet@cs.uu.nl>
URL: http://pietvano... [PGP 8DAE142BE17999C4]
Private email: piet@vanoostrum.org

Martin v. Loewis

1/2/2008 8:49:00 PM

> Do not know what the implications of encoding according to "ANSI
> codepage (CP_ACP)" are. Windows only seems clear, but why does it only
> complain when decoding a non-empty string (or when encoding the empty
> unicode string) ?

It has no implications for this issue here. CP_ACP is a Microsoft
invention of a specific encoding alias - the "ANSI code page"
(as Microsoft calls it) is not a specific encoding where I could
specify a mapping from bytes to characters, but instead a
system-global indirection based on a language default. For example,
in the Western-European/U.S. version of Windows, the default for
CP_ACP is cp1252 (local installation may change that default,
system-wide).

The issue likely has the cause that Piet also guessed: If the
input is an empty string, no attempt to actually perform an
encoding is done, but the output is assumed to be an empty
string again. This is correct behavior for all codecs that Python
supports in its default installation, at least for the direction
bytes->unicode. For the reverse direction, such an optimization
would be incorrect; consider u"".encode("utf-16").
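
The utf-16 case can be checked directly. This is a small sketch assuming Python 3, where the expression corresponding to u"".encode("utf-16") is "".encode("utf-16"); the result is not empty because the codec emits a byte order mark, in the machine's native byte order.

```python
import codecs

# Encoding the empty string with utf-16 still produces a byte order mark.
encoded = "".encode("utf-16")
print(repr(encoded))

# The BOM is little- or big-endian depending on the platform.
print(encoded in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# Decoding the BOM alone gives back the empty string.
print(encoded.decode("utf-16") == "")
```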

HTH,
Martin

mario

1/3/2008 9:03:00 PM

On Jan 2, 2:25 pm, Piet van Oostrum <p...@cs.uu.nl> wrote:

> Apparently for the empty string the encoding is irrelevant as it will not
> be used. I guess there is an early check for this special case in the code.

In the module I am working on [*] I am remembering a failed encoding
to allow me, if necessary, to later re-process fewer encodings. In the
case of an empty string AND an unknown encoding this strategy
failed...

Anyhow, the question is, should the behaviour be the same for these
operations, and if so what should it be:

u"".encode("non-existent")
unicode("", "non-existent")
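
If the two operations are wanted to behave the same, one option is to route both through an explicit codec lookup, so that an unknown name fails regardless of the input length. This is a sketch under assumptions, not a statement of how Python itself should behave: it targets Python 3, and the helper names `strict_decode` and `strict_encode` are hypothetical.

```python
import codecs


def strict_decode(data, encoding, errors="strict"):
    """Decode bytes, failing on an unknown encoding even for empty input."""
    codecs.lookup(encoding)  # raises LookupError for unknown names
    return data.decode(encoding, errors)


def strict_encode(text, encoding, errors="strict"):
    """Encode text, failing on an unknown encoding even for empty input."""
    codecs.lookup(encoding)  # same up-front check as strict_decode
    return text.encode(encoding, errors)


# Both directions now report the unknown name for empty input.
for func, empty in ((strict_decode, b""), (strict_encode, "")):
    try:
        func(empty, "non-existent")
    except LookupError as exc:
        print(func.__name__, "->", exc)
```

With a wrapper like this, a script that remembers failed encodings sees the same LookupError for an empty file as for any other.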

mario

[*] a module to decode heuristically, that imho is actually starting
to look quite good, it is at http://gizmojo.org/cod... and any
comments very welcome.

comp.lang.python
