comp.lang.python

Re: Mysterious xml.sax Encoding Exception

Stefan Behnel

2/2/2008 4:17:00 PM

Peck, Jon wrote:
> Yes, the characters were from the 0-127 ascii block but encoded as utf-16, so there is a null byte with each nonzero character. I.e., \x00?\x00x\x00m\x00l\x00
>
> Here is something weird I found while experimenting with ElementTree with this same XML string.
>
> Consider the same XML as a Python Unicode string, so it is actually encoded as utf-16 and as a string containing utf-16 bytes. That is
> u'<?xml version="1.0" encoding="UTF-16" st' ...
> or
> '\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00'...
>
> So if these are x and y
> y = x.encode("utf-16")
>
> The actual bytes would be the same, I think, although y is type str and x is type unicode.

No. The internal representation of unicode characters is platform dependent,
and is either 2 or 4 bytes per character. If you want UTF-16, use ".encode()".
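
As a rough illustration of that point, here is a minimal sketch, assuming Python 3 (where the text type is `str` and the byte type is `bytes`; the thread itself is about Python 2's `unicode` and `str`). The sample document in `text` is made up for the example. Only the explicit `.encode()` call produces a well-defined UTF-16 byte sequence:

```python
# Hypothetical sample document, not the one from the original thread.
text = '<?xml version="1.0" encoding="UTF-16"?><root/>'

# Explicit encoding gives a defined byte sequence: a byte order mark (BOM)
# followed by two bytes for each character in the Basic Multilingual Plane.
data = text.encode("utf-16")

# The BOM depends on the machine's native byte order.
bom = data[:2]
print(bom in (b"\xff\xfe", b"\xfe\xff"))

# BOM (2 bytes) plus 2 bytes per character for this ASCII-only text.
print(len(data) == 2 + 2 * len(text))

# Decoding the bytes recovers the original characters.
print(data.decode("utf-16") == text)
```

How the interpreter stores the characters of `text` in memory is an implementation detail and should not be relied on; `data` is the only object here that is guaranteed to hold UTF-16.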


> xml.sax.parseString documentation says
>
> parses from a buffer string received as a parameter,
>
> so one might imagine that either x or y would be acceptable, and the bytes would be interpreted according to the encoding declaration in the byte stream.
>
> And, in fact, both do work with xml.sax.parseString (at least for me). With etree.parse(StringIO.StringIO...) though, only the str form works.

Don't try. Serialised XML is bytes, not characters.
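
A small sketch of the bytes-only approach, assuming Python 3 and the standard library `xml.etree.ElementTree` and `xml.sax` modules. The sample document and the `TagCollector` handler are hypothetical names introduced just for this example. The parser is handed the encoded bytes and works out the encoding from the BOM and the XML declaration:

```python
import io
import xml.etree.ElementTree as ET
import xml.sax

# Hypothetical sample document, encoded explicitly to UTF-16 bytes.
text = '<?xml version="1.0" encoding="UTF-16"?><root><item>x</item></root>'
data = text.encode("utf-16")

# ElementTree: wrap the bytes in a binary file-like object.
root = ET.parse(io.BytesIO(data)).getroot()
print(root.tag, root.find("item").text)


class TagCollector(xml.sax.ContentHandler):
    """Illustrative handler that records element names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


# SAX: parseString receives the same byte string.
handler = TagCollector()
xml.sax.parseString(data, handler)
print(handler.tags)
```

Passing the serialised bytes to both parsers keeps the encoding declaration and the actual data consistent, which is the situation the declaration was written for.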

Stefan