comp.lang.python

Mysterious xml.sax Encoding Exception

JKPeck

2/1/2008 6:04:00 PM

I have a module that uses xml.sax and feeds it a string of xml as in
xml.sax.parseString(dictfile,handler)

The xml is always encoded in utf-16, and the XML string always starts
with
<?xml version="1.0" encoding="UTF-16" standalone="no"?>

This almost always works fine, but two users of this module get an
exception whatever input they use it on. (The actual xml is generated
by an api in our application that returns an xml version of metadata
associated with the application's data.)

The exception is
xml.sax._exceptions.SAXParseException: <unknown>:1:30: encoding
specified in XML declaration is incorrect.

In both of these cases, there are only plain, 7-bit ascii characters
in the xml, and it really is valid utf-16 as far as I can tell.

Now here is the hard part: This never happens to me, and having gotten
the actual xml content from one of the users and fed it to the parser,
I don't get the exception.

What could be going on? We are all on Python 2.5 (and all on an
English locale).

Any suggestions would be appreciated.
-Jon Peck
10 Answers

Martin v. Loewis

2/1/2008 8:23:00 PM

> In both of these cases, there are only plain, 7-bit ascii characters
> in the xml, and it really is valid utf-16 as far as I can tell.

What do you mean by "7-bit ascii characters"? If it means what I think
it means (namely, a sequence of bytes whose values are between 1 and
127), then it is *not* valid utf-16.
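
A small sketch of the distinction being drawn here, written in Python 3 syntax rather than the Python 2.5 used in this thread (an assumption made so that it runs on a current interpreter; the sample string is only illustrative). Characters from the ASCII range are perfectly legal content for a UTF-16 document, but the bytes that carry them are not the ASCII bytes: each character takes two bytes, one of which is zero.

```python
# Illustrative sample; any text limited to ASCII characters would do.
text = '<?xml version="1.0" encoding="UTF-16"?>'

ascii_bytes = text.encode("ascii")      # one byte per character, all in 1..127
utf16_bytes = text.encode("utf-16-le")  # two bytes per character, no BOM

# Little-endian UTF-16 stores the low byte first, so the even offsets
# reproduce the ASCII bytes and the odd offsets are all zero.
low_bytes = utf16_bytes[0::2]
high_bytes = utf16_bytes[1::2]

print(len(ascii_bytes), len(utf16_bytes))
print(low_bytes == ascii_bytes, set(high_bytes) == {0})
```

So a byte string made only of values 1 to 127 cannot be the UTF-16 form of the same text, even though the characters themselves are plain ASCII.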

> Now here is the hard part: This never happens to me, and having gotten
> the actual xml content from one of the users and fed it to the parser,
> I don't get the exception.
>
> What could be going on? We are all on Python 2.5 (and all on an
> English locale).

What operating system do they use, and how do they send you the file
for verification? Can you have them run

print repr(open(filename, "rb").read(10))

and send you its output?
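
As a rough illustration of what that output can tell you, here is a sketch in Python 3 syntax (not the Python 2.5 of this thread) that labels the byte order mark, if any, at the start of a byte string. The helper name describe_head and the table _BOMS are made up for this example; only the codecs.BOM_* constants come from the standard library.

```python
import codecs

# Longer marks first: the UTF-32 little-endian BOM begins with the
# UTF-16 little-endian BOM, so the order of the checks matters.
_BOMS = (
    ("utf-32-le", codecs.BOM_UTF32_LE),
    ("utf-32-be", codecs.BOM_UTF32_BE),
    ("utf-8", codecs.BOM_UTF8),
    ("utf-16-le", codecs.BOM_UTF16_LE),
    ("utf-16-be", codecs.BOM_UTF16_BE),
)


def describe_head(data, count=10):
    """Return (bom_label_or_None, repr_of_first_bytes) for a bytes object."""
    head = data[:count]
    for label, bom in _BOMS:
        if head.startswith(bom):
            return label, repr(head)
    return None, repr(head)
```

It could be called as describe_head(open(filename, "rb").read(10)); a result of None for the label means the data starts without any BOM, which is itself worth knowing.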

Regards,
Martin

JKPeck

2/1/2008 8:31:00 PM

On Feb 1, 1:22 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> > In both of these cases, there are only plain, 7-bit ascii characters
> > in the xml, and it really is valid utf-16 as far as I can tell.
>
> What do you mean by "7-bit ascii characters"? If it means what I think
> it means (namely, a sequence of bytes whose values are between 1 and
> 127), then it is *not* valid utf-16.
>
> > Now here is the hard part: This never happens to me, and having gotten
> > the actual xml content from one of the users and fed it to the parser,
> > I don't get the exception.
>
> > What could be going on? We are all on Python 2.5 (and all on an
> > English locale).
>
> What operating system do they use, and how do they send you the file
> for verification? Can you have them run
>
> print repr(open(filename, "rb").read(10))
>
> and send you its output?
>
> Regards,
> Martin

They sent me the actual file, which was created on Windows, as an
email attachment. They had also sent the actual dataset from which
the XML was generated so that I could generate it myself using the
same version of our app as the user has. I did that but did not get
an exception.

Martin v. Loewis

2/1/2008 8:52:00 PM

> They sent me the actual file, which was created on Windows, as an
> email attachment. They had also sent the actual dataset from which
> the XML was generated so that I could generate it myself using the
> same version of our app as the user has. I did that but did not get
> an exception.

So are you sure you open the file in binary mode on Windows?

Regards,
Martin

JKPeck

2/1/2008 9:13:00 PM

On Feb 1, 1:51 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> > They sent me the actual file, which was created on Windows, as an
> > email attachment. They had also sent the actual dataset from which
> > the XML was generated so that I could generate it myself using the
> > same version of our app as the user has. I did that but did not get
> > an exception.
>
> So are you sure you open the file in binary mode on Windows?
>
> Regards,
> Martin

In the real case, the xml never goes through a file but is handed
directly to the parser. The API returns a Python Unicode string
(utf-16). For the file the user sent, if I open it in binary mode, it
still has a BOM; otherwise the BOM is removed. But either version
works on my system.

The basic fact, though, remains, the same code works for me with the
same input but not for two particular users (out of hundreds).

Regards,
Jon

Martin v. Loewis

2/2/2008 2:42:00 AM

> The basic fact, though, remains, the same code works for me with the
> same input but not for two particular users (out of hundreds).

I see. That's mysterious.

Regards,
Martin

Jeroen Ruigrok van der Werven

2/2/2008 7:57:00 AM

-On [20080201 19:06], JKPeck (JKPeck@gmail.com) wrote:
>In both of these cases, there are only plain, 7-bit ascii characters
>in the xml, and it really is valid utf-16 as far as I can tell.

Did you mean to say that the only characters they used in the UTF-16 encoded
file are characters from the Basic Latin Unicode block?

--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-n... | http://www.ra...
We have met the enemy and they are ours...

John Machin

2/2/2008 9:20:00 AM

On Feb 2, 8:12 am, JKPeck <JKP...@gmail.com> wrote:
> On Feb 1, 1:51 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
>
> > > They sent me the actual file, which was created on Windows, as an
> > > email attachment. They had also sent the actual dataset from which
> > > the XML was generated so that I could generate it myself using the
> > > same version of our app as the user has. I did that but did not get
> > > an exception.
>
> > So are you sure you open the file in binary mode on Windows?
>
> > Regards,
> > Martin
>
> In the real case, the xml never goes through a file but is handed
> directly to the parser. The api return a Python Unicode string
> (utf-16).

A Python unicode object is *NOT* the UTF-16 that the SAX parser is
expecting. It is expecting a str object which is Unicode text encoded
as UTF-16.

>>> unicode_obj = u'abcde'
>>> str_obj = unicode_obj.encode('UTF-16')
>>> print repr(unicode_obj)
u'abcde'
>>> print repr(str_obj)
'\xff\xfea\x00b\x00c\x00d\x00e\x00'
>>>

At the end of this post is code using a str object (works) then
attempting to use a unicode object (reproduces your error message).

> For the file the user sent, if I open it in binary mode, it
> still has a BOM; otherwise the BOM is removed. But either version
> works on my system.
>
> The basic fact, though, remains, the same code works for me with the
> same input but not for two particular users (out of hundreds).

If the real case doesn't involve a file, I can't imagine what you can
infer from a file that isn't used [strike 1] sent to you by a user
[strike 2].

Consider trapping the exception, writing
repr(the_xml_document_string[:80]) to the log file, and re-raising the
exception. Get the user to run the app, then inspect the log file.
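
A minimal sketch of that trap-log-reraise idea, written in Python 3 syntax (the thread itself is on Python 2.5). The function name parse_with_diagnostics and the logger name are hypothetical; xml.sax.parseString and xml.sax.SAXParseException are the standard-library names.

```python
import logging
import xml.sax

log = logging.getLogger("xmlparse")  # hypothetical logger name


def parse_with_diagnostics(document, handler):
    """Parse document; on a SAX parse error, log its type and head, then re-raise."""
    try:
        xml.sax.parseString(document, handler)
    except xml.sax.SAXParseException:
        # The type matters as much as the content: a text object and an
        # encoded byte string can print alike but parse differently.
        log.error(
            "SAX parse failed; type=%s head=%r",
            type(document).__name__,
            document[:80],
        )
        raise
```

Logging the type alongside the first 80 characters is a deliberate choice here, since the question of whether the parser received text or encoded bytes is exactly what is in doubt.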

Here's the promised code and results.

C:\junk>type utf16sax.py
import xml.sax, xml.sax.saxutils
import cStringIO
asciistr = 'qwertyuiop'
xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
unicode_doc = (xml_template % ('UTF-16', asciistr)).decode('ascii')
utf16_doc = unicode_doc.encode('UTF-16')
for doc in (utf16_doc, unicode_doc):
    print
    print 'doc = ', repr(doc)
    print
    f = cStringIO.StringIO()
    handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
    xml.sax.parseString(doc, handler)
    result = f.getvalue()
    f.close()
    start = result.find('<data>') + 6
    end = result.find('</data>')
    mydata = result[start:end]
    print "SAX output (UTF-8): %r" % mydata


C:\junk>utf16sax.py

doc = '\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00 \x00e\x00n\x00c\x00o\x00d\x00i\x00n\x00g\x00=\x00"\x00U\x00T\x00F\x00-\x001\x006\x00"\x00?\x00>\x00<\x00d\x00a\x00t\x00a\x00>\x00q\x00w\x00e\x00r\x00t\x00y\x00u\x00i\x00o\x00p\x00<\x00/\x00d\x00a\x00t\x00a\x00>\x00'

SAX output (UTF-8): 'qwertyuiop'

doc = u'<?xml version="1.0" encoding="UTF-16"?><data>qwertyuiop</data>'

Traceback (most recent call last):
  File "C:\junk\utf16sax.py", line 13, in <module>
    xml.sax.parseString(doc, handler)
  File "C:\Python25\lib\xml\sax\__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Python25\lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "C:\Python25\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:30: encoding specified in XML declaration is incorrect

My guess is that the unicode object is coerced to str using the
default encoding (ascii). The parser then looks at the result, parses
out the "UTF-16" declaration, attempts to decode the ASCII bytes as
utf-16, fails, and complains.

HTH,
John

JKPeck

2/4/2008 10:02:00 PM

On Feb 2, 12:56 am, Jeroen Ruigrok van der Werven <asmo...@in-nomine.org> wrote:
> -On [20080201 19:06], JKPeck (JKP...@gmail.com) wrote:
>
> >In both of these cases, there are only plain, 7-bit ascii characters
> >in the xml, and it really is valid utf-16 as far as I can tell.
>
> Did you mean to say that the only characters they used in the UTF-16 encoded
> file are characters from the Basic Latin Unicode block?
>
> --
> Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
> イェルーン ラウフロック ヴァン デル ウェルヴェン
> http://www.in-n... | http://www.ra...
> We have met the enemy and they are ours...

It appears that the root cause of this problem is indeed passing a
Unicode XML string to xml.sax.parseString with an encoding declaration
in the XML of utf-16. This works with the standard distribution on
Windows. It does not work with ActiveState on Windows even though
both distributions report 64K for sys.maxunicode.

So I don't know why the results are different, but the problem is
solved by encoding the Unicode string into utf-16 before passing it to
the parser.
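
A sketch of that fix in Python 3 syntax, where str is the text type (Python 2's unicode) and bytes is the encoded form (Python 2's str). The names TextCollector and parse_xml_text are hypothetical, and the default of "utf-16" simply mirrors the declaration used in this thread; the encoding passed in is assumed to match whatever the XML declaration says.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_xml_text(document, handler, encoding="utf-16"):
    """Encode text to bytes matching its XML declaration, then parse."""
    if isinstance(document, str):
        # Hand the parser bytes that really are in the declared encoding.
        document = document.encode(encoding)
    xml.sax.parseString(document, handler)


doc = '<?xml version="1.0" encoding="UTF-16"?><data>qwertyuiop</data>'
collector = TextCollector()
parse_xml_text(doc, collector)
print("".join(collector.chunks))
```

Encoding explicitly means the bytes the parser sees always agree with the declaration, whichever interpreter release or distribution happens to be installed.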

Thanks to all for helping to track this down.

Regards,
Jon Peck

John Machin

2/4/2008 11:10:00 PM

On Feb 5, 9:02 am, JKPeck <JKP...@gmail.com> wrote:
> On Feb 2, 12:56 am, Jeroen Ruigrok van der Werven <asmo...@in-nomine.org> wrote:
> > -On [20080201 19:06], JKPeck (JKP...@gmail.com) wrote:
>
> > >In both of these cases, there are only plain, 7-bit ascii characters
> > >in the xml, and it really is valid utf-16 as far as I can tell.
>
> > Did you mean to say that the only characters they used in the UTF-16 encoded
> > file are characters from the Basic Latin Unicode block?
>
>
> It appears that the root cause of this problem is indeed passing a
> Unicode XML string to xml.sax.parseString with an encoding declaration
> in the XML of utf-16. This works with the standard distribution on
> Windows.

It did NOT work for me with the standard 2.5.1 Windows distribution --
see the code + output that I posted.

> It does not work with ActiveState on Windows even though
> both distributions report
> 64K for sys.maxunicode.
>
> So I don't know why the results are different, but the problem is
> solved by encoding the Unicode string into utf-16 before passing it to
> the parser.

JKPeck

2/5/2008 3:42:00 PM

On Feb 4, 4:09 pm, John Machin <sjmac...@lexicon.net> wrote:
> On Feb 5, 9:02 am, JKPeck <JKP...@gmail.com> wrote:
>
>
>
> > On Feb 2, 12:56 am, Jeroen Ruigrok van der Werven <asmo...@in-nomine.org> wrote:
> > > -On [20080201 19:06], JKPeck (JKP...@gmail.com) wrote:
>
> > > >In both of these cases, there are only plain, 7-bit ascii characters
> > > >in the xml, and it really is valid utf-16 as far as I can tell.
>
> > > Did you mean to say that the only characters they used in the UTF-16 encoded
> > > file are characters from the Basic Latin Unicode block?
>
> > It appears that the root cause of this problem is indeed passing a
> > Unicode XML string to xml.sax.parseString with an encoding declaration
> > in the XML of utf-16. This works with the standard distribution on
> > Windows.
>
> It did NOT work for me with the standard 2.5.1 Windows distribution --
> see the code + output that I posted.
>
> > It does not work with ActiveState on Windows even though
> > both distributions report
> > 64K for sys.maxunicode.
>
> > So I don't know why the results are different, but the problem is
> > solved by encoding the Unicode string into utf-16 before passing it to
> > the parser.

Interesting. In the course of installing and testing with
ActiveState, I upgraded from the standard distribution 2.5.0 to
2.5.1. The former worked; the latter does not (with the original
code). So that .1 seems to matter here, and that probably accounts
for why ActiveState raised the exception and the standard 2.5.0 did
not.
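
Since a micro-release turned out to be the difference, it may help to have users send interpreter details along with any failure report. A sketch in Python 3 syntax; the function name environment_report and the dictionary keys are made up for this example, while the platform and sys calls are standard library.

```python
import platform
import sys


def environment_report():
    """Collect interpreter facts worth attaching to a bug report."""
    return {
        "python_version": platform.python_version(),   # includes the micro release
        "implementation": platform.python_implementation(),
        "build": platform.python_build(),              # (build number, build date)
        "maxunicode": sys.maxunicode,
        "default_encoding": sys.getdefaultencoding(),
    }


for key, value in sorted(environment_report().items()):
    print("%s: %s" % (key, value))
```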

-Jon