Asp Forum - Parse XML that isn't well formed

Milo Thurston

9/19/2007 10:06:00 AM

I have some XML looking like the following, other than being very much
larger (some files are up to 2GB):

<?xml version="1.0" encoding="UTF-8"?>
<server_url>http://myserver.edu/data/</serv...
<server_name>myserver.edu</server_name>
<uploads>
<result>
<dir>/storage/data/results/</dir>
<result_name>hadcm3l_00012_00000118_0</result_name>
<file_info>
<name>hadcm3l_00012_00000118_0_6.zip</name>
<nbytes>5154055</nbytes>
<md5_checksum>485600296bb601ab4a3d1d49a9fb1c86</md5_checksum>
</file_info>
<file_info>
<name>hadcm3l_00012_00000118_0_7.zip</name>
<nbytes>5153055</nbytes>
<md5_checksum>36a600296cb60229a3d1d49a9fb1a10</md5_checksum>
</file_info>
</result>
</uploads>
</xml>

I've tried a few xml parsers such as xml-simple, libxml and quixml, but
all reject this data as badly formed. One answer would, of course, be
for the data to be re-generated using properly formed xml. Meanwhile, is
there anything that could be done with the existing files? Is it a case
of having to write regexps to parse this sort of thing?
--
Posted via http://www.ruby-....

4 Answers

Alex LeDonne

9/19/2007 6:53:00 PM

On 9/19/07, Milo Thurston <knirirr@gmail.com> wrote:
> I have some XML looking like the following, other than being very much
> larger (some files are up to 2GB):
>
> <?xml version="1.0" encoding="UTF-8"?>
> <server_url>http://myserver.edu/data/</serv...
> <server_name>myserver.edu</server_name>
> <uploads>
> <result>
> <dir>/storage/data/results/</dir>
> <result_name>hadcm3l_00012_00000118_0</result_name>
> <file_info>
> <name>hadcm3l_00012_00000118_0_6.zip</name>
> <nbytes>5154055</nbytes>
> <md5_checksum>485600296bb601ab4a3d1d49a9fb1c86</md5_checksum>
> </file_info>
> <file_info>
> <name>hadcm3l_00012_00000118_0_7.zip</name>
> <nbytes>5153055</nbytes>
> <md5_checksum>36a600296cb60229a3d1d49a9fb1a10</md5_checksum>
> </file_info>
> </result>
> </uploads>
> </xml>
>

Note that there should be no </xml> - the line at the top is a
declaration, not an opening tag. Where did </xml> come from? What
happens if you remove that from the data?

-A

Milo Thurston

9/20/2007 1:27:00 PM

Alex LeDonne wrote:
> Note that there should be no </xml> - the line at the top is a
> declaration, not an opening tag. Where did </xml> come from? What
> happens if you remove that from the data?

Good point about the XML. Unfortunately, these are the files I have
received, and I have to deal with them for now.

Removing the final tag gives:

file.xml:3: parser error : Extra content at the end of the document
<server_name>myserver.edu</server_name>
^
rake aborted!
--
Posted via http://www.ruby-....

Jano Svitok

9/21/2007 9:30:00 AM

On 9/20/07, Milo Thurston <knirirr@gmail.com> wrote:
> Alex LeDonne wrote:
> > Note that there should be no </xml> - the line at the top is a
> > declaration, not an opening tag. Where did </xml> come from? What
> > happens if you remove that from the data?
>
>
> Good point about the XML. Unfortunately, these are the files I have
> received and have to deal with them for now.
>
> Removing the final tag gives:
>
> .file.xml:3: parser error : Extra content at the end of the document
> <server_name>myserver.edu</server_name>
> ^
> rake aborted!

You should have done two things: 1. add root node <server> (with
closing </server> just before </xml>) AND 2. remove the trailing
</xml>

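Then it'll be fine: a well-formed document needs exactly one root
element, and the declaration at the top doesn't count as one.

In your case it's easy: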

# sub rather than gsub: each marker should be replaced only once
fixed = data.sub('?>', '?><server>').sub('</xml>', '</server>')
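
For files too large to hold in a single string, the same two substitutions can probably be applied while streaming. This is a minimal sketch, assuming the declaration ends on the first line and the stray </xml> sits on a line of its own, as in the sample. The name repair_xml and the file paths are hypothetical, not part of any library:

```ruby
# Hypothetical helper: copies src to dst one line at a time, wrapping the
# content in a <server> root element and dropping the stray </xml> tag.
# Assumes the XML declaration ends on the first line and that </xml>
# appears alone on its line.
def repair_xml(src, dst)
  File.open(dst, 'w') do |out|
    first = true
    File.foreach(src) do |line|
      if first
        # Open the root element right after the declaration.
        out.write(line.sub('?>', '?><server>'))
        first = false
      elsif line.strip == '</xml>'
        # Swap the bogus closing tag for the root element's closing tag.
        out.puts('</server>')
      else
        out.write(line)
      end
    end
  end
end
```

Calling repair_xml('file.xml', 'file.fixed.xml') would leave the original untouched and write a copy that a conforming parser should accept, provided the rest of the file is well formed.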

Milo Thurston

9/21/2007 9:41:00 AM

Jano Svitok wrote:
> You should have done two things: 1. add root node <server> (with
> closing </server> just before </xml>) AND 2. remove the trailing
> </xml>

Great, thanks.
That should sort out the "legacy" files, and future ones can be
corrected.

I have also been parsing each line with IO.foreach and
/<(.+)[^>]*>(.+?)<(\/.+)>/, which, though not as nice as a proper XML
parser, does avoid loading huge files into memory in one go.
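
A sketch of that line-by-line approach, assuming every leaf element opens and closes on one line, as in the sample. The stricter pattern and the name each_file_info are illustrative assumptions, not the regexp quoted above:

```ruby
# Matches a leaf element that opens and closes on one line,
# e.g. <name>x.zip</name>; the backreference \1 forces matching tag names.
LEAF = %r{\A\s*<(\w+)>(.*?)</\1>\s*\z}

# Hypothetical helper: yields one Hash per <file_info> block, reading the
# file a line at a time instead of loading it whole.
def each_file_info(path)
  current = nil
  File.foreach(path) do |line|
    case line.strip
    when '<file_info>'
      current = {}
    when '</file_info>'
      yield current if current
      current = nil
    else
      m = LEAF.match(line)
      # Only collect leaf elements that sit inside a <file_info> block.
      current[m[1]] = m[2] if m && current
    end
  end
end
```

Each yielded hash would carry the keys name, nbytes and md5_checksum for one file. Since this never checks well-formedness, it works on the broken files as they are, but it will silently skip any element that spans more than one line.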
--
Posted via http://www.ruby-....

comp.lang.ruby
