
comp.lang.ruby

Processing a huge xml file

Tim Perrett

7/23/2007 11:28:00 AM

Hey guys

I was wondering what advice anyone could give me about processing a
huge XML file (in fact, it's an XSD file).

Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
with 2 GB of RAM, libxml-ruby eats it up extremely quickly (and uses
about 2.8 GB of virtual memory). This is obviously unacceptable, but I
am not sure whether a workaround exists.

I wanted to load the schema in order to validate the messages and XML I
was generating. Does anyone have any ideas for a potential workaround?

Cheers

Tim
--
Posted via http://www.ruby-....

12 Answers

Lloyd Linklater

7/23/2007 12:51:00 PM


Tim Perrett wrote:

> I was wondering what advice anyone could give me about processing a
> huge XML file (in fact, it's an XSD file).
>
> Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
> with 2 GB of RAM, libxml-ruby eats it up extremely quickly (and uses
> about 2.8 GB of virtual memory). This is obviously unacceptable, but I
> am not sure whether a workaround exists.
>
> I wanted to load the schema in order to validate the messages and XML I
> was generating. Does anyone have any ideas for a potential workaround?

Run it in Windows? :)

But seriously, 20k lines of XML should not take that much memory unless
the lines are HUGE. How about a simplistic approach? I know this is not
intensively Ruby, but it may help.

What if you were to launch it in a browser? Browsers display XML files
in a formatted fashion, which means they must parse them. You could then
search the resulting page for an error message. A text search for "XML
Parsing Error" should tell you whether it worked.
--
Posted via http://www.ruby-....

Trans

7/23/2007 1:45:00 PM




On Jul 23, 4:28 am, Tim Perrett <freestyle_kaya...@hotmail.com> wrote:
> Hey guys
>
> I was wondering what advice anyone could give me about processing a
> huge XML file (in fact, it's an XSD file).
>
> Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
> with 2 GB of RAM, libxml-ruby eats it up extremely quickly (and uses
> about 2.8 GB of virtual memory). This is obviously unacceptable, but I
> am not sure whether a workaround exists.
>
> I wanted to load the schema in order to validate the messages and XML I
> was generating. Does anyone have any ideas for a potential workaround?

libxml has some known issues, memory consumption especially. Hopefully
they will get fixed, but in the meantime one can only frown at the
irony -- <rubyXML> was one of the earliest Ruby web sites around, yet
Ruby's support for _fast_ XML processing is still dearly lacking.

T.


Robert Klemme

7/23/2007 2:20:00 PM


2007/7/23, Tim Perrett <freestyle_kayaker@hotmail.com>:
> Hey guys
>
> I was wondering what advice anyone could give me about processing a
> huge XML file (in fact, it's an XSD file).
>
> Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
> with 2 GB of RAM, libxml-ruby eats it up extremely quickly (and uses
> about 2.8 GB of virtual memory). This is obviously unacceptable, but I
> am not sure whether a workaround exists.
>
> I wanted to load the schema in order to validate the messages and XML I
> was generating. Does anyone have any ideas for a potential workaround?

The generic answer would be: use an XML stream parser (as opposed to a
DOM parser). Even if you directly fill up a model that contains the
whole document, it is likely less resource-intensive than a DOM. Of
course, it is optimal (resource-wise) if you can do your validation on
the fly (i.e. while stream parsing).

Kind regards

robert
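[Editor's note: the stream-parsing approach Robert describes can be sketched with REXML from Ruby's standard library. The listener below only counts element names; the inline document and its element names are invented stand-ins for the real 20,000-line XSD.]

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# A stream listener receives parse events one at a time, so the whole
# document never has to be held in memory as a DOM tree.
class ElementCounter
  include REXML::StreamListener

  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Called once for every opening tag the parser encounters.
  def tag_start(name, _attrs)
    @counts[name] += 1
  end
end

# A small inline document stands in for the real file.
xml = <<~XML
  <schema>
    <element name="a"/>
    <element name="b"/>
    <complexType name="t"/>
  </schema>
XML

listener = ElementCounter.new
REXML::Document.parse_stream(xml, listener)
```

To stream a file instead of a string, pass an open `File` object as the first argument; REXML reads it incrementally rather than slurping it whole.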

Tim Perrett

7/23/2007 2:23:00 PM


Lloyd Linklater wrote:

>
> Run it in Windows? :)
>
> But seriously, 20k lines of XML should not take that much memory unless
> the lines are HUGE. How about a simplistic approach? I know this is not
> intensively Ruby, but it may help.
>
> What if you were to launch it in a browser? Browsers display XML files
> in a formatted fashion, which means they must parse them. You could then
> search the resulting page for an error message. A text search for "XML
> Parsing Error" should tell you whether it worked.

That's a very fair point, actually: if it runs in the browser, it must
be parsable. It's actually 32,606 lines!
Firefox used 500 MB of RAM to open it, so in theory libxml-ruby should
be able to use less, I would have thought? Unless its DOM methodology
is just a lot more memory-intensive?

What are people's thoughts? Is it crazy to ask libxml to read that much
into memory?

Cheers

Tim
--
Posted via http://www.ruby-....

Lloyd Linklater

7/23/2007 2:45:00 PM


Tim Perrett wrote:

> Firefox used 500 MB of RAM to open it, so in theory libxml-ruby should
> be able to use less, I would have thought? Unless its DOM methodology
> is just a lot more memory-intensive?

I am new to Ruby and, as much as I love the language's syntax, I have
yet to see how to actually use it in real-world applications. I know
that is likely to get me into trouble, as everyone else seems to manage
it, but there it is.

That said, I do not know the inner workings of Ruby well enough to dig
that far inside. However, it cannot be the DOM itself, as the browser
uses a DOM to parse as well. Something else must be making the
difference, and finding it goes beyond my Ruby knowledge.
--
Posted via http://www.ruby-....

James Moore

7/23/2007 3:00:00 PM


On 7/23/07, Tim Perrett <freestyle_kayaker@hotmail.com> wrote:
> I was wondering what advice anyone could possibly hand me about
> processing a huge XML (in fact its an XSD file)

Something's going wrong. 20k lines is a pretty small XML file; we're
sucking in files larger than that (50 MB or so -- a little less than a
million lines long) many times a day using the Ruby libxml bindings and
not seeing a similar issue. It's possible that your average line length
is _much_ longer than ours, of course. Our normal process size is about
400 MB, but a big chunk of that is the processing we're doing on the
data; I want to say that the size after loading in the XML is in the
200 MB range, but I haven't looked at that for a while.

Are you doing stream processing? We never tried to load the whole
document at once, so there may be an issue with doing that.

- James Moore

Tim Perrett

7/23/2007 4:09:00 PM


Lloyd Linklater wrote:
> That said, I do not know the inner workings of Ruby well enough to dig
> that far inside. However, it cannot be the DOM itself, as the browser
> uses a DOM to parse as well. Something else must be making the
> difference, and finding it goes beyond my Ruby knowledge.

I wonder if it's something to do with the XSD includes and imports that
it doesn't like... I might have to ask the libxml core team.

Cheers

Tim
--
Posted via http://www.ruby-....

Raymond O'connor

7/26/2007 4:41:00 AM


I wrote a Ruby script which parses a 25 GB XML file. I used the
XMLParser library from http://www.yoshidam.net...

So parsing a large amount of xml can definitely be accomplished.

-Ray
--
Posted via http://www.ruby-....

Tim Perrett

7/26/2007 10:57:00 PM


Hey all

Thanks for your replies!

The file in question is actually an XSD file, so I think you're right:
XML::Schema.new() would use DOM parsing. Does libxml even support
stream parsing? I can't seem to find a great deal on it...

Has anyone ever had any experience with such a large XSD? I can't think
of a way of validating the instance XML without the XSD being held in
memory to check against.

How do things like Xerces manage it with Java?

I fear I might be wanting the impossible! lol

Cheers

-Tim
--
Posted via http://www.ruby-....
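[Editor's note: Ruby's standard library has no XSD validator, but Robert's earlier suggestion of validating on the fly while stream parsing can be illustrated with a toy check. The whitelist of element names below is an invented stand-in for a real schema; a real validator would also track structure and types.]

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Toy "schema": the set of element names we consider valid.
ALLOWED = %w[order item price].freeze

class WhitelistValidator
  include REXML::StreamListener

  attr_reader :errors

  def initialize(allowed)
    @allowed = allowed
    @errors  = []
  end

  # Flag any element whose name is not in the whitelist the moment the
  # parser reaches it -- the document is never built up in memory.
  def tag_start(name, _attrs)
    @errors << name unless @allowed.include?(name)
  end
end

xml = <<~XML
  <order>
    <item>widget</item>
    <price>9.99</price>
    <bogus/>
  </order>
XML

validator = WhitelistValidator.new(ALLOWED)
REXML::Document.parse_stream(xml, validator)
```

After parsing, `validator.errors` holds the offending element names; errors could just as well be raised immediately inside `tag_start` to abort on the first violation.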

Robert Klemme

7/27/2007 6:19:00 AM

0

2007/7/27, Tim Perrett <freestyle_kayaker@hotmail.com>:
> The file in question is actually an XSD file, so I think you're right:
> XML::Schema.new() would use DOM parsing. Does libxml even support
> stream parsing? I can't seem to find a great deal on it...
>
> Has anyone ever had any experience with such a large XSD? I can't think
> of a way of validating the instance XML without the XSD being held in
> memory to check against.

Yes and no: since the XML (the XSD in your case) is known, the parser
could store an optimized representation in memory (i.e. it does not
need the original DOM).

> How do things like Xerces manage it with Java?

When a colleague tested JDOM a few years ago, it needed loads of
memory. But of course, that could have changed by now (and there are
64-bit JVMs, too).

> I fear I might be wanting the impossible! lol

"Impossible is nothing - Ruby..." :-)

Kind regards

robert