
comp.lang.ruby

REXML ... performance & memory usage ...

Jeff Wood

11/3/2006 11:39:00 PM

Wow ... I am trying to use REXML to parse through an 8.8MB XML file ...
it currently eats almost 800MB of RAM before it seems to do anything ...

Does anybody have any tips on getting REXML to run faster and/or
smaller?

I know it's slow partly because it's pure Ruby ... and there's a lot
going on ... but I can sit here for many minutes just waiting for ANY
console output showing that it's actually reached the first
root.elements.each( xpath_expr ) iteration ...

Hints/tips would be VERY much appreciated.

Thanks in advance.

jd


13 Answers

Tom Werner

11/3/2006 11:47:00 PM


Jeff Wood wrote:
> Does anybody have any tips on getting REXML to run faster and/or
> smaller ???
>

If having a pure ruby parser is not a requirement and you're on *nix,
then you can get great performance out of:

http://libxml.ruby...

It uses libxml2 for the parsing, and as such is quite speedy.

Tom


Jeff Wood

11/4/2006 12:33:00 AM


Tom Werner wrote:
> Jeff Wood wrote:
>> Does anybody have any tips on getting REXML to run faster and/or
>> smaller ???
>>
>
> If having a pure ruby parser is not a requirement and you're on *nix,
> then you can get great performance out of:
>
> http://libxml.ruby...
>
> It uses libxml2 for the parsing, and as such is quite speedy.
>
> Tom
>
>
I had to make two fixes to the source to get things to compile.

ruby_xml_parser.c & ruby_xml_document.c both needed #include
<stdarg.h> added ... the compiler wasn't happy about trying to deal
with the va_list data type without it.

But, it's compiling now ... just thought I'd pass the information along
for ya.

After modifying my script to use the libxml binding ... it's sitting at
about 220MB used instead of 800+MB ... (better) ... and only takes
10-20 seconds to start iterating over data ...

So, thank you for the pointer ...

jd


David Vallner

11/4/2006 12:40:00 AM


Jeff Wood wrote:
> Wow ... I am trying to use REXML to parse through an 8.8Mb xml file ...
> it currently eats almost 800Mb of ram before it seems to do anything ...
>

At that file size, I'd also start thinking about biting the bullet and
using stream / pull parsing instead of tree parsing. Even with a C
parser, building a full DOM won't do much good for performance if you
need to process data at that scale more than occasionally. But then
again, there's the premature-optimization quote that says to hold off
on that for now.

David Vallner
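For the record, REXML itself ships a stream-parsing interface in the
standard library. A minimal sketch of the idea (the `ItemCounter` class
and the sample XML are invented for illustration):

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Count <item> elements without building a tree; memory use stays flat
# regardless of document size, since no DOM is ever constructed.
class ItemCounter
  include REXML::StreamListener
  attr_reader :count

  def initialize
    @count = 0
  end

  # Called once per opening tag as the parser streams through the input.
  def tag_start(name, _attrs)
    @count += 1 if name == 'item'
  end
end

counter = ItemCounter.new
xml = '<root><item/><item/><item/></root>'
REXML::Document.parse_stream(xml, counter)
puts counter.count  # prints 3
```

The same listener works unchanged on a multi-hundred-MB file fed in as
an IO object, which is the whole point of going stream-based.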

Devin Mullins

11/4/2006 12:48:00 AM


Jeff Wood wrote:
> After modifying my script to use the libxml binding ... it's sitting @
> about 220M used instead of 800+M ... ( better ) ... and does only take
> 10-20 seconds to start iterating over data ...

WOW.

You might try optimizing your XPath query. I'm no expert at this (or
even knowledgeable), but I did find in the past that changing the XPath
sometimes made a drastic difference in performance.

Devin
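For what it's worth, one common XPath tweak is preferring a rooted path
over a `//` descendant search, since the latter implies scanning the
whole tree. A small REXML sketch (the sample document is made up):

```ruby
require 'rexml/document'

doc = REXML::Document.new('<root><a><b>1</b></a><a><b>2</b></a></root>')

# '//b' searches every node in the document for descendants named b;
# '/root/a/b' only walks the spelled-out path. Same results here, but
# the rooted form can be much cheaper on large trees.
descendant = REXML::XPath.match(doc, '//b').map(&:text)
rooted     = REXML::XPath.match(doc, '/root/a/b').map(&:text)

puts descendant.inspect  # ["1", "2"]
puts rooted.inspect      # ["1", "2"]
```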

Chilkat Software

11/4/2006 1:12:00 AM



Jeff,

I recently ported the (freeware) Chilkat XML parser to Ruby, but it only
runs on Windows. I'm curious to see how it performs in comparison. Do
you have a simple example w/ data that I can use to convert to Chilkat
XML? I'll be happy to write the code...

Best Regards,
Matt Fausey




Robert Klemme

11/4/2006 8:58:00 AM


David Vallner wrote:
> Jeff Wood wrote:
>> Wow ... I am trying to use REXML to parse through an 8.8Mb xml file ...
>> it currently eats almost 800Mb of ram before it seems to do anything ...
>
> At that file size, I'd also start thinking about biting the bullet and
> using stream / pull parsing instead of tree parsing. Even with a C
> parser, building a full DOM won't do much good for performance if you
> need to process data at that scale more than occasionally. But then
> again, there's the premature-optimization quote that says to hold off
> on that for now.

I would not necessarily call that premature optimization. If these
kinds of files are to be parsed frequently, and if only a portion of
them needs to be extracted, then I would also go down the stream parser
road.

In my experience, stream parsers are also appropriate if you have to
transform the XML tree of a document into some other object structure.
IMHO the coding effort for transforming a DOM into another object tree
vs. doing the same with the stream approach is roughly equivalent. And
runtime-wise you save yourself one whole tree traversal by going stream.

Kind regards

robert
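The transform-straight-from-the-stream approach Robert describes can be
sketched with REXML's listener. The `Page` struct and the XML shape are
invented for illustration:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

Page = Struct.new(:id, :title)

# Build an array of Page objects directly from stream events,
# skipping the intermediate DOM (and its traversal) entirely.
class PageBuilder
  include REXML::StreamListener
  attr_reader :pages

  def initialize
    @pages = []
    @current = nil  # the Page being assembled
    @field = nil    # which child element we are inside, if any
  end

  def tag_start(name, _attrs)
    case name
    when 'page'        then @current = Page.new
    when 'id', 'title' then @field = name
    end
  end

  def text(data)
    return unless @current && @field
    if @field == 'id'
      @current.id = data
    else
      @current.title = data
    end
  end

  def tag_end(name)
    case name
    when 'page'        then @pages << @current; @current = nil
    when 'id', 'title' then @field = nil
    end
  end
end

builder = PageBuilder.new
xml = '<pages><page><id>1</id><title>Intro</title></page></pages>'
REXML::Document.parse_stream(xml, builder)
puts builder.pages.first.title  # prints Intro
```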

Ross Bamford

11/4/2006 10:23:00 AM


On Sat, 2006-11-04 at 09:33 +0900, Jeff Wood wrote:
> I had to make two fixes to the source to get things to compile.
>
> ruby_xml_parser.c & ruby_xml_document.c both needed #include
> <stdarg.h> added ... the compiler wasn't happy about trying to deal
> with the va_list data type without it.


It's a good job I try to keep up with happenings on ruby-talk :) Thanks
for posting about this - it's fixed in CVS now.

Also, given your input data, you might be interested to know that I'm
currently working on a developmental branch for libxml-ruby 0.4, which
includes a new, faster SAX callback interface (among many other
changes). The branch name is DEV_0_4, and it's getting to be quite
stable now.

Also, we have a mailing list:

http://rubyforge.org/mail/?gr...

Thanks again,
--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk


Tomasz Wegrzanowski

11/8/2006 11:04:00 PM


On 11/4/06, Jeff Wood <jeff@dark-light.com> wrote:
> Wow ... I am trying to use REXML to parse through an 8.8Mb xml file ...
> it currently eats almost 800Mb of ram before it seems to do anything ...
>
> Hints/Tips are/would be VERY much appreciated.

magic/xml has an extremely convenient stream-parsing interface.
It's based on REXML, so it's pretty slow, but it handles XML files
hundreds of MB big using just a few MB of memory.

The idea is simple - you give it a block, and the block keeps getting
incomplete subtrees. Each call can either decide to complete the
current subtree (reading all its children into memory) or to descend
into it.
It's something like:

  XML.parse_as_twigs(STDIN) {|node|
    next unless node.name == :page
    node.complete!      # read all children of the <page>...</page> node
    t = node[:@title]   # :@title is a child
    i = node[:@id]      # :@id is another child
    print "#{i}: #{t}\n"
  }

A short tutorial at http://zabor.org/taw/magic_xml/tut...

I think subtree-based parsers are a great tradeoff between
convenience of read-everything parsers and low memory use
of stream-based parsers. Deciding inside a block seems
much more natural than predefining matched tags (like
in Perl's XML::Twig).

Enjoy :-)

--
Tomasz Wegrzanowski [ http://t-a-w.blo... ]
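The standard library also offers a pull-style middle ground between
tree and callback parsing: REXML::Parsers::PullParser, where you ask
for events one at a time instead of receiving callbacks. A minimal
sketch (the sample XML is invented):

```ruby
require 'rexml/parsers/pullparser'

# Pull events off the parser one by one; only the current event is in
# memory, and the loop decides what to keep.
parser = REXML::Parsers::PullParser.new(
  '<pages><page>A</page><page>B</page></pages>'
)

texts = []
while parser.has_next?
  event = parser.pull
  texts << event[0] if event.text?  # event[0] holds the text content
end

puts texts.inspect  # ["A", "B"]
```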

Jeff Wood

11/8/2006 11:22:00 PM


Tomasz Wegrzanowski wrote:

> magic/xml has extremely convenient stream parsing interface.
> It's based on REXML so it's pretty slow, but it handles hundreds of
> MBs big XMLs using just a few MBs of memory.

Thanks for the tip, I'll have to take a look...

jd


Marcus Bristav

11/9/2006 8:11:00 AM


On 11/9/06, Tomasz Wegrzanowski <tomasz.wegrzanowski@gmail.com> wrote:
> I think subtree-based parsers are a great tradeoff between
> convenience of read-everything parsers and low memory use
> of stream-based parsers. Deciding inside a block seems
> much more natural than predefining matched tags (like
> in Perl's XML::Twig).
>

Back in the world of j... there are libraries like Nux and dom4j (and
probably more). They let you stream-parse and register callbacks
against XPath expressions. Whenever a registered XPath is matched, the
callback is invoked with a DOM object (not W3C DOM...) for the complete
subtree. This is very convenient and raises the abstraction a bit (the
XPath part) compared to what seems to be your approach. They don't
allow full XPath, only the parts that make sense in this context.

Anyways, look into it, it's very nice.

/Marcus

ps. I think XML processing tools suck quite a bit in Ruby (and I love
Ruby...). You cannot do high-performance processing in a cross-platform
way (as far as I know): libxml on *nix or MSXML on Windows, since REXML
sucks performance-wise. It's kind of sad. Is it impossible to make
libxml/libxslt work on Windows?