Asp Forum - fast XML parser, other than libxml

Peter Szinek

4/4/2007 3:01:00 AM

Hello all,

I am looking for a fast XML parser, other than libxml (REXML is not fast
enough, and Hpricot won't do this time - I need 'real' XPaths etc).

Some time ago I read about xaggly, but now the site seems to be dead.

Any other suggestions?

Cheers,
Peter
_
http://www.rubyra... :: Ruby and Web2.0 blog
http://s... :: Ruby web scraping framework
http://rubykitch... :: The indexed archive of all things Ruby

19 Answers

Keith Fahlgren

4/4/2007 3:31:00 AM

On 4/3/07, Peter Szinek <peter@rubyrailways.com> wrote:
> I am looking for a fast XML parser, other than libxml (REXML is not fast
> enough, and Hpricot won't do this time - I need 'real' XPaths etc).

libxml is a mature C library and quite fast, but is (by default)
DOM-based (as is REXML). If you're really looking for speed, you'll go
with a streaming approach (SAX or otherwise, potentially from libxml).
What sort of "real" XPaths do you need? XPath 1.0? 2.0?
Deep-lookahead/behind? Do you have huge source documents?
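
To make the DOM-versus-streaming distinction concrete, here is a minimal sketch of the streaming style using REXML's own stream listener (not libxml). The `ItemListener` class and the sample `<items>` document are made up for illustration; only `REXML::StreamListener` and `REXML::Document.parse_stream` come from REXML itself.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects the text of every <item> element
# as the parser walks the input, without building a tree in memory.
class ItemListener
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @in_item = false
  end

  # Called for each opening tag.
  def tag_start(name, _attrs)
    return unless name == 'item'
    @in_item = true
    @titles << String.new
  end

  # Called for each run of character data.
  def text(data)
    @titles.last << data if @in_item
  end

  # Called for each closing tag.
  def tag_end(name)
    @in_item = false if name == 'item'
  end
end

xml = '<items><item>one</item><item>two</item></items>'
listener = ItemListener.new
REXML::Document.parse_stream(xml, listener)
```

The trade-off is that no document object exists afterwards, so XPath queries are not available in this style.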

Keith

Peter Szinek

4/4/2007 8:53:00 AM

Keith Fahlgren wrote:

> libxml is a mature C library and quite fast, but is (by default)
> DOM-based (as is REXML).

Sorry, I did not express myself clearly. I definitely need a DOM-based
approach, but REXML is a lot slower than libxml, and libxml can be a
PITA to install on some platforms/distros (e.g. it took quite some time
on my ubuntu box, because neither gem install nor apt-get wanted to
install the newest version which I needed).

The catch is that I would like to use this in my web scraping framework,
scRUBYt! - and of course dependency on libxml would mean that everybody
who would like to install scRUBYt!, would have to install libxml too. I
got tons of support requests from ubuntu users who have had problems
installing mechanize on ubuntu (it depends on libssl-ruby there),
so I guess this number would be much higher in the case of libxml which
has much more funky dependencies.

If there is no better possibility, I will go with libxml despite this
(this is my only concern, otherwise libxml is fine) - but it would be
better to have something install-friendly...

> What sort of "real" XPaths do you need? XPath 1.0? 2.0?
Real in the sense that it is not Hpricot XPath, which ATM can not even do

/my/stuff/is/@cool

not to talk about more complex expressions.

I guess XPath 1.0 would be completely enough (maybe even Hpricot's, with
a few additions) - I really don't need anything complicated.
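
For reference, an attribute-selecting expression of that shape can be evaluated with REXML's XPath support, roughly as sketched below. The sample document is invented to match the path; this only shows what the query is expected to return, not how fast it runs.

```ruby
require 'rexml/document'

# Made-up document whose structure matches /my/stuff/is/@cool.
xml = '<my><stuff><is cool="yes"/></stuff></my>'
doc = REXML::Document.new(xml)

# XPath.first returns a REXML::Attribute for an attribute step,
# or nil when nothing matches.
attribute = REXML::XPath.first(doc, '/my/stuff/is/@cool')
cool = attribute && attribute.value

# A predicate on the same attribute, selecting the elements instead.
matches = REXML::XPath.match(doc, '//is[@cool="yes"]')
```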

> Deep-lookahead/behind? Do you have huge source documents?
Well, I am actually first building this document from what I have
scraped, so I have pretty much control over it (if it is too big, I just
say stop and put the other records to a new doc etc.) so this is not
really the problem.
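
The "stop and start a new doc" idea could look roughly like the sketch below. It assumes that "too big" is measured as a record count and that each scraped record is a flat hash; both are assumptions for illustration, as are the names `MAX_RECORDS_PER_DOC` and `build_documents` and the `<records>`/`<record>` element names.

```ruby
require 'rexml/document'

MAX_RECORDS_PER_DOC = 2 # hypothetical limit, chosen only for the example

# Builds one REXML document per chunk of at most `limit` records.
def build_documents(records, limit = MAX_RECORDS_PER_DOC)
  records.each_slice(limit).map do |chunk|
    doc = REXML::Document.new
    root = doc.add_element('records')
    chunk.each do |record|
      element = root.add_element('record')
      # One child element per field of the record.
      record.each { |key, value| element.add_element(key.to_s).text = value.to_s }
    end
    doc
  end
end

records = [{ title: 'a' }, { title: 'b' }, { title: 'c' }]
docs = build_documents(records)
```

Capping the size this way keeps each DOM small, which is why huge source documents are not the concern here.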

I really just need a fast XML parser which is easy to install, that's
all. scRUBYt! is a high-level framework, aimed also at non-programmers,
so I can not expect that all my potential users are handy with debian's
package policy and the joys of libxml installing on win32 :)

Cheers,
Peter

Robert Klemme

4/4/2007 9:35:00 AM

On 04.04.2007 10:53, Peter Szinek wrote:
> I really just need a fast XML parser which is easy to install, that's
> all. scRUBYt! is a high-level framework, aimed also at non-programmers,
> so I can not expect that all my potential users are handy with debian's
> package policy and the joys of libxml installing on win32 :)

Maybe then you'll simply have to decide whether ease of use or
performance is more important to you.

Kind regards

robert

Peter Szinek

4/4/2007 10:01:00 AM

Robert Klemme wrote:
> On 04.04.2007 10:53, Peter Szinek wrote:
>> I really just need a fast XML parser which is easy to install, that's
>> all. scRUBYt! is a high-level framework, aimed also at
>> non-programmers, so I can not expect that all my potential users are
>> handy with debian's package policy and the joys of libxml installing
>> on win32 :)
>
> Maybe then you'll simply have to decide whether ease of use or
> performance is more important to you.

Should I interpret this as 'decide between REXML and libxml'?
There are really no other alternatives?

Cheers,
Peter

Robert Klemme

4/4/2007 11:14:00 AM

On 04.04.2007 12:00, Peter Szinek wrote:
> Robert Klemme wrote:
>> On 04.04.2007 10:53, Peter Szinek wrote:
>>> I really just need a fast XML parser which is easy to install, that's
>>> all. scRUBYt! is a high-level framework, aimed also at
>>> non-programmers, so I can not expect that all my potential users are
>>> handy with debian's package policy and the joys of libxml installing
>>> on win32 :)
>>
>> Maybe then you'll simply have to decide whether ease of use or
>> performance is more important to you.
>
> Should I interpret this as 'decide between REXML and libxml'?
> There are really no other alternatives?

AFAIK REXML is the only pure Ruby XML parser - and it comes with the
standard distribution. All others will likely have similar issues as
libxml I guess.

robert

James Gray

4/4/2007 12:47:00 PM

On Apr 4, 2007, at 6:15 AM, Robert Klemme wrote:

> On 04.04.2007 12:00, Peter Szinek wrote:
>> Robert Klemme wrote:
>>> On 04.04.2007 10:53, Peter Szinek wrote:
>>>> I really just need a fast XML parser which is easy to install,
>>>> that's all. scRUBYt! is a high-level framework, aimed also at
>>>> non-programmers, so I can not expect that all my potential users
>>>> are handy with debian's package policy and the joys of libxml
>>>> installing on win32 :)
>>>
>>> Maybe then you'll simply have to decide whether ease of use or
>>> performance is more important to you.
>> Should I interpret this as 'decide between REXML and libxml'?
>> There are really no other alternatives?
>
> AFAIK REXML is the only pure Ruby XML parser - and it comes with
> the standard distribution.

Sounds like it is time for FasterXML. :)

James Edward Gray II

Keith Fahlgren

4/4/2007 1:35:00 PM

On 4/4/07, Peter Szinek <peter@rubyrailways.com> wrote:
> > libxml is a mature C library and quite fast, but is (by default)
> > DOM-based (as is REXML).
>
> Sorry, I did not express myself clearly. I definitely need a DOM-based
> approach, but REXML is a lot slower than libxml, and libxml can be a
> PITA to install on some platforms/distros (e.g. it took quite some time
> on my ubuntu box, because neither gem install nor apt-get wanted to
> install the newest version which I needed).

Yeah, you're right about libxml being a pain to install. If you hadn't
cared about installability, I was going to suggest JRuby + (some Java
parser)....

> I guess XPath 1.0 would be completely enough (maybe even Hpricot's, with
> a few additions) - I really don't need anything complicated.

Yeah, sorry that I don't know of any others.

JEG II wrote:
> Sounds like it is time for FasterXML. :)

Know of any good starting points? All the XPath 1.0 work I do is off
of libxml and all of the XPath 2.0 is off of Saxon (Java), so I'm not
sure what should be copied.

Keith

James Gray

4/4/2007 2:00:00 PM

On Apr 4, 2007, at 8:34 AM, Keith Fahlgren wrote:

> JEG II wrote:
>> Sounds like it is time for FasterXML. :)
>
> Know of any good starting points? All the XPath 1.0 work I do is off
> of libxml and all of the XPath 2.0 is off of Saxon (Java), so I'm not
> sure what should be copied.

Not really. I was mostly just making a joke about FasterCSV's name
and how it was born.

I do think it's possible to get better performance than REXML offers
without resorting to C, though C would be faster still, naturally. I
do have some ideas about this, but I haven't actually spent the time
to see if I could get a prototype running to prove them.

James Edward Gray II

Dennis Ranke

4/4/2007 2:01:00 PM

James Edward Gray II wrote:
> On Apr 4, 2007, at 6:15 AM, Robert Klemme wrote:
>
>> On 04.04.2007 12:00, Peter Szinek wrote:
>>> Robert Klemme wrote:
>>>> On 04.04.2007 10:53, Peter Szinek wrote:
>>>>> I really just need a fast XML parser which is easy to install,
>>>>> that's all. scRUBYt! is a high-level framework, aimed also at
>>>>> non-programmers, so I can not expect that all my potential users
>>>>> are handy with debian's package policy and the joys of libxml
>>>>> installing on win32 :)
>>>>
>>>> Maybe then you'll simply have to decide whether ease of use or
>>>> performance is more important to you.
>>> Should I interpret this as 'decide between REXML and libxml'?
>>> There are really no other alternatives?
>>
>> AFAIK REXML is the only pure Ruby XML parser - and it comes with the
>> standard distribution.
>
> Sounds like it is time for FasterXML. :)

One pointer: REXML comes with quite a fast pullparser, and it should be
possible to base some lightweight xml document lib on that. (The
documentation says that the API should not be considered stable, but I'm
sure that could be resolved with the REXML author.)

As a proof of concept, see the attached code. We use it in our company
to load and process XML files generated by our tools and OpenOffice Calc.
I just tested it on a 1MB XML from an .ods file, which it loaded
successfully in < 2 seconds.

Writing a fast XPath implementation to match this might be quite a
challenge, though. ;)

Dennis

require 'rexml/parsers/pullparser'

module XmlSimple
  def self.load(filename)
    parse(File.read(filename))
  end

  def self.parse(string)
    parser = REXML::Parsers::PullParser.new(string)
    return Node.new(['root', {}], parser)
  end

  class Node
    include Enumerable

    instance_methods(true).each {|m| undef_method(m) unless m =~ /__.*__/}
    attr_reader :name, :attr, :text, :children

    def initialize(token, parser)
      @name = token[0]
      @text = ''
      @siblings = [self]
      @attr = token[1]
      @nodes = {}
      @children = []
      loop do
        if parser.has_next?
          tok = parser.pull
        else
          tok = REXML::Parsers::PullEvent.new([:end_element, 'root'])
        end
        case tok.event_type
        when :start_element
          node = Node.new(tok, parser)
          @children << node
          if @nodes[tok[0]]
            @nodes[tok[0]].push_sibling(node)
          else
            @nodes[tok[0]] = node
          end
        when :end_element
          raise unless tok[0] == @name
          return
        when :text
          @text << tok[0]
          @children << tok[0]
        end
      end
    end

    def push_sibling(node)
      @siblings << node
    end

    def to_a
      @siblings
    end

    def each(&block)
      @siblings.each(&block)
    end

    def method_missing(m)
      return @nodes[m.to_s]
    end

    def [](m)
      return @nodes[m]
    end

    def inspect(indent = '')
      r = indent + @name + ":\n"
      indent += ' '
      r << indent + 'attr: ' + attr.inspect + "\n" unless attr.empty?
      r << indent + 'text: ' + text.inspect + "\n" unless text.empty?
      @nodes.each do |k, v|
        v.each {|n| r << n.inspect(indent)}
      end
      return r
    end
  end
end
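
The pull-parser loop that Dennis's code is built around can also be seen in isolation. The sketch below only counts element names and is not part of his attachment; the `element_counts` helper and the sample input are invented for illustration, while `has_next?`, `pull`, and `start_element?` are the REXML pull-parser calls.

```ruby
require 'rexml/parsers/pullparser'

# Counts how often each element name occurs, pulling one event at a time.
def element_counts(xml)
  parser = REXML::Parsers::PullParser.new(xml)
  counts = Hash.new(0)
  while parser.has_next?
    event = parser.pull
    # For a start_element event, event[0] is the tag name.
    counts[event[0]] += 1 if event.start_element?
  end
  counts
end

counts = element_counts('<table><row><cell/><cell/></row><row/></table>')
```

Because the caller decides when to pull the next event, it is straightforward to build a custom tree on top of this loop, which is what the `Node` class in the attachment does recursively.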

Arto Bendiken

4/4/2007 10:04:00 PM

On Apr 4, 12:00 pm, Peter Szinek <p...@rubyrailways.com> wrote:
>
> Should I interpret this as 'decide between REXML and libxml'?
> There are really no other alternatives?

You may find Tim Bray's recent in-depth experiments in attempting to
write a fast pure-Ruby XML parser instructive and informative:

http://www.tbray.org/ongoing/When/200x/2006/11/09/Optim...
http://www.tbray.org/ongoing/When/200x/2006/11/1...

--
Arto Bendiken | http://ben...

comp.lang.ruby
