Asp Forum - Stream Parsing with REXML

Marc Hoeppner

1/12/2008 4:58:00 PM

Hi (again, sort of) :)

I am still on my quest to write a program that parses a large XML file.
After having tried to do it in tree mode, I had to realize that the
performance was simply abysmal. So back to the drawing board. But, and
here is the thing...I could find a good straight-forward tutorial on how
to write a stream parser using REXML. The official tutorial is pretty
much mute on that part and the only other example I found (or rather was
pointed to -
http://www.janvereecken.com/2007/4/11/event-driven-xml-pars...)
was way too complex for someone like me who is still pretty much a
beginner in ruby.

So, what I am looking for is either a brief description of how to write
an event driven parser or else a link to a good and simple tutorial.

For the former, this is what the parser should do:

Find the element "Gene-ref", allow me to access its children and then
close and repeat for the next "Gene-ref entry. In xml code, that would
look like

<something here>
<Gene-ref>
<name>...</name>
<start>...</start>
<end>...</end>
</Gen-ref>
<Gen-ref>
...

I understand that I need a Listener class, like
classListener
def tag_start(name, attrs)
end
def text(text)
end
def tag_end(name)
end
end

But I havent really worked with classes all that much and maybe someone
could just put down the basics for the script from where I can start
experimenting? Would be very much appreciated. Let's say for each
element "Gene-ref" I want to puts the name, start and end in one line,
or something along those lines.

Cheers,

Marc
--
Posted via http://www.ruby-....

12 Answers

dusty

1/12/2008 6:52:00 PM

I have found that hpricot works very well with large xml files. It
streams it automatically. I use it to parse a 5M xml file and it does
it rather quickly.

http://code.whytheluckystiff.net/do...
http://code.whytheluckystiff.ne...

Bob Hutchison

1/13/2008 3:50:00 AM

Hi Marc and Dusty,

On 12-Jan-08, at 1:55 PM, dusty wrote:

> I have found that hpricot works very well with large xml files. It
> streams it automatically. I use it to parse a 5M xml file and it does
> it rather quickly.

Dusty, you sure hpricot is streaming?

At any rate, it is a lot faster than any pure Ruby XML parser. Marc,
if performance is your only problem hpricot is worth looking at. It'll
only take a few seconds to know, it's something like 4 lines of Ruby
to find out:

require 'rubygems'
require 'hpricot'

doc = open(<your file name here>) do |f|
Hpricot.XML(f)
end

If that works well enough you're probably set. Hpricot's search and
query stuff is pretty nice.

If you still have problems, let us know. If you really need streaming
you have a few options.

>
>
> http://code.whytheluckystiff.net/do...
> http://code.whytheluckystiff.ne...
>

Cheers,
Bob

----
Bob Hutchison -- tumblelog at http://www.recurs...
Recursive Design Inc. -- weblog at http://www.recursiv...
http://www.rec... -- works on http://www.raconteur.info/cms-for-static-con...

Phrogz

1/13/2008 5:24:00 AM

On Jan 12, 9:58 am, Marc Hoeppner <marc.hoepp...@molbio.su.se> wrote:
> Hi (again, sort of) :)
>
> I am still on my quest to write a program that parses a large XML file.
> After having tried to do it in tree mode, I had to realize that the
> performance was simply abysmal. So back to the drawing board. But, and
> here is the thing...I could find a good straight-forward tutorial on how
> to write a stream parser using REXML. The official tutorial is pretty
> much mute on that part and the only other example I found (or rather was
> pointed to -http://www.janvereecken.com/2007/4/11/event-driven-xml-pars...)
> was way too complex for someone like me who is still pretty much a
> beginner in ruby.
>
> So, what I am looking for is either a brief description of how to write
> an event driven parser or else a link to a good and simple tutorial.

xml = <<ENDXML
oh hai
<root>
<kid1 foo="bar" jim="jam" />
<kid2 foo="sausage">
<grandkid1>I ate <raw> sausages&tm;!</grandkid1>
<grandkid2>
<![CDATA[
I ate <raw> sausages&tm;!
]]>
</grandkid2>
</kid2>
</root>
whoa!
ENDXML

class Listener
def initialize
# Use instance variables to mantain information
# between different callbacks.
@nest_level = 0
end

def tag_start( name, attr_hash )
@nest_level += 1
print " " * @nest_level
puts "<#{name} ...> #{attr_hash.inspect}"
end

def text( str )
print " " * @nest_level
p str
end

# Treat CDATA sections just like text
alias_method :cdata, :text

def tag_end( name )
print " " * @nest_level
puts "</#{name}>"
@nest_level -= 1
end
end

require 'rexml/document'
require 'stringio' # for this demo
stream_handler = Listener.new
pseudo_file = StringIO.new( xml )
REXML::Document.parse_stream( pseudo_file, stream_handler )

#=> OUTPUT:
#=> "oh hai\n"
#=> <root ...> {}
#=> "\n "
#=> <kid1 ...> {"jim"=>"jam", "foo"=>"bar"}
#=> </kid1>
#=> "\n "
#=> <kid2 ...> {"foo"=>"sausage"}
#=> "\n "
#=> <grandkid1 ...> {}
#=> "I ate <raw> sausages&tm;!"
#=> </grandkid1>
#=> "\n "
#=> <grandkid2 ...> {}
#=> "\n "
#=> "\n I ate <raw> sausages&tm;!\n "
#=> "\n "
#=> </grandkid2>
#=> "\n "
#=> </kid2>
#=> "\n"
#=> </root>
#=> "\nwhoa!\n"

Phrogz

1/13/2008 5:37:00 AM

On Jan 12, 9:58 am, Marc Hoeppner <marc.hoepp...@molbio.su.se> wrote:
> Find the element "Gene-ref", allow me to access its children and then
> close and repeat for the next "Gene-ref entry. In xml code, that would
> look like
>
> <something here>
> <Gene-ref>
> <name>...</name>
> <start>...</start>
> <end>...</end>
> </Gen-ref>
> <Gen-ref>
> ...

Here's a more targeted example, since I'm feeling helpful:

class Gene
attr_accessor :name, :start, :end
def initialize( name=nil, start=nil, endd=nil )
@name = name
self.start = start
self.end = endd
end
def start=( start )
@start = start.to_i
end
def end=( endd )
@end = endd.to_i
end
def duration
@end - @start
end
def to_s
"<Gene '#{@name}' #{@start}-#{@end} (#{duration})> "
end
end

class GeneParser
GENE_TAG_NAME = "Gene-ref"
def tag_start( name, attributes )
case name
when "root"
# ignore
when GENE_TAG_NAME
@current_gene = Gene.new
else
@current_property = name
end
end

def text( str )
if @current_gene && @current_property
@current_gene.send( @current_property.to_s + "=", str )
end
end

def tag_end( name )
if name == GENE_TAG_NAME
puts @current_gene
@current_gene = nil
else
@current_property = nil
end
end
end

require 'rexml/document'
REXML::Document.parse_stream( DATA, GeneParser.new )
#=> OUTPUT:
#=> <Gene 'foo' 17-42 (25)>
#=> <Gene 'bar' 43-50 (7)>

__END__
<root>
<Gene-ref>
<name>foo</name>
<start>17</start>
<end>42</end>
</Gene-ref>
<Gene-ref>
<name>bar</name>
<start>43</start>
<end>50</end>
</Gene-ref>
</root>

Marc Hoeppner

1/13/2008 12:38:00 PM

Thanks for the great responses!

Just for clarification though:

tag_start identifies an element like "gene" ala

def tag_start(name, attrib)
if name=="gene"
do something here
end
end

So tag_end is then used if I want to puts everything that was done with
the element and its children like storing some values in an array or
something?

/Marc
--
Posted via http://www.ruby-....

Robert Klemme

1/13/2008 6:02:00 PM

On 13.01.2008 13:37, Marc Hoeppner wrote:
> Thanks for the great responses!
>
> Just for clarification though:
>
> tag_start identifies an element like "gene" ala
>
> def tag_start(name, attrib)
> if name=="gene"
> do something here
> end
> end
>
> So tag_end is then used if I want to puts everything that was done with
> the element and its children like storing some values in an array or
> something?

Yeah, you could do that. Basically it's completely up to you. The
parser just hands off events for parsed XML items (like starting tags,
closing tags, text data etc.) in the order found in the document.

I have attached another example of an idiom that I use frequently. This
may not be needed in your case but who knows? Basic idea is that the
event listener creates nested listeners (one per element) and hands
processing off to them while maintaining a stack of listeners. That way
you can do different processing based on element name, attributes or
whatever. Might be overkill for your simple example but OTOH if you do
need to do complex processing steps based on elements this might be
exactly what you need. For example, you can store any information you
need in a nested listener and do all the processing in end tag.

Kind regards

robert
#!/bin/env ruby

# Robert Klemme 2007

require 'rexml/document'
require 'rexml/streamlistener'
require 'delegate'

class StreamListener < Delegator

def initialize
@current = NestedStreamListener.new
end

def tag_start(name, attrs)
sl = listener(name, attrs).new
sl.parent = @current
@current = sl
sl.tag_start(name, attrs)
end

def tag_end(name)
@current.tag_end(name)
@current = @current.parent
end

def __getobj__
@current or raise "Cannot handle beyond root"
end

private
def listener(name, attrs)
# more complex code here for specific
# nested listeners
NestedStreamListener
end
end

class BaseNestedStreamListener
include REXML::StreamListener

# structure
attr_accessor :parent

protected

# parents from the root
def parents
res = []
sl = self

while sl
res.unshift sl
sl = sl.parent
end

res
end
end

class NestedStreamListener < BaseNestedStreamListener

attr_accessor :tag_name

def tag_start(name, attrs)
# demo only
self.tag_name = name

parents.each do |par|
print par.tag_name, " "
end

puts

@text = ""
end

def tag_end(name)
print " ", @text.inspect, "\n"
end

def text(s)
(@text ||= "") << s
end

alias cdata text

end

REXML::Document.parse_stream( DATA, StreamListener.new )

__END__
<root>
<Gene-ref>
<name>foo</name>
<start>17</start>
<end>42</end>
</Gene-ref>
<Gene-ref>
<name>bar</name>
<start>43</start>
<end>50</end>
</Gene-ref>
</root>

Bob Hutchison

1/13/2008 7:55:00 PM

Hi Marc, Robert:

On 13-Jan-08, at 1:04 PM, Robert Klemme wrote:

> I have attached another example of an idiom that I use frequently.
> This may not be needed in your case but who knows? Basic idea is
> that the event listener creates nested listeners (one per element)
> and hands processing off to them while maintaining a stack of
> listeners. That way you can do different processing based on
> element name, attributes or whatever. Might be overkill for your
> simple example but OTOH if you do need to do complex processing
> steps based on elements this might be exactly what you need. For
> example, you can store any information you need in a nested listener
> and do all the processing in end tag.

That idiom is really frequent in the stuff that I do with XML. It also
happens to be a case where a pull parser can make things clearer. Pull
parsers are event based like SAX parsers are, they do not build up
trees in memory, and they are very fast. The code below is almost
equivalent to yours Robert, as far as I could make it anyway. The
parents method is different, and I habitually add a node level
corresponding to the document since there are things (comments,
processing instructions, etc) that are allowed outside of the root
element of an XML file.

I wrote that xampl-pp parser in May 2002. It is on source forge <http://sourceforge.net/projects...
> but not as a gem. Hmm, maybe I should do something about that, I've
made some fixes to very obscure problems over the years, but never
released anything. Anyway the point isn't xampl-pp so much as what a
pull parser looks like. Many people don't know about them and if they
do they often don't know why they are useful. I think the libxml2
library does pull parsing (and that'll be very fast) and I think REXML
does too (it has a few problems with namespaces last I looked).
Anyway...

All the interesting stuff happens in the work method.

#!/bin/env ruby

require "xampl-pp"

class Thing

attr_accessor :tag_name
attr_accessor :parent
attr_accessor :text

def initialize(parent, name)
@parent = parent
@tag_name = name
@text = ""
end

def Thing.start(xml)
pp = Xampl_PP.new
pp.input = xml

Thing.new(nil, 'doc').work(pp)
end

def parents
p = self
while p
yield p
p = p.parent
end
end

def work(pp)
parents do |par|
print par.tag_name, " "
end
puts

while not pp.endDocument? do
case pp.nextEvent
when Xampl_PP::START_ELEMENT
Thing.new(self, pp.name).work(pp)
when Xampl_PP::END_ELEMENT
print " ", text.inspect, "\n"
return
when Xampl_PP::TEXT || Xampl_PP::CDATA_SECTION ||
Xampl_PP::ENTITY_REF then
text << pp.text
end
end
end
end

Thing.start(DATA)

__END__
<root>
<Gene-ref>
<name>foo</name>
<start>17</start>
<end>42</end>
</Gene-ref>
<Gene-ref>
<name>bar</name>
<start>43</start>
<end>50</end>
</Gene-ref>
</root>

----
Bob Hutchison -- tumblelog at http://www.recurs...
Recursive Design Inc. -- weblog at http://www.recursiv...
http://www.rec... -- works on http://www.raconteur.info/cms-for-static-con...

Robert Klemme

1/14/2008 7:46:00 AM

On 13.01.2008 20:54, Bob Hutchison wrote:
> Hi Marc, Robert:
>
> On 13-Jan-08, at 1:04 PM, Robert Klemme wrote:
>
>> I have attached another example of an idiom that I use frequently.
>> This may not be needed in your case but who knows? Basic idea is
>> that the event listener creates nested listeners (one per element)
>> and hands processing off to them while maintaining a stack of
>> listeners. That way you can do different processing based on
>> element name, attributes or whatever. Might be overkill for your
>> simple example but OTOH if you do need to do complex processing
>> steps based on elements this might be exactly what you need. For
>> example, you can store any information you need in a nested listener
>> and do all the processing in end tag.
>
> That idiom is really frequent in the stuff that I do with XML. It also
> happens to be a case where a pull parser can make things clearer. Pull
> parsers are event based like SAX parsers are, they do not build up
> trees in memory, and they are very fast. The code below is almost
> equivalent to yours Robert, as far as I could make it anyway. The
> parents method is different,

I considered that as well but wanted to be able to traverse parent nodes
top down for the example. After all it was just a quick hack. :-) For
more robustness and library fitness I would certainly change a few things.

> and I habitually add a node level
> corresponding to the document since there are things (comments,
> processing instructions, etc) that are allowed outside of the root
> element of an XML file.

If you look closely that happens in my version as well (see
StreamListener#initialize and #tag_start).

> I wrote that xampl-pp parser in May 2002. It is on source forge <http://sourceforge.net/projects...
> > but not as a gem. Hmm, maybe I should do something about that, I've
> made some fixes to very obscure problems over the years, but never
> released anything. Anyway the point isn't xampl-pp so much as what a
> pull parser looks like. Many people don't know about them and if they
> do they often don't know why they are useful. I think the libxml2
> library does pull parsing (and that'll be very fast) and I think REXML
> does too (it has a few problems with namespaces last I looked).
> Anyway...
>
> All the interesting stuff happens in the work method.
>
> #!/bin/env ruby
>
> require "xampl-pp"
>
> class Thing
>
> attr_accessor :tag_name
> attr_accessor :parent
> attr_accessor :text
>
> def initialize(parent, name)
> @parent = parent
> @tag_name = name
> @text = ""
> end
>
> def Thing.start(xml)
> pp = Xampl_PP.new
> pp.input = xml
>
> Thing.new(nil, 'doc').work(pp)
> end
>
> def parents
> p = self
> while p
> yield p
> p = p.parent
> end
> end
>
> def work(pp)
> parents do |par|
> print par.tag_name, " "
> end
> puts
>
> while not pp.endDocument? do
> case pp.nextEvent
> when Xampl_PP::START_ELEMENT
> Thing.new(self, pp.name).work(pp)
> when Xampl_PP::END_ELEMENT
> print " ", text.inspect, "\n"
> return
> when Xampl_PP::TEXT || Xampl_PP::CDATA_SECTION ||
> Xampl_PP::ENTITY_REF then
> text << pp.text
> end
> end
> end
> end
>
> Thing.start(DATA)

Yep, looks pretty similar. To be honest I never used an XML pull parser
myself. Personally I prefer the push parser a bit because it avoids the
looping and decision logic based on event and element type (in #work).
Are there major advantages of pull parsers over push parsers that I have
overlooked so far?

Kind regards

robert

Marc Hoeppner

1/14/2008 12:08:00 PM

Ok, I am still on this and am starting to feel a bit stupid...I guess
there is something wrong with the following example, but maybe I should
step back for a while. Anyhow - what's missing to make it work:

The xml file:

<Gene-ref>
<name>Gene1</name>
<start>1</start>
<end>10</end>
</Gen-ref>
<Gene-ref>
<name>Gene2</name>
<start>11</start>
<end>20</end>
</Gen-ref>
<Gene-ref>
<name>Gene3</name>
<start>21</start>
<end>30</end>
</Gen-ref>
<Gene-ref>
<name>Gene4</name>
<start>31</start>
<end>40</end>
</Gen-ref>
<Gene-ref>
<name>Gene5</name>
<start>41</start>
<end>50</end>
</Gen-ref>

The code:

data = File.new(ARGV.shift)

class Listener

def tag_start( name, attr_hash )
if name == 'Gene-ref'
@name = name
end
end

def text( str )

end

def tag_end( name )
if name == 'Gene-ref'
puts @name
@name = nil
end
end
end

require 'rexml/document'
REXML::Document.parse_stream( data, Listener.new )
--
Posted via http://www.ruby-....

Robert Klemme

1/14/2008 12:24:00 PM

2008/1/14, Marc Hoeppner <marc.hoeppner@molbio.su.se>:
> Ok, I am still on this and am starting to feel a bit stupid...I guess
> there is something wrong with the following example, but maybe I should
> step back for a while. Anyhow - what's missing to make it work:
>
> The xml file:
>
> <Gene-ref>
> <name>Gene1</name>
> <start>1</start>
> <end>10</end>
> </Gen-ref>
> <Gene-ref>
> <name>Gene2</name>
> <start>11</start>
> <end>20</end>
> </Gen-ref>
> <Gene-ref>
> <name>Gene3</name>
> <start>21</start>
> <end>30</end>
> </Gen-ref>
> <Gene-ref>
> <name>Gene4</name>
> <start>31</start>
> <end>40</end>
> </Gen-ref>
> <Gene-ref>
> <name>Gene5</name>
> <start>41</start>
> <end>50</end>
> </Gen-ref>
>
> The code:
>
> data = File.new(ARGV.shift)
>
> class Listener
>
> def tag_start( name, attr_hash )
> if name == 'Gene-ref'
> @name = name
> end
> end
>
> def text( str )

You need to place the @name storing code here. But to make it work
you need to remember the current tag's name because you want to only
store "str" in @name if you are in <name>. This makes it necessary
that you add a stack to your listener class which gets pushed and
popped whenever tag_start and tag_end are invoked.

> end
>
> def tag_end( name )
> if name == 'Gene-ref'
> puts @name
> @name = nil
> end
> end
> end
>
>
> require 'rexml/document'
> REXML::Document.parse_stream( data, Listener.new )

Cheers

robert

--
use.inject do |as, often| as.you_can - without end

comp.lang.ruby

Stream Parsing with REXML

Marc Hoeppner

dusty

Bob Hutchison

Phrogz

Phrogz

Marc Hoeppner

Robert Klemme

Bob Hutchison

Robert Klemme

Marc Hoeppner

Robert Klemme

x Login to ForumsZone