Asp Forum - REXML libraries and parsing issues

BA

6/24/2005 12:55:00 AM

First off, let me say right up front that I am a newbie wrt Ruby.

I am trying to parse an XML file, but am having all kinds of
trouble. I am using the REXML libraries and the sax2parser/listener.
In the sax2listener, I can use the character/text part of the method,
but I cannot for the life of me figure out how to parse out JUST
WHAT I WANT. Here is what the file looks like:

<B110><DNUM><PDAT> this is the text I need </PDAT></DNUM></B110>

If I use :character, %w{PDAT} {|text| puts text} ... I get the text
"this is the text I need" printed out. If I use the B110 or any
combination, I cannot get it to work. Anyone know how to get the
sax2parser/listener to parse the file and allow me to be selective
about what I parse out of the file? Thanks for any/all help in this
endeavor!!!!!!!!!!

-Bob Angell-
angellr@mac.com

8 Answers

james_b

6/24/2005 1:50:00 AM

BA wrote:
> First off, let me say right up front that I am a newbie wrt Ruby.
>
> I am trying to parse an XML file, however, am having all kinds of
> trouble. I am using the REXML libraries and the sax2parser/listener.
> In the sax2listener, I can use the character/text part of the method,
> however, I cannot for the life of me figure out how to parse out JUST
> WHAT I WANT. Here is what the file looks like as follows:
>
> <B110><DNUM><PDAT> this is the text I need </PDAT></DNUM></B110>
>
> If I use :character, %w{PDAT} {|text| puts text} ... I get the text
> "this is the text I need" printed out. If I use the B110 or any
> combination, I cannot get it to work. Anyone know how to get the
> sax2parser/listener to parse the file and allow me to be selective about
> what I parse out of the file? Thanks for any/all help in this
> endeavor!!!!!!!!!!

What, exactly, do you want? To extract the text from the PDAT element?

How predictable is the XML?

Are the files as small as your example?

Are regular expressions an option? Or using a DOM and XPath?

How did you decide to use the listener?

James

--

http://www.ru... - The Ruby Documentation Site
http://www.r... - News, Articles, and Listings for Ruby & XML
http://www.rub... - The Ruby Store for Ruby Stuff
http://www.jame... - Playing with Better Toys

BA

6/24/2005 2:00:00 AM

Yes, I want to extract the PDAT element, but I want to use the
B110 tag to find this element. The XML *is* predictable, but
there are variations in the placement of the elements (there could be
several different address fields and/or many paragraphs that need to be
parsed/searched). The files are *extremely* large (some could be as
large as 1-2GB). I would prefer to do all of the processing in Ruby if
this is possible (I want to use the OO functionality for the text
processing I want to do) and would also like to incorporate regex if
possible. (I started doing this by parsing the file line by line, but
ran into malformed XML, at which point I decided that I needed to use
the database functionality of XML.) Not sure if DOM would work. Could
not get XPath to work. The listener was, quite frankly, a SWAG. Thanks.

Bucco

6/24/2005 3:42:00 AM

How about something like:

require 'rexml/document'
doc = REXML::Document.new(File.open('someXMLFile.xml'))
info = doc.elements["//B110/DNUM/PDAT"].text
puts info

SA :)

James Gray

6/24/2005 4:02:00 AM

On Jun 23, 2005, at 10:45 PM, Bucco wrote:

> How about something like:
>
> require 'rexml/document'
> doc = REXML::Document.new(File.open('someXMLFile.xml'))
> info = doc.elements["//B110/DNUM/PDAT"].text
> puts info

For 2 Gig files?! Good luck!

James Edward Gray II

Ryan Leavengood

6/24/2005 4:09:00 AM

James Edward Gray II said:
> On Jun 23, 2005, at 10:45 PM, Bucco wrote:
>
>> How about something like:
>>
>> require 'rexml/document'
>> doc = REXML::Document.new(File.open('someXMLFile.xml'))
>> info = doc.elements["//B110/DNUM/PDAT"].text
>> puts info
>
> For 2 Gig files?! Good luck!

Hahahaha, I must agree with this. Of course the irony is that if those 2
gig XML files were in YAML, they would only be 5 megs ;)

Ryan

james_b

6/24/2005 5:16:00 AM

Ryan Leavengood wrote:
> ...
> Hahahaha, I must agree with this. Of course the irony is that if those 2
> gig XML files were in YAML, they would only be 5 megs ;)

How *does* one do stream parsing and partial reading of YAML files?
Is there a SAY parser?

James
--

http://www.ru... - The Ruby Documentation Site
http://www.r... - News, Articles, and Listings for Ruby & XML
http://www.rub... - The Ruby Store for Ruby Stuff
http://www.jame... - Playing with Better Toys

mathew

6/27/2005 5:49:00 PM

BA wrote:
> Yes, I want to extract the PDAT element, however, I want to use the B110
> tag to find this element. The XML *is* predictable, however, there are
> variations in the placement of the elements (there could be several
> different address fields and/or many paragraphs that need to be
> parsed/searched). The files are *extremely* large (some could be as
> large as 1-2GB).

Any time you're faced with a huge XML file and you only want to get
small pieces of it, you should think about using a stream parser, a la
Java's SAX2. The idea behind a stream parser is that the parser runs
through the entire file exactly once, and never has to seek forwards or
backwards; it throws the data to you in whatever size chunks are most
convenient for it. This means the parser does as little work as
possible, which means so long as your code is efficient, the end result
should be maximally efficient.

The downside is that you have to do a bit more work. For example,
depending on how the parser buffers things internally, it might send you
a piece of text inside an XML element in two or more pieces, and expect
you to glue them together. It's also up to you to deal with any
position-based restrictions on which elements you're interested in.

I assumed you wanted the text inside any PDAT element that was
*somewhere* inside a B110 element, so I simply track which elements are
currently "open" and by how many levels. Here's the code.

---
require 'rexml/document'
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

class MyListener
  # Supplies do-nothing handlers for the events we don't care about
  # (XML declaration, comments, etc.), so the parser can call them safely
  include REXML::StreamListener

  def initialize
    # Hash to record which elements we are inside at any given moment, and
    # how many of them we are inside
    @inside = Hash.new
    @textbuffer = ''
  end

  def tag_start(name, attrs)
    if @inside[name]
      @inside[name] += 1
    else
      @inside[name] = 1
    end
  end

  def text(text)
    # Only collect text while inside both a B110 and a PDAT
    if @inside['B110'] and @inside['PDAT']
      @textbuffer += text
    end
  end

  def tag_end(name)
    if name == 'PDAT'
      # Output the text if we just closed a PDAT inside a B110
      if @inside['B110'] and @inside['PDAT']
        puts @textbuffer
      end
      # Clear the buffer any time we close a PDAT
      @textbuffer = ''
    end
    # Decrement count, set to nil if zero
    # so @inside['foo'] works as a boolean
    if @inside[name] == 1
      @inside[name] = nil
    else
      @inside[name] -= 1
    end
  end
end

listener = MyListener.new
# The block form closes the file once parsing is done
File.open("mydoc.xml") do |source|
  REXML::Document.parse_stream(source, listener)
end
---

Here's a sample file:

---

<FOO>
<B110><B110>
<SOMETHINGELSE>
<PDAT>This is the text you want</PDAT>This isn't.
</SOMETHINGELSE>
</B110>
<PDAT>This is sneaky good text</PDAT>
</B110>
<PDAT>This is bad text</PDAT>
</FOO>

---

Output:

---

This is the text you want
This is sneaky good text

---

Note that this uses the native REXML stream parser API, not the SAX2
clone, because the SAX2 clone is slower according to the documentation.
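
If you would rather stay with the sax2parser you started with, the same idea can be sketched there too. This is untested against your real data and rests on a few assumptions: the method name pdat_texts_in_b110 is made up for illustration, the event symbol is :characters (not :character), and a PDAT never contains another PDAT.

```ruby
require 'rexml/parsers/sax2parser'

# Sketch: collect the text of every PDAT that sits somewhere inside a B110.
# pdat_texts_in_b110 is a hypothetical helper name, not part of REXML.
def pdat_texts_in_b110(source)
  parser = REXML::Parsers::SAX2Parser.new(source)
  b110_depth = 0   # how many B110 elements are currently open
  buffer = nil     # non-nil only while inside a PDAT we want
  results = []

  parser.listen(:start_element, %w{B110}) { |uri, local, qname, attrs| b110_depth += 1 }
  parser.listen(:end_element, %w{B110}) { |uri, local, qname| b110_depth -= 1 }

  parser.listen(:start_element, %w{PDAT}) do |uri, local, qname, attrs|
    buffer = String.new if b110_depth > 0
  end
  parser.listen(:characters, %w{PDAT}) do |text|
    buffer << text if buffer
  end
  parser.listen(:end_element, %w{PDAT}) do |uri, local, qname|
    results << buffer if buffer
    buffer = nil
  end

  parser.parse
  results
end
```

As far as I can tell, the name filter on a :characters listener is compared against the innermost open element only. That would explain why filtering on B110 gave you nothing: the text sits directly inside PDAT, so B110 has to be tracked separately with its own start/end listeners.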

Disclaimers:

The above code is only lightly tested. Although a stream parser should
theoretically be the fastest option, I haven't actually benchmarked it
against (say) the pull parser. (Which is also documented as having an
unstable API, so personally I'd avoid it anyway.)

The above code will break if you have a PDAT somewhere inside a PDAT.
I'm assuming that's not allowed. If it is, you'll have to make your text
buffer be a stack of strings rather than a simple string, append to
@textbuffer[@inside['PDAT']], and take the performance hit.
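
For what it's worth, here is a rough sketch of that variant. It uses a plain stack (push on open, pop on close) instead of indexing by @inside['PDAT'], which should amount to the same thing. The class name NestedPdatListener and the results accessor are made up for illustration, and I'm assuming text should be credited only to the innermost open PDAT.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Sketch: like MyListener, but tolerates a PDAT nested inside a PDAT.
# NestedPdatListener is a hypothetical name, not part of REXML.
class NestedPdatListener
  include REXML::StreamListener
  attr_reader :results

  def initialize
    @b110_depth = 0   # number of currently open B110 elements
    @buffers = []     # one String per currently open PDAT
    @results = []
  end

  def tag_start(name, attrs)
    @b110_depth += 1 if name == 'B110'
    @buffers.push(String.new) if name == 'PDAT'
  end

  def text(text)
    # Assumption: text belongs to the innermost open PDAT only
    @buffers.last << text if @b110_depth > 0 && !@buffers.empty?
  end

  def tag_end(name)
    @b110_depth -= 1 if name == 'B110'
    if name == 'PDAT'
      finished = @buffers.pop
      @results << finished if @b110_depth > 0
    end
  end
end

listener = NestedPdatListener.new
REXML::Document.parse_stream('<B110><PDAT>a<PDAT>b</PDAT>c</PDAT></B110>', listener)
p listener.results
```

The inner PDAT closes first, so its text comes out before the outer one's.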

Also, if you need to do more elaborate selection of which elements to
process, you'll obviously need to make changes to how the current
position is tracked... e.g. implementing "process PDAT elements only if
they are not buried more than 2 other elements deep inside a B110
element" is left as an exercise for the reader :-)
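
If you want a starting point for that exercise, one way is to keep a stack of open element names and count how many sit between the PDAT and its B110. The sketch below is only that, a sketch: DepthLimitedListener and MAX_BETWEEN are names I made up, I measure from the innermost enclosing B110 (the outermost would be an equally defensible reading), and I again assume no PDAT inside a PDAT.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Sketch: accept a PDAT only if at most MAX_BETWEEN other elements sit
# between it and the innermost enclosing B110. Hypothetical names throughout.
class DepthLimitedListener
  include REXML::StreamListener
  attr_reader :results

  MAX_BETWEEN = 2   # assumed limit, taken from the wording of the exercise

  def initialize
    @open = []      # names of currently open elements, outermost first
    @buffer = nil   # non-nil only while inside an accepted PDAT
    @results = []
  end

  def tag_start(name, attrs)
    if name == 'PDAT'
      b110 = @open.rindex('B110')
      # Elements strictly between the innermost B110 and this PDAT
      between = b110 && (@open.size - b110 - 1)
      @buffer = String.new if between && between <= MAX_BETWEEN
    end
    @open.push(name)
  end

  def text(text)
    @buffer << text if @buffer
  end

  def tag_end(name)
    @open.pop
    if name == 'PDAT'
      @results << @buffer if @buffer
      @buffer = nil
    end
  end
end
```

Feed it through REXML::Document.parse_stream the same way as the listener above and read the matches from results.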

> (started doing this by parsing the file line by line, however,
> ran into malformed XML where I decided that I needed to use the database
> functionality of XML.

If by "malformed XML" you mean syntactically invalid XML, such as
unescaped < > characters, then you may be hosed, as REXML's parsers will
likely choke on it.
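
In practice "choke" should mean a REXML::ParseException, which you can rescue. A minimal sketch of checking a document before committing to the real processing follows; NullListener and well_formed? are names I made up for illustration, and on a 2GB file this costs a full extra pass.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# A listener that ignores every event; StreamListener supplies the no-ops.
# NullListener is a hypothetical name, not part of REXML.
class NullListener
  include REXML::StreamListener
end

# Sketch: true if REXML can stream through the whole source without
# raising, false otherwise. well_formed? is a hypothetical helper.
def well_formed?(source)
  REXML::Document.parse_stream(source, NullListener.new)
  true
rescue REXML::ParseException => e
  warn "Parse failed: #{e.message}"
  false
end
```

Keep in mind that a stream parser hands you everything up to the bad spot before it raises, so if you rescue inside your real processing run instead, whatever was already printed stays printed.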

mathew
--
<URL:http://www.pobox.com/...

comp.lang.ruby
