Asp Forum - Bypassing XML inconsistencies with REXML::StreamListener

nutsmuggler

9/13/2007 6:12:00 PM

Hello folks.
I am trying to build a simple XML parser to extract data from IBM
translation manager memories. Here is a sample of such a memory file:

<NTMMemoryDb>
<Description>

</Description>
<Segment>0000000001
<Control>
00012200000001178876638English(U.S.)ITALIANIBMIDDOCBB1CTMST.
000BB1CTmst.idd
</Control>
<Source><ibmiddoc company="ibm" docstyle="ibmxagd" ibmcopyr="2007"
noteindent="no-noteindent"
brand="default-brand"></Source>
<Target><ibmiddoc company="ibm" docstyle="ibmxagd" ibmcopyr="2007"
noteindent="no-noteindent"
brand="default-brand"></Target>
</Segment>
<Segment>0000000002
<Control>
00000300000001178876638English(U.S.)ITALIANIBMIDDOCCONFIGUR.
000Configuration_PDSG.IDE
</Control>
<Source><titleblk>
<title>Configuration information and guidelines</title>
</titleblk></Source>
<Target><titleblk>
<title>Informazioni e istruzioni per la configurazione</title>
</titleblk></Target>
etc...

These memory files look a lot like XML, but I suspect they actually
conform to another standard. In fact, they often contain unclosed
("open") tags, because they store translation segments: when the
translated material is a website or an SGML document, the original HTML
or SGML may be split into two or more parts. So I often run into faulty
segments, and the open tags make REXML fail.
My code is quite simple:

require 'rexml/document'
require 'rexml/streamlistener'
include REXML

class Listener
  include StreamListener

  def initialize
    @segment = ""      # text of the segment currently being read
    @result = ""       # all matching segments collected so far
    @is_there = false  # true once the current segment mentions "blade"
  end

  # Label the English and Italian halves of each segment.
  def tag_start(name, attributes)
    @segment << "EN:" if name == "Source"
    @segment << "IT:" if name == "Target"
  end

  # A segment is complete when its Target closes; keep it only if it matched.
  def tag_end(name)
    if name == "Target"
      @result << @segment if @is_there
      @segment = ""
      @is_there = false
    end
    puts @result if name == "NTMMemoryDb"
  end

  def text(text)
    @segment << text
    @is_there = true if text =~ /blade/
  end
end

listener = Listener.new
parser = Parsers::StreamParser.new(File.new("bch01aad006_MEMORIA.EXP"), listener)
parser.parse

I need to bypass mistakes, and tell StreamListener: "when you
encounter a faulty segment, don't bother!"
How do I achieve this?
Thanks in advance,
Davide

3 Answers

James Britt

9/13/2007 7:33:00 PM

nutsmuggler wrote:
> Hello folks.
> I am trying to build a simple XML parser to extract data from IBM
> translation manager memories. Here is a sample os such memory files:
...
>
> I need to bypass mistakes, and tell StreamListener: "when you
> encounter a faulty segment, don't bother!"
> How do I achieve this?

Don't use an XML parser to handle non-XML?

Alternatively, have you tried the REXML pull parser? A bit more work in
that you have to explicitly pop items off the tag stack, but it may have
better options for recovering from bad markup.
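For reference, a minimal sketch of what the pull-parser approach could look like. The helper name pull_pairs is made up for illustration, and the sketch simply stops at the first markup error and returns whatever it collected up to that point; it does not attempt any recovery.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical helper (not part of REXML): collects [source, target] text
# pairs and stops quietly at the first markup error.
def pull_pairs(xml)
  parser = REXML::Parsers::PullParser.new(xml)
  pairs = []
  texts = { 'Source' => ''.dup, 'Target' => ''.dup }
  current = nil # name of the enclosing Source/Target element, if any

  begin
    while parser.has_next?
      event = parser.pull
      if event.start_element?
        current = event[0] if texts.key?(event[0])
      elsif event.text?
        # event[0] is the text as written in the file (entities untouched)
        texts[current] << event[0] if current
      elsif event.end_element? && texts.key?(event[0])
        current = nil
        if event[0] == 'Target'
          pairs << [texts['Source'].strip, texts['Target'].strip]
          texts.each_value(&:clear)
        end
      end
    end
  rescue REXML::ParseException
    # Malformed markup: give up here and keep what was read so far.
  end
  pairs
end
```

Everything before the first faulty segment is returned; everything after it is lost, which is the main limitation of this version.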

However, the underlying parser may still barf in trying to segment the
source into tags and such.
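One way around that, sketched here as an assumption rather than anything tested against real memory files: cut the file into <Segment> chunks with a regular expression, parse each chunk as its own small document, and skip the chunks that raise REXML::ParseException. The helper names text_of and parseable_pairs are made up for this example.

```ruby
require 'rexml/document'

# Hypothetical helper: concatenates every text node below +element+.
def text_of(element)
  element.children.map do |child|
    case child
    when REXML::Element then text_of(child)
    when REXML::Text    then child.value
    else ''
    end
  end.join
end

# Hypothetical helper: returns [source, target] text pairs for every
# <Segment> that REXML can parse; faulty segments are skipped.
def parseable_pairs(raw)
  pairs = []
  raw.scan(%r{<Segment>.*?</Segment>}m) do |chunk|
    begin
      root = REXML::Document.new(chunk).root
      source = root.elements['Source']
      target = root.elements['Target']
      pairs << [text_of(source).strip, text_of(target).strip] if source && target
    rescue REXML::ParseException
      # Faulty segment: ignore it and move on to the next chunk.
    end
  end
  pairs
end
```

This assumes segments never nest and that the whole file fits in memory. A call such as parseable_pairs(File.read("bch01aad006_MEMORIA.EXP")) would then return only the segments REXML accepts.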

Also, I don't know if Hpricot is happy with non-HTML, but it's worth a
shot to see if it can read and "fix" the source before you pass it to
another parser. You'll want to check that any modifications made to the
input do not change the essential semantics.

(Or perhaps you could just use Hpricot and extract data with XPath)

--
James Britt

"In Ruby, no one cares who your parents were, all they care
about is if you know what you are talking about."
- Logan Capaldo

Alex Young

9/13/2007 7:55:00 PM

nutsmuggler wrote:
> Hello folks.
> I am trying to build a simple XML parser to extract data from IBM
> translation manager memories. Here is a sample os such memory files:
>
> <NTMMemoryDb>
> <Description>
>
> </Description>
> <Segment>0000000001
> <Control>
> 00012200000001178876638English(U.S.)ITALIANIBMIDDOCBB1CTMST.
> 000BB1CTmst.idd
> </Control>
> <Source><ibmiddoc company="ibm" docstyle="ibmxagd" ibmcopyr="2007"
> noteindent="no-noteindent"
> brand="default-brand"></Source>
> <Target><ibmiddoc company="ibm" docstyle="ibmxagd" ibmcopyr="2007"
> noteindent="no-noteindent"
> brand="default-brand"></Target>
> </Segment>
> <Segment>0000000002
> <Control>
> 00000300000001178876638English(U.S.)ITALIANIBMIDDOCCONFIGUR.
> 000Configuration_PDSG.IDE
> </Control>
> <Source><titleblk>
> <title>Configuration information and guidelines</title>
> </titleblk></Source>
> <Target><titleblk>
> <title>Informazioni e istruzioni per la configurazione</title>
> </titleblk></Target>
> etc...
>
>
> These memory files are quite similar to XML files, but I suspect they
> actually conform to another standard. In fact, they often include
> "opened" tags; these because they store segments of translation; thus,
> when the translation is referred to a website or a SGML document, the
> original HTML or SGML might be split in two or more parts. So I often
> encounter faulty segments; open tags generate a REXML fault.
<snip>
It might be worth trying HTML Tidy in XML mode. I can't remember off
the top of my head how it'll react to missing close tags, but it's worth
a shot...

--
Alex

nutsmuggler

9/13/2007 8:53:00 PM

Hpricot is my man :-)
Being an HTML parser, it's much easier to please.
Here is the basic code I am using:

require 'rubygems'
require 'hpricot'

doc = Hpricot.XML(open("bch01aad006_MEMORIA.EXP"))
doc.search("Source").each do |item|
  if item.innerHTML =~ /firmware/
    puts "EN: #{item}"
    puts "IT: #{item.next_sibling}"
  end
end

The principle is quite simple, and the code is much more concise than
the REXML solution.
Thanks a million for the tip.
Davide

comp.lang.ruby
