Asp Forum - XML parser; maybe ruby is too slow?

nutsmuggler

9/15/2007 12:50:00 PM

Hello folks.
I managed to write an SGML parser with the hpricot library. As I
explained in a previous thread, I just need to compare the source and
target tags of translation memory files from IBM Translation Manager.
The script now runs correctly, but I realised that it cannot cope
with large files; I tried to process a TM file larger than 1MB and the
script took ages to generate the output. Should I switch to a compiled
language for this specific task?
At any rate, here is the script; it's very basic. Please let me know
if I did something wrong or if its slowness is a necessary drawback of
Ruby being interpreted. Cheers,
Davide

#!/usr/local/bin/ruby
require 'rubygems'
require 'hpricot'

$pattern = "server"
result = File.new("result.html", "w")
$stdout = result
puts "<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01//EN'
'http://www.w3.org/TR/html4/strict.dt...\n
<head>\n
<meta http-equiv='Content-type' content='text/html; charset=utf-8'>\n
<title>Ricerca di '#{$pattern}'</title>\n
<style type='text/css'>
body {
}
p {
margin: 0px;
}
p.source {
background: #FFFFCC;
padding: 10px 5px 10px 5px;
}
p.target {
background: #F8A271;
padding: 10px 5px 10px 5px;
}
span.pattern {
background: #B6B6B6;
}
</style>
</head>\n
<body>\n"
# to read from stdin instead:
# doc = Hpricot.XML(STDIN)

doc = Hpricot.XML(open("bch01aad006_MEMORIA.EXP"))
doc.search("Source").each do |item|
  if item.innerHTML =~ /#{$pattern}/
    highlightedSource = item.innerHTML.gsub(/#{$pattern}/, "<span class='pattern'>#{$pattern}</span>")
    puts "<p class='source'>EN: #{highlightedSource}</p>\n"
    puts "<p class='target'>IT: #{item.next_sibling.html}</p>\n<hr/>"
  end
end
puts "</body>"

2 Answers

yermej

9/16/2007 3:57:00 AM

I haven't done any comparison testing, but if your *.EXP files are
truly XML, Ruby libxml might be a better choice as it's just a wrapper
around the libxml2 library (see http://libxml.ruby...).

Jeremy
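
For anyone who does want to run a comparison, Ruby's standard Benchmark module is one way to time candidate approaches side by side. The sketch below is only an illustration of the technique: the sample data and the two helper methods (count_interpolated, count_precompiled) are made up for this example and are not from the posted script. It times one small cost in a loop like the posted one, namely rebuilding the regex from a string on every iteration versus compiling it once. Whether that matters next to the parse step is exactly what a timing run would show; the Hpricot.XML call could be wrapped in a report block in the same way.

```ruby
require 'benchmark'

# Hypothetical sample data standing in for the text of many <Source> elements.
segments = Array.new(20_000) { |i| "Restart the server before step #{i}" }
pattern  = "server"

# Variant A: rebuild the regex from the string on every iteration,
# the way /#{$pattern}/ is written inside the loop of the posted script.
def count_interpolated(segments, pattern)
  segments.count { |s| s =~ /#{pattern}/ }
end

# Variant B: compile the regex once, escaping any regex metacharacters.
def count_precompiled(segments, pattern)
  re = Regexp.new(Regexp.escape(pattern))
  segments.count { |s| s =~ re }
end

# Print user, system and wall-clock time for each variant.
Benchmark.bm(14) do |bm|
  bm.report("interpolated:") { count_interpolated(segments, pattern) }
  bm.report("precompiled:")  { count_precompiled(segments, pattern) }
end
```

Timings vary by machine and Ruby version, so no expected figures are given here.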

nutsmuggler

9/16/2007 10:05:00 AM

The problem is that the EXP files are actually SGML; I could not parse them
with REXML precisely because they are not well-formed XML: they
contain unclosed tags, which are apparently valid in some SGML formats, but
not in XML. That is why I had to use hpricot, which is less picky.
Cheers,
Davide
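
Because the files are SGML with unclosed tags, one option that avoids building a document tree at all is a plain regex scan over the raw text. The following is a minimal sketch under assumptions, not a tested replacement for the script: it assumes every <Source>...</Source> element is immediately followed by its <Target>...</Target> element, that neither carries attributes, and that the file fits in memory. The sample data, the PAIR constant and the helper names matching_pairs and highlight are hypothetical, chosen for illustration.

```ruby
# Assumed layout: each <Source>...</Source> is directly followed by its
# <Target>...</Target>; any other tag (e.g. <Control>) may be left unclosed.
PAIR = %r{<Source>(.*?)</Source>\s*<Target>(.*?)</Target>}m

# Returns [source, target] pairs whose source text contains the pattern.
def matching_pairs(sgml, pattern)
  needle = Regexp.new(Regexp.escape(pattern))
  sgml.scan(PAIR).select { |source, _target| source =~ needle }
end

# Wraps every literal occurrence of the pattern in the highlighting span.
def highlight(text, pattern)
  text.gsub(pattern) { |hit| "<span class='pattern'>#{hit}</span>" }
end

# Made-up sample in the assumed layout, with an unclosed <Control> tag.
sample = <<~SGML
  <Segment>0000000001
  <Control>
  000001
  <Source>Restart the server.</Source>
  <Target>Riavviare il server.</Target>
  </Segment>
  <Segment>0000000002
  <Control>
  000002
  <Source>Close the window.</Source>
  <Target>Chiudere la finestra.</Target>
  </Segment>
SGML

matching_pairs(sample, "server").each do |source, target|
  puts "<p class='source'>EN: #{highlight(source, 'server')}</p>"
  puts "<p class='target'>IT: #{target}</p>"
  puts "<hr/>"
end
```

For a real file, the sample string would be replaced by something like File.read on the EXP file. If the actual layout differs from the assumptions above (attributes on the tags, other elements between Source and Target), the PAIR expression would need adjusting.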

comp.lang.ruby
