comp.lang.ruby

Decent HTML Parser?

Kevin Weller

7/11/2006 7:52:00 PM

Anybody have experience with a decent HTML parser for a Ruby
application? I've looked around, and so far everything I've found is
either unfinished, unstable, [relatively] undocumented, or just plain
ugly in terms of API.

I'd like a parser that can take a partial HTML file and return an
easily-traversable data structure, in the same order that the elements
appear in the file. I don't want or need a callback mechanism, only
something I can iterate and tree-search. Though I don't hold much hope
it will work, I will try using REXML on my text and see what it
produces...results to be posted here. Thanks in advance!
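
A minimal sketch of what trying REXML on a partial document might look like, assuming the fragment is wrapped in a synthetic root element. The `parse_fragment` helper and the sample markup are illustrative only and come from no library. REXML is a strict XML parser, so a well-formed fragment parses and yields its elements in source order, while ordinary tag soup raises an error instead of being repaired:

```ruby
require 'rexml/document'

# Hypothetical helper: wrap the fragment in a synthetic root so REXML
# sees exactly one top-level element.
def parse_fragment(fragment)
  REXML::Document.new("<fragment>#{fragment}</fragment>").root
end

# A well-formed fragment parses; child elements come back in source order.
good  = parse_fragment('<p>one</p><ul><li>two</li><li>three</li></ul>')
names = good.elements.to_a.map(&:name)

# Typical tag soup (an unclosed <li>) raises instead of being repaired.
begin
  parse_fragment('<ul><li>two</ul>')
  soup_parsed = true
rescue REXML::ParseException
  soup_parsed = false
end

puts names.inspect
puts soup_parsed
```

This is why the replies in this thread lean toward either a forgiving parser or a cleanup pass before handing the markup to REXML.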

--
Kevin Weller
Information Technology Crucible
http://www.itcr...
18 Answers

Bruno Celeste

7/11/2006 8:00:00 PM

You can check out the Rubyful Soup library at
http://www.crummy.com/software/Ru...

On 7/11/06, Kevin Weller
<"http://www.itcr.../contact"@ruby-la... wrote:
> Anybody have experience with a decent HTML parser for a Ruby
> application? I've looked around, and so far everything I've found is
> either unfinished, unstable, [relatively] undocumented, or just plain
> ugly in terms of API.
>
> I'd like a parser that can take a partial HTML file and return an
> easily-traversable data structure, in the same order that the elements
> appear in the file. I don't want or need a callback mechanism, only
> something I can iterate and tree-search. Though I don't hold much hope
> it will work, I will try using REXML on my text and see what it
> produces...results to be posted here. Thanks in advance!
>
> --
> Kevin Weller
> Information Technology Crucible
> http://www.itcr...
>
>

Alex Young

7/11/2006 8:11:00 PM

Bruno Celeste wrote:
> You can check rubyful soup library at
> http://www.crummy.com/software/Ru...
>
... and if that doesn't help, Tidy + REXML does fine for me.

--
Alex

> On 7/11/06, Kevin Weller
> <"http://www.itcr.../contact"@ruby-la... wrote:
>> Anybody have experience with a decent HTML parser for a Ruby
>> application? I've looked around, and so far everything I've found is
>> either unfinished, unstable, [relatively] undocumented, or just plain
>> ugly in terms of API.
>>
>> I'd like a parser that can take a partial HTML file and return an
>> easily-traversable data structure, in the same order that the elements
>> appear in the file. I don't want or need a callback mechanism, only
>> something I can iterate and tree-search. Though I don't hold much hope
>> it will work, I will try using REXML on my text and see what it
>> produces...results to be posted here. Thanks in advance!
>>
>> --
>> Kevin Weller
>> Information Technology Crucible
>> http://www.itcr...
>>
>>
>


Kenosis

7/11/2006 9:41:00 PM

To help us narrow things down, can you tell us which Ruby HTML mods
you've tried out?

Ken

Kevin Weller wrote:
> Anybody have experience with a decent HTML parser for a Ruby
> application? I've looked around, and so far everything I've found is
> either unfinished, unstable, [relatively] undocumented, or just plain
> ugly in terms of API.
>
> I'd like a parser that can take a partial HTML file and return an
> easily-traversable data structure, in the same order that the elements
> appear in the file. I don't want or need a callback mechanism, only
> something I can iterate and tree-search. Though I don't hold much hope
> it will work, I will try using REXML on my text and see what it
> produces...results to be posted here. Thanks in advance!
>
> --
> Kevin Weller
> Information Technology Crucible
> http://www.itcr...

Phillip Hutchings

7/11/2006 9:52:00 PM

On 7/12/06, Bruno Celeste <bruno.celeste@gmail.com> wrote:
> You can check rubyful soup library at
> http://www.crummy.com/software/Ru...

I'll second Rubyful Soup. It's not the fastest, but it tolerates bad
HTML. I used it for an intranet spider, and it worked with anything I
could find.
--
Phillip Hutchings
http://www.sit...

Ezra Zygmuntowicz

7/11/2006 10:06:00 PM


On Jul 11, 2006, at 2:51 PM, Phillip Hutchings wrote:

> On 7/12/06, Bruno Celeste <bruno.celeste@gmail.com> wrote:
>> You can check rubyful soup library at
>> http://www.crummy.com/software/Ru...
>
> I'll second Rubyful Soup. It's not the fastest, but it tolerates bad
> HTML. I used it for an intranet spider, it worked with anything I
> could find.
> --
> Phillip Hutchings
> http://www.sit...
>


Have a look at Hpricot, _why's new Ruby/C HTML parser. It's fast and
has nice features.

http://redhanded.hobix.com/inspect/okayGiveHpricot...

-Ezra

Trans

7/11/2006 10:07:00 PM


Kevin Weller wrote:
> Anybody have experience with a decent HTML parser for a Ruby
> application? I've looked around, and so far everything I've found is
> either unfinished, unstable, [relatively] undocumented, or just plain
> ugly in terms of API.
>
> I'd like a parser that can take a partial HTML file and return an
> easily-traversable data structure, in the same order that the elements
> appear in the file. I don't want or need a callback mechanism, only
> something I can iterate and tree-search. Though I don't hold much hope
> it will work, I will try using REXML on my text and see what it
> produces...results to be posted here. Thanks in advance!

I'll also mention Facets' tagiterator.rb (from ?nyasu's tagiter.rb). For
example:

a = TagIterator.new(stext)
a.first("body") do |y|
  y.nth("dl",2) do |dl|
    dl.enumtag("dt") do |t|
      puts t.text.strip
    end
  end
end

http://facets.rubyforge.org/api/more/classes/TagIte...

T.


Geoff Davis

7/11/2006 10:12:00 PM

On Tue, 11 Jul 2006 13:52:13 -0600, Kevin Weller wrote:

> Anybody have experience with a decent HTML parser for a Ruby
> application? I've looked around, and so far everything I've found is
> either unfinished, unstable, [relatively] undocumented, or just plain
> ugly in terms of API.
>
> I'd like a parser that can take a partial HTML file and return an
> easily-traversable data structure, in the same order that the elements
> appear in the file. I don't want or need a callback mechanism, only
> something I can iterate and tree-search. Though I don't hold much hope
> it will work, I will try using REXML on my text and see what it
> produces...results to be posted here. Thanks in advance!

You might find Rubyful Soup useful:

http://www.crummy.com/software/Ru...

It's a bit slow, but it is quite robust to bad HTML.

Kevin Weller

7/12/2006 12:49:00 AM

Thanks for the reply. I've basically reviewed every potential match
generated by:

http://raa.ruby-lang.org/search.rhtml?search=h...

I've since tried out a couple, and the option that seems to work best so
far is Ned Konz' ruby-htmltools. Unfortunately, it does not seem to
parse partial HTML documents well, so I've had to resort to parsing the
whole thing, extracting a REXML document object, then using XPath to get
to the content I care about. Seems like a waste of processing power
when I can get the necessary markup in text with a simple file.grep
operation and (theoretically) parse only the text that I want, but at
least I have something that works until/unless something better comes
along. Any recommendations?
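
For comparison, the two approaches described above can be sketched with REXML alone. This assumes the markup is already well-formed XML and that the wanted element sits on a single line. The sample page, the `content` id, and the variable names are made up for illustration, and ruby-htmltools is not used here:

```ruby
require 'rexml/document'

# Hypothetical page; assumed to be well-formed already.
page = <<~HTML
  <html>
    <body>
      <div id="nav"><a href="/">Home</a></div>
      <div id="content"><p>First</p><p>Second</p></div>
    </body>
  </html>
HTML

# Approach 1: parse the whole document, then pick out the content with XPath.
doc = REXML::Document.new(page)
from_xpath = REXML::XPath.match(doc, '//div[@id="content"]/p').map(&:text)

# Approach 2: grep for the one interesting line and parse only that line.
line = page.lines.grep(/id="content"/).first
fragment = REXML::Document.new(line).root
from_grep = fragment.elements.to_a('p').map(&:text)

puts from_xpath.inspect
puts from_grep.inspect
```

The second approach only works when the grepped line is a complete, balanced element on its own; otherwise the parser has nothing valid to build a tree from.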

Kenosis wrote:
> To help us narrow things down, can you tell us which Ruby HTML mods
> you've tried out?
>
> Ken
>
> Kevin Weller wrote:
>> Anybody have experience with a decent HTML parser for a Ruby
>> application? I've looked around, and so far everything I've found is
>> either unfinished, unstable, [relatively] undocumented, or just plain
>> ugly in terms of API.

Kevin Weller

7/12/2006 12:53:00 AM

Geoff Davis wrote:
> On Tue, 11 Jul 2006 13:52:13 -0600, Kevin Weller wrote:
>
>> Anybody have experience with a decent HTML parser for a Ruby
>> application? I've looked around, and so far everything I've found is
>> either unfinished, unstable, [relatively] undocumented, or just plain
>> ugly in terms of API.
>> --- snip ---
>
> You might find Rubyful Soup useful:
>
> http://www.crummy.com/software/Ru...
>
> It's a bit slow, but it is quite robust to bad HTML.

Ooooh, thanks, that might be just what the doctor ordered...especially
if it handles a single text line of an HTML document. Right now I have
a temporary solution that involves using ruby-htmltools to parse the
entire document, then finding the part that I want with an XPath query.
However, Rubyful Soup might turn out to be a better performer if it
does what I want. Thanks so much!

assaf.arkin@gmail.com

7/12/2006 8:08:00 AM

Kevin,

I settled on using Tidy to clean up the HTML, then parsing it into a
tree using the HTML scanner that comes with Rails.

Tidy does all the hard stuff of dealing with bad HTML and straightening
it up. The HTML scanner is very lightweight and has a simple, clean
API. You don't need to run Rails, just require the scanner library
(look for html/document.rb).

It's two passes, but with Tidy being written in C and the HTML scanner
doing no cleanup, it's amazingly fast. I'm processing around 500Kb/s
(mobile Core Duo, 1.8GHz).

You can walk the DOM, or use XPath-like finders, or my preferred method
of looking up content: using CSS selectors.
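
As a rough illustration of the difference between walking the tree and using a finder, here is a sketch that uses REXML as a stand-in. It does not use the Rails HTML scanner or its API, and the `walk` helper and the sample markup are hypothetical:

```ruby
require 'rexml/document'

# Hypothetical helper: depth-first walk that yields each element together
# with its nesting depth, in the order the elements appear in the source.
def walk(element, depth = 0, &block)
  yield element, depth
  element.elements.each { |child| walk(child, depth + 1, &block) }
end

doc = REXML::Document.new('<div><h1>Title</h1><dl><dt>a</dt><dd>b</dd></dl></div>')

# Walking the DOM: visit every element and record its name and depth.
visited = []
walk(doc.root) { |el, depth| visited << [el.name, depth] }
visited.each { |name, depth| puts "#{'  ' * depth}#{name}" }

# Using a finder: jump straight to the elements of interest.
terms = REXML::XPath.match(doc, '//dt').map(&:text)
puts terms.inspect
```

A walk suits cases where the surrounding structure matters; a finder is shorter when only a few known elements are needed.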

If you're doing HTML scraping this library will do all the hard work
for you:
http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit...

Assaf
http://la...


Kevin Weller wrote:
> Anybody have experience with a decent HTML parser for a Ruby
> application? I've looked around, and so far everything I've found is
> either unfinished, unstable, [relatively] undocumented, or just plain
> ugly in terms of API.
>
> I'd like a parser that can take a partial HTML file and return an
> easily-traversable data structure, in the same order that the elements
> appear in the file. I don't want or need a callback mechanism, only
> something I can iterate and tree-search. Though I don't hold much hope
> it will work, I will try using REXML on my text and see what it
> produces...results to be posted here. Thanks in advance!
>
> --
> Kevin Weller
> Information Technology Crucible
> http://www.itcr...