comp.lang.ruby

Need help parsing HTML with Hpricot...

Randy R

10/25/2007 6:56:00 AM

I'm having trouble understanding Hpricot (thanks to an abominable lack
of documentation). I'm trying to parse HTML of the following nature:


This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />


I know, it looks simple but, frankly, I have no clue how to parse this
with Hpricot. Particularly, I don't know how to single out the lines of
text in between the "br" tags. This is important 'cause I need to know
where the line breaks are in the text, as well as the new paragraphs.
Does anyone know how to do this with Hpricot?
Thank you...


3 Answers

Mikel Lindsaar

10/25/2007 7:37:00 AM

You can try each_child.

I will use each_child_with_index to show you what I mean:

Put your raw HTML text into @text

@parsed_html = Hpricot(@text)
@parsed_html.each_child_with_index do |c, i|
  puts "Line #{i}: #{c.to_s.strip}"
end

Produces:

Line 0: This is one line of text
Line 1: <br />
Line 2: This is another line of text
Line 3: <br />
Line 4: It keeps going on like this
Line 5: <br />
Line 6:
Line 7: <br />
Line 8: Until a new paragraph is started
Line 9: <br />
Line 10: Otherwise, it's just more of the same
Line 11: <br />
Line 12:

Hope that helps.

Mikel



Thomas Wieczorek

10/25/2007 7:47:00 AM

2007/10/25, Just Another Victim of the Ambient Morality <ihatespam@hotmail.com>:
> I'm having trouble understanding Hpricot (thanks to an abominable lack
> of documentation). I'm trying to parse HTML of the following nature:
>

Try http://code.whytheluckystiff.net/hpricot/wiki/AnHpric...
for examples and some better documentation. It helped me a lot to
solve my problems.

Mikel Lindsaar

10/25/2007 7:49:00 AM

Of course... you could also do:

require 'rubygems'
require 'hpricot'

text = <<HERE
This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />
HERE

doc = Hpricot(text)
lines = []  # text lines collected for the current paragraph

doc.each_child do |child|
  line = child.to_s.strip
  next if line == "<br />"  # a single break only separates lines
  if line.empty?
    # An empty text node (two breaks in a row, or the trailing newline)
    # closes the paragraph. Use join: on Ruby 1.9+ interpolating an
    # Array prints its inspect form instead of the joined text.
    puts "<p>#{lines.join}</p>" unless lines.empty?
    lines.clear
  else
    lines << "#{line} "
  end
end

Which produces:

<p>This is one line of text This is another line of text It keeps going on like this </p>
<p>Until a new paragraph is started Otherwise, it's just more of the same </p>

Now... don't pick on my favorite HTML parser again! :D Just ask nicely :)

Mikel
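
The original question also asked to know where the line breaks are, and joining each paragraph into one string loses them. Here is a minimal sketch that keeps both, assuming the input is flat text separated by "br" tags like the sample. Note that it uses plain String#split with a regex rather than Hpricot, and that the names BR and paragraphs_from_br are made up for illustration. It is not meant for arbitrary nested HTML.

```ruby
# Matches <br>, <br/>, <br /> in any letter case (assumption: no attributes).
BR = %r{<br\s*/?>}i

# Hypothetical helper: returns an array of paragraphs,
# each paragraph being an array of its text lines.
def paragraphs_from_br(html)
  paragraphs = [[]]
  html.split(BR).each do |chunk|
    line = chunk.strip
    if line.empty?
      # Two breaks in a row (or trailing whitespace) end the paragraph.
      paragraphs << [] unless paragraphs.last.empty?
    else
      paragraphs.last << line
    end
  end
  paragraphs.reject { |lines| lines.empty? }
end

sample = <<HTML
This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />
HTML

paragraphs_from_br(sample).each_with_index do |lines, i|
  puts "Paragraph #{i + 1}:"
  lines.each { |line| puts "  #{line}" }
end
```

Each inner array holds one paragraph's lines in order, so the line breaks survive as array boundaries and can be re-emitted however the output format needs them.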