Asp Forum - Need help parsing HTML with Hpricot...

Randy R

10/25/2007 6:56:00 AM

I'm having trouble understanding Hpricot (thanks to an abominable lack
of documentation). I'm trying to parse HTML of the following nature:

This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />

I know, it looks simple but, frankly, I have no clue how to parse this
with Hpricot. Particularly, I don't know how to single out the lines of
text in between the "br" tags. This is important 'cause I need to know
where the line breaks are in the text, as well as the new paragraphs.
Does anyone know how to do this with Hpricot?
Thank you...

3 Answers

Mikel Lindsaar

10/25/2007 7:37:00 AM

You can try each_child.

I will use each_child_with_index to show you what I mean:

Put your raw HTML text into @text

@parsed_html = Hpricot(@text)
@parsed_html.each_child_with_index do |c,i|
  puts "Line #{i}: #{c.to_s.strip}"
end

Produces:

Line 0: This is one line of text
Line 1: <br />
Line 2: This is another line of text
Line 3: <br />
Line 4: It keeps going on like this
Line 5: <br />
Line 6:
Line 7: <br />
Line 8: Until a new paragraph is started
Line 9: <br />
Line 10: Otherwise, it's just more of the same
Line 11: <br />
Line 12:

Hope that helps.

Mikel

Thomas Wieczorek

10/25/2007 7:47:00 AM

2007/10/25, Just Another Victim of the Ambient Morality <ihatespam@hotmail.com>:
> I'm having trouble understanding Hpricot (thanks to an abominable lack
> of documentation). I'm trying to parse HTML of the following nature:
>

Try http://code.whytheluckystiff.net/hpricot/wiki/AnHpric...
for examples and some better documentation. It helped me a lot to
solve my problems.

Mikel Lindsaar

10/25/2007 7:49:00 AM

Of course... you could also do:

require 'rubygems'
require 'hpricot'

text = <<HERE
This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />
HERE

class String
  # True when the string is nothing but a br tag, in any common spelling.
  def not_needed?
    !!(strip =~ /\A<br\s*\/?>\z/i)
  end
end

@parsed_html = Hpricot(text)
@paragraphs = Array.new
@parsed_html.each_child_with_index do |c,i|
  line = c.to_s.strip
  if line == ""
    # An empty text node marks the end of a paragraph: print it and start over.
    puts @paragraphs.join
    @paragraphs.clear
  else
    # Collect real text, skipping the br tags themselves.
    @paragraphs << "#{line} " unless line.not_needed?
  end
end

Which produces:

This is one line of text This is another line of text It keeps going on like this 
Until a new paragraph is started Otherwise, it's just more of the same 

Now... don't pick on my favorite HTML parser again! :D Just ask nicely :)

Mikel
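
The loop above joins each paragraph's lines with spaces, so the positions of the individual line breaks are lost. Since the original question asks for both the line breaks and the paragraph breaks, one possible variant keeps every line separate. This is only a minimal sketch in plain Ruby: NODE_STRINGS is a hand-written stand-in for the stripped c.to_s value of each child node, the "<br />" spelling in it is an assumption about what Hpricot prints, and group_paragraphs and BR_TAG are hypothetical names, not part of Hpricot.

```ruby
# Stand-ins for the stripped `c.to_s` value of each child node, in document
# order. With Hpricot these would come from each_child; the exact spelling of
# the br tag is an assumption.
NODE_STRINGS = [
  "This is one line of text", "<br />",
  "This is another line of text", "<br />",
  "It keeps going on like this", "<br />",
  "", "<br />",
  "Until a new paragraph is started", "<br />",
  "Otherwise, it's just more of the same", "<br />",
  ""
].freeze

# Matches a lone br tag written as <br>, <br/> or <br />.
BR_TAG = /\A<br\s*\/?>\z/i

# Hypothetical helper: groups node strings into paragraphs,
# where each paragraph is an array of its lines.
def group_paragraphs(node_strings)
  paragraphs = []
  current = []
  node_strings.each do |raw|
    piece = raw.strip
    if piece.empty?
      # Empty text between two br tags: the current paragraph ends here.
      paragraphs << current unless current.empty?
      current = []
    elsif piece !~ BR_TAG
      # Real text: one line of the current paragraph.
      current << piece
    end
    # A br tag itself needs no action; each text node is already its own line.
  end
  paragraphs << current unless current.empty?
  paragraphs
end

group_paragraphs(NODE_STRINGS).each_with_index do |lines, i|
  puts "Paragraph #{i + 1}:"
  lines.each { |line| puts "  #{line}" }
end
```

With real Hpricot input, the array could be built by collecting c.to_s inside each_child instead of being typed in by hand. The result is a nested array (paragraphs, then lines), so both kinds of break stay available to whatever comes next.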

comp.lang.ruby
