HTML parser Hpricot? and how to get all text

SpringFlowers AutumnMoon

10/29/2007 1:29:00 PM

Would a good HTML parser be Hpricot? I wonder if anyone knows an easy
way for it to get all text of an HTML file? (removing all formatting
tags).
--
Posted via http://www.ruby-....

10 Answers

mortee

10/29/2007 2:33:00 PM

Thomas Wieczorek

10/29/2007 3:23:00 PM

2007/10/29, SpringFlowers AutumnMoon <summercoolness@gmail.com>:
> Would a good HTML parser be Hpricot?

It is a good and fast HTML and XML parser.

>I wonder if anyone knows an easy
> way for it to get all text of an HTML file? (removing all formatting
> tags).
>

Mortee's is a quick way to do it. If you need more information on it,
take a look at http://code.whytheluckystiff.n... or ask on
hpricot's mailing list.

Phlip

10/29/2007 6:53:00 PM

SpringFlowers AutumnMoon wrote:
> Would a good HTML parser be Hpricot?
>
It's extremely good; try it and see!
> I wonder if anyone knows an easy
> way for it to get all text of an HTML file? (removing all formatting
> tags).
>

.each_element( './/text()' ){}.join() might do it.

--
Phlip
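
The each_element call with an XPath argument resembles REXML's API rather than Hpricot's (an assumption; the post does not say which library is meant). As a rough sketch of the same idea with REXML, which only accepts well-formed XML and ships as a separate bundled gem on recent Ruby versions, the text nodes can be collected with an XPath query and joined:

```ruby
require 'rexml/document'

# Sample markup from this thread. REXML needs well-formed XML, so
# real-world HTML will often fail to parse here.
doc = REXML::Document.new("<b>hello <i>world</i></b>")

# '//text()' selects every text node in the document, in document order.
text = REXML::XPath.match(doc, '//text()').map(&:value).join

puts text
```

This is not a replacement for an HTML parser; it only illustrates what "join all the text nodes" means.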

SpringFlowers AutumnMoon

10/30/2007 2:47:00 PM

Phlip wrote:
> SpringFlowers AutumnMoon wrote:
>> Would a good HTML parser be Hpricot?
>>
> It's extremely good; try it and see!
>> I wonder if anyone knows an easy
>> way for it to get all text of an HTML file? (removing all formatting
>> tags).
>>
>
> .each_element( './/text()' ){}.join() might do it.

Does anyone know where to go from:

require 'hpricot'
doc = Hpricot("<b>hello <i>world</i></b>")

and what can i do to get "hello world"?

in
http://code.whytheluckystiff.net/hpricot/wiki/HpricotChallenge#Stripa...
it says just use

str=doc.to_s
print str.gsub(/<\/?[^>]*>/, "")

but can't the < > be nested in some HTML code? If it is nested then
the above won't work, it seems.

--
Posted via http://www.ruby-....
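
The concern about the regexp can be checked with plain Ruby strings, no Hpricot required. The sketch below applies the pattern quoted from the wiki to the sample markup and to a made-up tag whose attribute value contains a ">". The strip_tags name and the second input are assumptions for illustration only.

```ruby
# Hypothetical helper wrapping the regexp quoted from the Hpricot wiki.
def strip_tags(html)
  html.gsub(/<\/?[^>]*>/, "")
end

nested = "<b>hello <i>world</i></b>"
tricky = '<a title="1 > 0">link</a>'   # made-up input: ">" inside an attribute value

p strip_tags(nested)   # => "hello world"
p strip_tags(tricky)   # => " 0\">link"
```

Nested elements are handled, because each tag is matched on its own. What trips the pattern is a literal ">" inside a tag, which is one reason to let a parser extract the text instead.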

SpringFlowers AutumnMoon

10/30/2007 2:51:00 PM

by the way

require 'hpricot'

doc = Hpricot("<b>hello <i>world</i></b>")

p doc.search("").inner_text

won't work... i am not sure if it is the Win installer of Ruby... but it
is the most recent Win installer.

it says

scraper2.rb:6: undefined method `inner_text' for
#<Hpricot::Elements:0x348dbc4>
(NoMethodError)

and doc.to_plain_text() won't work either...

--
Posted via http://www.ruby-....

mortee

10/30/2007 8:13:00 PM

7stud --

10/30/2007 8:33:00 PM

SpringFlowers AutumnMoon wrote:
>
> in
> http://code.whytheluckystiff.net/hpricot/wiki/HpricotChallenge#Stripa...
> it says just use
>
> str=doc.to_s
> print str.gsub(/<\/?[^>]*>/, "")
>
> but can't the < > be nested in some HTML code? If it is nested then
> the above won't work, it seems.

What do you mean by nested? I would consider your example as containing
nested tags:

<b>hello <i>world</i></b>

and the regex removes all the tags from that string. html can look like
this:

<h2<p>hel<b<i>></h2<b>llo<h1<b<i>>>worl</i><b></h1>

What do you want to do with that string?

--
Posted via http://www.ruby-....

mortee

10/31/2007 4:04:00 PM

SpringFlowers AutumnMoon

11/3/2007 7:21:00 AM

mortee wrote:
> kendear wrote:
>>> irb(main):001:0> require 'hpricot'
>> yup, mine is
>>
>> C:\>gem list hpricot
>>
>> *** LOCAL GEMS ***
>>
>> hpricot (0.4)
>> a swift, liberal HTML parser with a fantastic library
>>
>> and d.inner_text or d.text both won't work.
>>
>
> Does something prevent you from upgrading?

I finally got the time to upgrade to Hpricot 0.6
so now, the following

require 'net/http'
require 'hpricot'

r = ""

Net::HTTP.start("www.google.com") do |http|
r = http.get("/")
end

c = Hpricot(r.body)
p c.to_plain_text

will work, and so will

p c.inner_text

as the last line. however, the CSS and Javascript lines are not
removed. So I think I can gsub the CSS and Javascript blocks with the
multiline regexp gsub.

I wonder though if there is a quick way, that will do what the lynx on
UNIX does... just print out a plain and readable text page.

--
Posted via http://www.ruby-....

SpringFlowers AutumnMoon

11/3/2007 8:11:00 AM

> however, the CSS and Javascript lines are not
> removed. So I think I can gsub the CSS and Javascript blocks with the
> multiline regexp gsub.
>
> I wonder though if there is a quick way, that will do what the lynx on
> UNIX does... just print out a plain and readable text page.

i got it to work till:

require 'open-uri'
require 'hpricot'

c = open('http://www.googl...).read

c.gsub!(/<style.*?<\/style.*?>/m, " ")
c.gsub!(/<script.*?<\/script.*?>/m, " ")

c.gsub!(/<(span|tr|td| ).*?>/, " ")
c.gsub!(/<(br|p|div|table).*?>/, "\n")

d = Hpricot(c).inner_text
d.gsub!(/\s+/, " ")
d.gsub!(/\n+/, "\n")

print d

but it is not so pretty, and it is not filtering out the non-printable
characters either.

--
Posted via http://www.ruby-....
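
One reason the output above is not so pretty: \s also matches newlines, so d.gsub!(/\s+/, " ") flattens every line break before the \n+ substitution runs. A minimal sketch of a clean-up step that keeps line breaks and drops control characters follows. It works on an already extracted string and uses no Hpricot calls; the tidy_text name and the sample input are hypothetical.

```ruby
# Hypothetical helper, not part of Hpricot: tidies text that has already
# been extracted from HTML.
def tidy_text(text)
  text
    .gsub(/[\x00-\x09\x0B-\x1F\x7F]/, " ")  # replace ASCII control characters other than newline
    .gsub(/ +/, " ")                        # collapse runs of spaces, leaving newlines alone
    .gsub(/ ?\n ?/, "\n")                   # trim a space on either side of each line break
    .gsub(/\n+/, "\n")                      # collapse consecutive line breaks into one
    .strip
end

sample = "hello \t world\x00\n\n\n  next\tline \n"   # made-up input
puts tidy_text(sample)
```

Only ASCII control characters are handled here; other non-printable code points would need a wider character class.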

comp.lang.ruby
