Asp Forum - Help with HTML parsing

Vivek Netha

1/1/2009 9:43:00 PM

Hello,

I'm new to Watir/Ruby and need to solve a problem that involves HTML
parsing - you could also call it screen scraping. I haven't used either
library before, but I wanted to know whether it is better to use Hpricot or
open_uri. The problem is similar to the following:

Let's say I'm searching Google for some string, "Dungeons & Dragons" for
instance. I want to parse the first results page and get the
title text and URL for the top 5 results. How would I do this using
Hpricot or open_uri or both?

Please help!

Viv.
--
Posted via http://www.ruby-....

1 Answer

? ??

1/12/2009 12:15:00 AM

You can also use the Ruby library Sanitize (http://wonko.com/pos...).

This library makes it very easy to clean up HTML by stripping or whitelisting tags.

Let's look at the following examples.

Using Sanitize is easy. First, install it:
sudo gem install sanitize

Then call it like so:

require 'rubygems'
require 'sanitize'

html = '<a href="http://foo....>foo</a></b&... src="http://foo.com/..." />'

Sanitize.clean(html) # => 'foo'

By default, Sanitize removes all HTML. You can use one of the built-in
configs to tell Sanitize to allow certain attributes and elements:

Sanitize.clean(html, Sanitize::Config::RESTRICTED)
# => 'foo'

Sanitize.clean(html, Sanitize::Config::BASIC)
# => '<a href="http://foo.... rel="nofollow">foo</a>'

Sanitize.clean(html, Sanitize::Config::RELAXED)
# => '<a href="http://foo....>foo</a></b&... src="http://foo.com/..." />'

Or, if you'd like more control over what's allowed, you can provide
your own custom configuration:

Sanitize.clean(html, :elements => ['a', 'span'],
  :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
  :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})

good one :)
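
Note that Sanitize cleans markup but does not by itself pull out link titles and URLs, which is what the question asks for. As for the two libraries named in the question, open-uri only fetches a page and Hpricot parses it, so they are complementary rather than alternatives. Below is a minimal sketch of the extraction step using only the Ruby standard library. The names SAMPLE_HTML and top_results are hypothetical, the sample markup is made up and is not Google's real result markup, and the regular expression is only a stand-in for a real parser such as Hpricot, which would be more robust on real pages.

```ruby
# Hypothetical sample markup standing in for a fetched results page.
# With open-uri you would fetch the real page first (not done here).
SAMPLE_HTML = <<~HTML
  <h3><a href="http://example.com/one">Dungeons &amp; Dragons One</a></h3>
  <h3><a href="http://example.com/two">Result <b>Two</b></a></h3>
  <h3><a href="http://example.com/three">Result Three</a></h3>
  <h3><a href="http://example.com/four">Result Four</a></h3>
  <h3><a href="http://example.com/five">Result Five</a></h3>
  <h3><a href="http://example.com/six">Result Six</a></h3>
HTML

# Assumed pattern: each result is a link directly inside an <h3> element.
RESULT_LINK = %r{<h3[^>]*>\s*<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>}mi

# Returns up to `limit` [title, url] pairs, in page order.
def top_results(html, limit = 5)
  html.scan(RESULT_LINK).first(limit).map do |url, inner|
    title = inner.gsub(/<[^>]+>/, '') # drop nested tags such as <b>
    title = title.gsub('&amp;', '&')  # decode the one entity used in the sample
    [title.strip, url]
  end
end

top_results(SAMPLE_HTML).each { |title, url| puts "#{title} -> #{url}" }
```

The limit defaults to 5 to match the "top 5 results" in the question. Only the pattern would need to change to fit the markup of the page you actually fetch.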

On 2009. 01. 02, at 6:42 AM, Vivek Netha wrote:

> Hello,
>
> I'm new to Watir\Ruby and need to resolve something that involves HTML
> parsing - you could also call it screen scraping. I haven't used either
> library before, but I wanted to know if it is better to use Hpricot or
> open_uri. The problem is similar to below:
>
> let's say I'm searching Google for some string, "Dungeons & Dragons" for
> instance. I want to parse through the first results page and get the
> title text and url for the top 5 results. How would I do this using
> Hpricot or open_uri or both?
>
> Please help!
>
>
> Viv.
> --
> Posted via http://www.ruby-....
>
>

comp.lang.ruby
