comp.lang.ruby
Help with HTML parsing
Vivek Netha
1/1/2009 9:43:00 PM
Hello,
I'm new to Watir/Ruby and need to solve a problem that involves HTML
parsing, or screen scraping if you prefer. I haven't used either
library before, and I'd like to know whether it is better to use Hpricot or
open-uri. The problem is similar to the following:
Let's say I'm searching Google for some string, "Dungeons & Dragons" for
instance. I want to parse the first results page and get the
title text and URL for the top 5 results. How would I do this using
Hpricot, open-uri, or both?
Please help!
Viv.
--
Posted via http://www.ruby-...
1 Answer
? ??
1/12/2009 12:15:00 AM
You can also use the Ruby library Sanitize (http://wonko.com/pos...).
This library makes it very easy to clean up HTML (it strips or whitelists markup rather than extracting data from it).
See the following examples.
Using Sanitize is easy. First, install it:
sudo gem install sanitize
Then call it like so:
require 'rubygems'
require 'sanitize'
html = '<b><a href="http://foo....>foo</a></b&... src="http://foo.com/..." />'
Sanitize.clean(html) # => 'foo'
By default, Sanitize removes all HTML. You can use one of the built-in
configs to tell Sanitize to allow certain attributes and elements:
Sanitize.clean(html, Sanitize::Config::RESTRICTED)
# => '<b>foo</b>'
Sanitize.clean(html, Sanitize::Config::BASIC)
# => '<b><a href="http://foo.... rel="nofollow">foo</a></b>'
Sanitize.clean(html, Sanitize::Config::RELAXED)
# => '<b><a href="http://foo....>foo</a></b&... src="http://foo.com/..." />'
Or, if you'd like more control over what's allowed, you can provide
your own custom configuration:
Sanitize.clean(html, :elements => ['a', 'span'],
    :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
    :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})
good one :)
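On the original question: open-uri and Hpricot are not alternatives. open-uri only fetches the page, and a parser such as Hpricot then pulls the links out of it. Below is a rough sketch of that two-step shape using only the standard library, so it runs without extra gems. The search URL layout, the sample markup, and the helper names search_url and top_links are all assumptions made up for illustration. The regex is only a stand-in for a real HTML parser and will not cope with arbitrary real-world markup.

```ruby
require 'uri'
require 'cgi'

# Build the search URL. The query has to be form-encoded, otherwise the
# "&" in "Dungeons & Dragons" would be read as a parameter separator.
# The host and path here are assumptions, not a documented endpoint.
def search_url(query)
  "http://www.google.com/search?" + URI.encode_www_form(q: query)
end

# Stand-in for a real parser: captures the href and inner markup of <a> tags.
LINK = %r{<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)</a>}mi

# Return the first `limit` links as hashes with :title and :url.
def top_links(html, limit = 5)
  html.scan(LINK).first(limit).map do |href, inner|
    # Drop nested tags such as <em>, then decode entities like &amp;
    title = CGI.unescapeHTML(inner.gsub(/<[^>]+>/, '')).strip
    { title: title, url: CGI.unescapeHTML(href) }
  end
end

# Hypothetical results markup, used instead of a live request.
SAMPLE = (1..6).map { |i|
  %(<h3><a href="http://example.com/#{i}">Dungeons &amp; <em>Dragons</em> #{i}</a></h3>)
}.join("\n")

# With a live page, the fetch step would be something like:
#   require 'open-uri'
#   html = URI.open(search_url("Dungeons & Dragons")).read
top_links(SAMPLE).each { |r| puts "#{r[:title]} -> #{r[:url]}" }
```

With a real parser, only top_links would change; the fetch step stays the same. Which links count as "results" depends on the page's actual markup, so inspect the HTML before settling on a selector.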
On 2009-01-02, at 6:42 AM, Vivek Netha wrote:
> Hello,
>
> I'm new to Watir\Ruby and need to resolve something that involves HTML
> parsing - you could also call it screen scraping. I haven't used either
> library before, but I wanted to know if it is better to use Hpricot or
> open_uri. The problem is similar to below:
>
> let's say I'm searching Google for some string, "Dungeons & Dragons" for
> instance. I want to parse through the first results page and get the
> title text and url for the top 5 results. How would I do this using
> Hpricot or open_uri or both?
>
> Please help!
>
> Viv.
> --
> Posted via http://www.ruby-...