comp.lang.ruby

how to extract url's from html source of google search result

sujeet kumar

6/11/2005 6:44:00 PM

Hi,
I want to make a Tk window where you enter a search string; the program
searches for it on Google and prints the web addresses (HTTP URLs) of the
results in a TkFrame in that window. My program already connects to the
net and gets the HTML source through the function "http.get".
Now, how can I find the URLs of the search results in that HTML source?
Can I do it with a regular expression, or is there another way?
Any suggestions are welcome.
Thanks,
sujeet


3 Answers

Marcel Molina Jr.

6/12/2005 12:31:00 AM

On Sun, Jun 12, 2005 at 03:44:03AM +0900, sujeet kumar wrote:
> I want to make a Tk window where you give some input string and it
> search that on google and prints the web address (http url) of the
> result found on google in the TkFrame of that window. My program
> connects to net and get the html source through function "http.get".
> Now from html source , how can I find the url's of the search. Can i
> do it by regular expression or any other way.
> Give me any suggestion.

The URI.extract method from the uri library can extract an array of URIs from
a string:

require 'uri'
URI.extract('My favorite site is http://googl...)
# => ["http://google...]

An optional second argument can limit the schemes that it will match against
and return:

URI.extract('Why do people use mailto:me@lala.org links?')
# => ["mailto:me@lala.org"]
URI.extract('Why do people use mailto:me@lala.org links?', 'http')
# => []
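
Applied to the original question, a minimal sketch might look like the following. The `html` string is a made-up stand-in for whatever "http.get" returns, not real Google output, and passing an array of schemes is an assumption about how you would want to filter. Note that newer Ruby versions mark URI.extract as obsolete (it warns in verbose mode), so treat this as an illustration rather than a recommendation.

```ruby
require 'uri'

# Hypothetical stand-in for the page body returned by http.get.
html = <<~HTML
  <html><body>
    <a href="http://www.ruby-lang.org/en/">Ruby</a>
    <a href="mailto:someone@example.com">Mail</a>
    <a href="https://example.com/docs?page=2">Docs</a>
    <a href="http://www.ruby-lang.org/en/">Ruby again</a>
  </body></html>
HTML

# Keep only http/https URIs and drop repeated links.
urls = URI.extract(html, %w[http https]).uniq

puts urls
```

This scans the raw text, so it also picks up URLs that appear outside of links (in scripts, comments, or plain text), which may or may not be what you want.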

marcel
--
Marcel Molina Jr. <marcel@vernix.org>


Alexey Verkhovsky

6/12/2005 12:45:00 AM

Marcel Molina Jr. wrote:

>On Sun, Jun 12, 2005 at 03:44:03AM +0900, sujeet kumar wrote:
>>how can I find the url's of the search. Can i
>>do it by regular expression or any other way.
>
>The URI.extract method from the uri library can extract an array of uri's from
>a string:

A universal regexp that finds URIs in arbitrary text is a
complicated thing, indeed. Besides, it can produce false positives
(finding things that look like URIs, but aren't).

If you are sure that the page is well-formed XHTML (I'm not sure
whether that's the case with Google), you might instead parse it with
REXML and use XPath to retrieve the href attributes of all <a>..</a>
elements, selecting only those that start with "http://" (there may also
be mailto:, ftp:, JavaScript calls, etc.).
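
A rough sketch of that approach, assuming the input really is well-formed XML. The `xhtml` string below is an invented sample, not actual Google markup, and on recent Ruby versions REXML ships as a bundled gem, so it may need to be installed or listed in your Gemfile.

```ruby
require 'rexml/document'

# Hypothetical well-formed XHTML standing in for a fetched page.
xhtml = <<~XHTML
  <html>
    <body>
      <a href="http://www.ruby-lang.org/en/">Ruby</a>
      <a href="mailto:someone@example.com">Mail</a>
      <a href="ftp://ftp.example.com/pub/">FTP</a>
      <a href="javascript:void(0)">Script</a>
      <a name="top">No href</a>
      <p><a href="http://example.com/docs">Docs</a></p>
    </body>
  </html>
XHTML

doc = REXML::Document.new(xhtml)

http_links = []
# Visit every <a> element in the tree, at any depth.
REXML::XPath.each(doc, '//a') do |anchor|
  href = anchor.attributes['href'] # nil when the element has no href
  # Keep only links that start with "http://".
  http_links << href if href && href.start_with?('http://')
end

puts http_links
```

If the page is not well-formed, REXML::Document.new will raise a parse error, so this only helps when the markup is clean.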

Best regards,
Alexey Verkhovsky




Eric Hodel

6/12/2005 2:24:00 AM

On 11 Jun 2005, at 11:44, sujeet kumar wrote:

> hi
> I want to make a Tk window where you give some input string and it
> search that on google and prints the web address (http url) of the
> result found on google in the TkFrame of that window. My program
> connects to net and get the html source through function "http.get".
> Now from html source , how can I find the url's of the search. Can i
> do it by regular expression or any other way.

Why not use the Google API?

--
Eric Hodel - drbrain@segment7.net - http://se...
FEC2 57F1 D465 EB15 5D6E 7C11 332A 551C 796C 9F04