comp.lang.ruby

Youtube...urgent, please help

Arun Kumar

3/17/2009 4:43:00 AM

Hi,

I'm new to Ruby and my company has given me an assignment in Ruby. It
involves HTML extraction. It works fine except for some sites like
http://www.y..., http://www... where I get errors like
'400 Bad Request' and 'getaddrinfo: Name or service not known
(SocketError)' respectively for each of the two sites. I have read that
this may be because the URL is being redirected, but I'm not sure about
it. My code for HTML extraction is:

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'dbi'

puts "Enter domain name :"
domain = gets
# concatenating 'http://www.' with the domain to open the page
url = "http://www."+domain
document = open(url)
#getting the original url of the site
url2 = document.base_uri.to_s

Can anybody please help? It is urgent. I'll be really grateful to those
who reply.

Regards,
Arun Kumar

Attachments:
http://www.ruby-...attachment/3450/ht...

--
Posted via http://www.ruby-....

7 Answers

David Masover

3/17/2009 5:42:00 AM

Arun Kumar wrote:
> Hi,
>
> I'm new to ruby and my co. has given me an assignment in ruby. It is
> regarding html extraction.

You probably want Mechanize.

> domain = gets
> #concatinating 'http://www.' with the url to open the page
> url = "http://www."+domain
>

Take a look at that URL -- I'd say you don't need 'www' in that.

But I'm guessing what's hurting is the newline at the end of it.

Quick fix:

domain = gets.chomp
url = "http://#{domain}"
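
To see why the trailing newline matters, here is a minimal sketch. `gets` returns the typed text plus the "\n" from the Enter key, so the concatenated URL ends in a newline. The helper name `build_url` and the sample domain `example.com` are made up for illustration; they are not part of the original script.

```ruby
# Hypothetical helper (not from the thread): turn raw keyboard input into a URL.
def build_url(raw_input)
  domain = raw_input.strip                  # drops the trailing "\n" from gets and stray spaces
  domain = domain.sub(%r{\Ahttps?://}i, '') # tolerate input that already carries a scheme
  "http://#{domain}"
end

raw = "example.com\n"   # what gets returns when the user types example.com and presses Enter

p "http://www." + raw   # the original concatenation keeps the newline inside the URL
p build_url(raw)        # the cleaned-up URL has no newline
```

`strip` is used here instead of `chomp` only because it also removes accidental spaces; `chomp` alone is enough to remove the newline.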

Arun Kumar

3/17/2009 5:58:00 AM

David Masover wrote:
> Arun Kumar wrote:
>> Hi,
>>
>> I'm new to ruby and my co. has given me an assignment in ruby. It is
>> regarding html extraction.
>
> You probably want Mechanize.
>
>> domain = gets
>> #concatinating 'http://www.' with the url to open the page
>> url = "http://www."+domain
>>
>
> Take a look at that URL -- I'd say you don't need 'www' in that.
>
> But I'm guessing what's hurting is the newline at the end of it.
>
> Quick fix:
>
> domain = gets.chomp
> url = "http://#{domain}"
Sorry to say, David, I tried that but I still get the same error. Is it
because I've not set the user agent? Can you please tell me how to set the
user_agent for Mozilla?
Thanks for ur immediate reply

Martin DeMello

3/17/2009 6:09:00 AM

On Tue, Mar 17, 2009 at 11:28 AM, Arun Kumar
<arunkumar@innovaturelabs.com> wrote:
> Sorry to say David, I tried that but the same error is producing. Is it
> because i've not set the user agent. Can u please tell me how to set the
> user_agent for mozilla.

http://mechanize.rubyforge.org/mechanize/EXAMPLES... has some
examples setting the user agent. Google around and see what the
mozilla user agent should be -
http://www.user-agents.org/index... has an extensive list, for
instance.

> Thanks for ur immediate reply

Don't do that, it's annoying.

martin

Arun Kumar

3/17/2009 6:26:00 AM

Martin DeMello wrote:
> On Tue, Mar 17, 2009 at 11:28 AM, Arun Kumar
> <arunkumar@innovaturelabs.com> wrote:
>> Sorry to say David, I tried that but the same error is producing. Is it
>> because i've not set the user agent. Can u please tell me how to set the
>> user_agent for mozilla.
>
> http://mechanize.rubyforge.org/mechanize/EXAMPLES... has some
> examples setting the user agent. Google around and see what the
> mozilla user agent should be -
> http://www.user-agents.org/index... has an extensive list, for
> instance.
>
>> Thanks for ur immediate reply
>
> Don't do that, it's annoying.
>
> martin

Can I use a user agent with Hpricot, or can it only be used with
Mechanize? I've found a user agent for Mozilla:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)
But it still shows the same error.

Serabe

3/17/2009 6:35:00 AM

2009/3/17 Arun Kumar <arunkumar@innovaturelabs.com>:

> Can i use user-agents in hpricot? or if it can be used only for
> mechanize. I've found a user-agent for mozilla :
> Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
> 1.1.4322; .NET CLR 2.0.50727)
> But still it is showing the same error.

I found this:

http://schf.uc.org/articles/2007/02/14/scraping-gmail-with-mechanize-a...

It scrapes Gmail. If my memory serves, it is one that gives you
some problems.

Cheers,

Serabe

--
http://www....

Martin DeMello

3/17/2009 6:53:00 AM

On Tue, Mar 17, 2009 at 11:55 AM, Arun Kumar
<arunkumar@innovaturelabs.com> wrote:
>
> Can i use user-agents in hpricot? or if it can be used only for
> mechanize.

Hpricot is an HTML parser; I don't think it concerns itself with
actually fetching the page. Use Mechanize for that.

martin
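
Since the original script fetches the page with open-uri, the User-Agent can also be set at that fetch step, independently of whichever parser reads the HTML afterwards. The following is a minimal sketch, assuming a current Ruby where open-uri is called as `URI.open` (the bare `open(url)` form from the original script no longer works in Ruby 3). The helper names `request_options` and `fetch_html` are hypothetical; the user agent string is the one quoted earlier in the thread.

```ruby
require 'open-uri'

# User agent string quoted earlier in the thread; any browser-style string could be used.
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; ' \
             '.NET CLR 1.1.4322; .NET CLR 2.0.50727)'

# Hypothetical helper: build the options hash handed to URI.open.
# open-uri treats string keys in this hash as HTTP request headers.
def request_options(user_agent = USER_AGENT)
  { 'User-Agent' => user_agent }
end

# Hypothetical helper: fetch a page body while sending a custom User-Agent.
def fetch_html(url, user_agent = USER_AGENT)
  URI.open(url, request_options(user_agent)) { |io| io.read }
end

# Usage (needs network access):
#   html = fetch_html('http://example.com')
#   The returned string can then be handed to an HTML parser.
```

This keeps fetching and parsing as two separate steps, which matches the point above that the parser is not the component that talks to the server.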

David Masover

3/17/2009 7:23:00 PM

Martin DeMello wrote:
> On Tue, Mar 17, 2009 at 11:55 AM, Arun Kumar
> <arunkumar@innovaturelabs.com> wrote:
>
>> Can i use user-agents in hpricot? or if it can be used only for
>> mechanize.
>>
>
> Hpricot is an html parser, I don't think it concerns itself with
> actually fetching the page. Use mechanize for that.
>

What's more, mechanize doesn't even use hpricot anymore -- it uses nokogiri.
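
On the redirect theory from the original post: open-uri follows redirects on its own, and `base_uri` reports the URI it ended up at, so redirection by itself is an unlikely cause. To check what a site actually does, the redirect chain can be walked by hand with Net::HTTP. This is only a diagnostic sketch; the helper name `fetch_following_redirects`, the `requester` parameter and the limit of 5 are assumptions made for this example.

```ruby
require 'net/http'
require 'uri'

MAX_REDIRECTS = 5 # arbitrary cap chosen for this sketch

# Hypothetical helper: follow redirects manually and return [final_uri, response].
# `requester` is injectable so the loop can be exercised without a network.
def fetch_following_redirects(url, limit: MAX_REDIRECTS,
                              requester: Net::HTTP.method(:get_response))
  uri = URI.parse(url)
  limit.times do
    response = requester.call(uri)
    # Anything that is not a 3xx response ends the chain.
    return [uri, response] unless response.is_a?(Net::HTTPRedirection)

    # The Location header may be relative, so resolve it against the current URI.
    uri = URI.join(uri.to_s, response['location'])
    puts "redirected to #{uri}"
  end
  raise "gave up after #{limit} requests"
end

# Usage (needs network access):
#   final_uri, response = fetch_following_redirects('http://example.com')
#   puts final_uri, response.code
```

If the chain ends in a 400 response, the printed URIs show which request the server rejected.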