comp.lang.ruby

open-uri bug

Steve H.

2/26/2008 7:47:00 PM

Hello all, I'm using open-uri combined with hpricot to make a basic
web crawler that scrapes for different links that I need. It seems to
be working perfectly, but I have encountered the following bug when
this type of link is encountered:

irb(main):015:0> URI.parse('http://hello.com/a.p...)
URI::InvalidURIError: bad URI(is not URI?): http://hello.co...
from c:/ruby/lib/ruby/1.8/uri/common.rb:436:in `split'
from c:/ruby/lib/ruby/1.8/uri/common.rb:485:in `parse'
from (irb):15

Can anyone illuminate why this is a problem? Thanks!
5 Answers

Rob Biedenharn

2/26/2008 8:20:00 PM

On Feb 26, 2008, at 2:50 PM, Steve H. wrote:

> Hello all, I'm using open-uri combined with hpricot to make a basic
> web crawler that scrapes for different links that I need. It seems to
> be working perfectly, but I have encountered the following bug when
> this type of link is encountered:
>
> irb(main):015:0> URI.parse('http://hello.com/a.p...)
> URI::InvalidURIError: bad URI(is not URI?): http://hello.co...
> from c:/ruby/lib/ruby/1.8/uri/common.rb:436:in `split'
> from c:/ruby/lib/ruby/1.8/uri/common.rb:485:in `parse'
> from (irb):15
>
> Can anyone illuminate why this is a problem? Thanks!


Probably because %1 looks like an incomplete percent-escape (a % must be
followed by two hex digits). Try:

?%251

where %25 is an escaped %.

-Rob

Rob Biedenharn http://agileconsult...
Rob@AgileConsultingLLC.com
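
Rob's suggestion can be generalized. The following is a minimal sketch, not part of the original thread. It assumes the only problem in the URL is a % that is not followed by two hex digits, and rewrites each such % as %25 before parsing. The helper name escape_stray_percents and the example.com URL are hypothetical, since the URL in the original post is truncated.

```ruby
require 'uri'

# Hypothetical helper: turn every '%' that does not start a valid
# percent-escape (two hex digits) into '%25'. Valid escapes such as
# '%20' are left untouched.
def escape_stray_percents(url)
  url.gsub(/%(?![0-9A-Fa-f]{2})/, '%25')
end

raw   = 'http://example.com/a.php?id=%1&name=a%20b' # made-up URL
fixed = escape_stray_percents(raw)

puts fixed                 # the repaired URL string
uri = URI.parse(fixed)     # parses once the stray '%' is escaped
puts uri.query
```

Keep in mind that this changes the request: the server receives an escaped literal percent sign, which may or may not match what a browser would send for the same link.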



Steve H.

2/26/2008 9:16:00 PM

On Feb 26, 12:20 pm, Rob Biedenharn <R...@AgileConsultingLLC.com>
wrote:
> Probably because %1 looks like a partially escaped character. Try:
>
> ?%251
> Where %25 is an escaped %
>
> -Rob
>

I appreciate the reply. This is a bit unfortunate, since I am developing a
tool which has to handle URIs the same way a browser does. While I
realize that this is not a "correct" URI, the browser still fetches the
page without a problem. I wish I could mirror the browser's fetch
behavior using the URI module. Anyhow, thank you for your help!

Siep Korteling

2/26/2008 9:44:00 PM

Steve H. wrote:
> On Feb 26, 12:20 pm, Rob Biedenharn <R...@AgileConsultingLLC.com>
> wrote:
>> Probably because %1 looks like a partially escaped character. Try:
>>
>> ?%251
>> Where %25 is an escaped %
>>
>> -Rob
>>
>
> I appreciate the reply. This is a bit unfortunate, I am developing a
> tool which has to handle URIs the same way the browser does. While I
> realize that is not a "correct" URI, the browser still fetches the
> pages without a problem. In some sense, I wish I could mirror the
> functionality of the browser fetch using the URI module. Anyhow, thank
> you for your help!

Maybe this helps:

URI.escape('http://hello.com/a.p...)

=> "http://hello.com/a.php?...

Regards,

Siep
--
Posted via http://www.ruby-....
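
A note for readers on current Rubies: URI.escape was removed in Ruby 3.0. The sketch below is not from the thread. It assumes the older RFC 2396 parser class, URI::RFC2396_Parser, is available in your uri library and uses its escape method as a stand-in for the call above. The helper name legacy_escape and the example.com URL are made up for illustration.

```ruby
require 'uri'

# Hypothetical wrapper around the RFC 2396 parser's escape method,
# used here as a replacement for the removed URI.escape.
def legacy_escape(url)
  URI::RFC2396_Parser.new.escape(url)
end

escaped = legacy_escape('http://example.com/a.php?id=%1') # made-up URL
puts escaped

uri = URI.parse(escaped) # the escaped string is now a parseable URI
puts uri.query
```

Be aware that this escapes every %, so a URL that already contains a valid escape such as %20 ends up with %2520. It is only safe on strings that have not been escaped yet.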

Eric Hodel

2/27/2008 4:52:00 AM

On Feb 26, 2008, at 13:20 PM, Steve H. wrote:
> On Feb 26, 12:20 pm, Rob Biedenharn <R...@AgileConsultingLLC.com>
> wrote:
>> Probably because %1 looks like a partially escaped character. Try:
>>
>> ?%251
>> Where %25 is an escaped %
>>
>> -Rob
>>
>
> I appreciate the reply. This is a bit unfortunate, I am developing a
> tool which has to handle URIs the same way the browser does. While I
> realize that is not a "correct" URI, the browser still fetches the
> pages without a problem. In some sense, I wish I could mirror the
> functionality of the browser fetch using the URI module. Anyhow, thank
> you for your help!

What about Mechanize?

Piyush Ranjan

3/1/2008 8:45:00 AM

I too want to know how to handle invalid URIs in Mechanize. Is there any way
to override the URL checking?
