How to parse a unicode url?

Dan The man

9/19/2007 10:06:00 PM

I would really like to be able to do the following. Is this even
possible?

Thanks,
nerdytenor

uri = URI.parse('http://www.höre...) # not a real url (that I know
of)
URI::InvalidURIError: bad URI(is not URI?): http://www.hö...
from /usr/lib/ruby/1.8/uri/common.rb:432:in `split'
from /usr/lib/ruby/1.8/uri/common.rb:481:in `parse'
from (irb):26
--
Posted via http://www.ruby-....

7 Answers

7stud 7stud

9/20/2007 5:19:00 AM

Dan The man wrote:
> I would really like to be able to do the following. Is this even
> possible?
>
> Thanks,
> nerdytenor
>
> uri = URI.parse('http://www.höre...) # not a real url (that I know
> of)
> URI::InvalidURIError: bad URI(is not URI?): http://www.hö...
> from /usr/lib/ruby/1.8/uri/common.rb:432:in `split'
> from /usr/lib/ruby/1.8/uri/common.rb:481:in `parse'
> from (irb):26

You can do this:

require "uri"

url = "http://www.hören.co"
enc_url = URI.encode(url)
puts enc_url

to get this:

http://www.h%C...

which according to wikipedia here:

http://en.wikipedia.org/wiki/Percen...

is a legal uri. But when I do this:

require "uri"

url = "http://www.hören.co"
enc_url = URI.encode(url)
puts enc_url

uri = URI.parse(enc_url)

I get this:

http://www.h%C...
/usr/lib/ruby/1.8/uri/generic.rb:194:in `initialize': the scheme http
does not accept registry part: www.h%C3%B6ren.co (or bad hostname?)
(URI::InvalidURIError)
from /usr/lib/ruby/1.8/uri/http.rb:46:in `initialize'
from /usr/lib/ruby/1.8/uri/common.rb:484:in `new'
from /usr/lib/ruby/1.8/uri/common.rb:484:in `parse'
from r3test.rb:7

which as far as I can tell means that URI.parse() is broken.


Robert Klemme

9/20/2007 6:52:00 AM

On 20.09.2007 07:19, 7stud -- wrote:
> Dan The man wrote:
>> I would really like to be able to do the following. Is this even
>> possible?
>>
>> Thanks,
>> nerdytenor
>>
>> uri = URI.parse('http://www.höre...) # not a real url (that I know
>> of)
>> URI::InvalidURIError: bad URI(is not URI?): http://www.h...
>> from /usr/lib/ruby/1.8/uri/common.rb:432:in `split'
>> from /usr/lib/ruby/1.8/uri/common.rb:481:in `parse'
>> from (irb):26

There is no such thing as a Unicode URL. The RFCs for URI and URL
specify the charset as 7-bit ASCII, AFAIK.

The legal form of that URL is this: http://www.xn--hre...

See IDNA for details, for example:
http://de.wikipedia.org...

Quick searching revealed this - maybe it can help:
http://rubyforge.org/pipermail/idn-discuss/2005-September/0...

> You can do this:
>
> require "uri"
>
> url = "http://www.hören.co"
> enc_url = URI.encode(url)
> puts enc_url
>
>
> to get this:
>
> http://www.h%C...
>
> which according to wikipedia here:
>
> http://en.wikipedia.org/wiki/Percen...
>
> is a legal uri. But when I do this:
>
>
> require "uri"
>
> url = "http://www.hören.co"
> enc_url = URI.encode(url)
> puts enc_url
>
> uri = URI.parse(enc_url)
>
> I get this:
>
> http://www.h%C...
> /usr/lib/ruby/1.8/uri/generic.rb:194:in `initialize': the scheme http
> does not accept registry part: www.h%C3%B6ren.co (or bad hostname?)
> (URI::InvalidURIError)
> from /usr/lib/ruby/1.8/uri/http.rb:46:in `initialize'
> from /usr/lib/ruby/1.8/uri/common.rb:484:in `new'
> from /usr/lib/ruby/1.8/uri/common.rb:484:in `parse'
> from r3test.rb:7
>
>
> which as far as I can tell means that URI.parse() is broken.

I don't think so. There are invalid characters in the domain name (as
the exception indicates).

Kind regards

robert

7stud 7stud

9/20/2007 7:32:00 AM

Robert Klemme wrote:
>> which as far as I can tell means that URI.parse() is broken.
>
> I don't think so. There are invalid characters in the domain name (as
> the exception indicates).
>

Can you identify which character is invalid in:

http://www.h%C...

According to wikipedia, all those characters are valid for a uri.

Ollivier Robert

9/20/2007 11:56:00 AM

In article <aa11611ac34b5cc87350f239c546818f@ruby-forum.com>,
7stud -- <dolgun@excite.com> wrote:
>Can you identify which character is invalid in:
>
>http://www.h%C...
>
>According to wikipedia, all those characters are valid for a uri.

They are, but host and domain names do not accept Unicode characters at all and are limited to 7-bit ASCII. Search for IDN for more information.
--
Ollivier ROBERT -=- EEC/RIF/SEU -=-
Systems Engineering Unit
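
As a rough illustration of the point above, the sketch below lists the characters in a host name that fall outside the traditional letters, digits, hyphen and dot set. The helper name `invalid_host_chars` is hypothetical, and the accepted character set is an assumption about what a strict host check allows, not the exact rule Ruby's URI library applies.

```ruby
# Hypothetical helper: return the distinct characters in +host+ that are
# not ASCII letters, digits, hyphens, or dots (the assumed strict host set).
def invalid_host_chars(host)
  host.each_char.reject { |ch| ch =~ /\A[A-Za-z0-9.-]\z/ }.uniq
end

# The percent-encoded host from the thread: only "%" is flagged.
p invalid_host_chars("www.h%C3%B6ren.co")    # => ["%"]

# The IDN (Punycode) form quoted later in the thread passes the check.
p invalid_host_chars("www.xn--hren-5qa.at")  # => []
```

Under this assumption, every character of the percent-encoded form is legal somewhere in a URI, but the "%" itself is not acceptable inside the host part.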

Dan The man

9/20/2007 4:59:00 PM

Ollivier Robert wrote:
> In article <aa11611ac34b5cc87350f239c546818f@ruby-forum.com>,
> 7stud -- <dolgun@excite.com> wrote:
>>Can you identify which character is invalid in:
>>
>>http://www.h%C...
>>
>>According to wikipedia, all those characters are valid for a uri.
>
> They are, but host and domain names do not accept Unicode characters at all
> and are limited to 7-bit ASCII. Search for IDN for more information.

I thought this might be the case. However, typing the following into
firefox gets me a real live page (after trying a few random domains)

http://www.hö...

Hmmm...

Arlen Cuss

9/26/2007 2:33:00 PM

> I thought this might be the case. However, typing the following into
> firefox gets me a real live page (after trying a few random domains)
>
> http://www.hö...
>
> Hmmm...

That's because Firefox automatically translates into the equivalent IDN;
attempting to look up the domain www.hören.at (or hören.at) gives
NXDOMAIN - it will never work. www.hören.at is translated into the
punycode name www.xn--hren-5qa.at. Try visiting that in Firefox - it
actually rewrote the URL as www.hören.at for me, but rest assured the
domain is one and the same.

Only 7-bit ASCII is to be used in domain names; please use IDN to
represent international characters, as it's the accepted way to do it,
and it is implemented as you see in Firefox.

Cheers.

Arlen
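
To make the translation described above concrete, here is a minimal pure-Ruby sketch of the Punycode encoding step from RFC 3492. The method names (`punycode_encode`, `to_ascii_host`) are hypothetical, and the sketch assumes the input is already lowercase and normalized; real IDNA processing also applies name preparation, which is skipped here. A maintained library such as the libidn wrapper is the better choice for real use.

```ruby
# Punycode parameters from RFC 3492, section 5.
BASE = 36
TMIN = 1
TMAX = 26
SKEW = 38
DAMP = 700
INITIAL_BIAS = 72
INITIAL_N = 128

# Bias adaptation function (RFC 3492, section 6.1).
def adapt(delta, num_points, first_time)
  delta = first_time ? delta / DAMP : delta / 2
  delta += delta / num_points
  k = 0
  while delta > ((BASE - TMIN) * TMAX) / 2
    delta /= BASE - TMIN
    k += BASE
  end
  k + (BASE - TMIN + 1) * delta / (delta + SKEW)
end

# Map a digit value 0..35 to "a".."z", "0".."9".
def punycode_digit(d)
  (d < 26 ? d + 97 : d - 26 + 48).chr
end

# Encode one label (no dots) without the "xn--" prefix.
def punycode_encode(label)
  cps = label.unpack("U*")
  output = cps.select { |c| c < 128 }.pack("U*")
  basic = output.length
  handled = basic
  output << "-" if basic > 0
  n = INITIAL_N
  delta = 0
  bias = INITIAL_BIAS
  while handled < cps.length
    m = cps.select { |c| c >= n }.min
    delta += (m - n) * (handled + 1)
    n = m
    cps.each do |c|
      delta += 1 if c < n
      next unless c == n
      q = delta
      k = BASE
      loop do
        t = if k <= bias then TMIN elsif k >= bias + TMAX then TMAX else k - bias end
        break if q < t
        output << punycode_digit(t + (q - t) % (BASE - t))
        q = (q - t) / (BASE - t)
        k += BASE
      end
      output << punycode_digit(q)
      bias = adapt(delta, handled + 1, handled == basic)
      delta = 0
      handled += 1
    end
    delta += 1
    n += 1
  end
  output
end

# Convert each non-ASCII label of a host name to its "xn--" form.
def to_ascii_host(host)
  host.split(".").map { |label|
    label.unpack("U*").all? { |c| c < 128 } ? label : "xn--" + punycode_encode(label)
  }.join(".")
end

puts to_ascii_host("www.h\u00F6ren.at")  # prints www.xn--hren-5qa.at
```

The result matches the name quoted above, and it can be handed to a resolver or to URI.parse like any other ASCII host name.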

Eric Hodel

9/26/2007 8:13:00 PM

On Sep 20, 2007, at 24:31 , 7stud -- wrote:
> Robert Klemme wrote:
>>> which as far as I can tell means that URI.parse() is broken.
>>
>> I don't think so. There are invalid characters in the domain name
>> (as
>> the exception indicates).
>>
>
> Can you identify which character is invalid in:
>
> http://www.h%C...
>
> According to wikipedia, all those characters are valid for a uri.

It says that which characters are valid for each piece of a URI
depends on the URI scheme. The characters valid for the hostname
part of the http URI scheme are governed by DNS, so you
need to use an IDN.

I believe there is a ruby wrapper for libidn.
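
The per-piece rule can be sketched as follows: the host carries the IDN (Punycode) form, while percent-encoding is reserved for other parts such as the path. The path segment used here is purely hypothetical, and `URI.encode_www_form_component` is a method from later Ruby versions than the 1.8 shown in the thread, so treat this as an illustration rather than what the posters ran.

```ruby
require "uri"

# Host: ASCII-only IDN form, as quoted earlier in the thread.
host = "www.xn--hren-5qa.at"

# Path segment (hypothetical): non-ASCII text is percent-encoded as UTF-8 bytes.
segment = URI.encode_www_form_component("h\u00F6ren")

# Assemble the URI from its pieces; each piece is checked against its own rules.
uri = URI::HTTP.build(:host => host, :path => "/" + segment)

puts uri                        # prints http://www.xn--hren-5qa.at/h%C3%B6ren
puts URI.parse(uri.to_s).host   # prints www.xn--hren-5qa.at
```

The same "%C3%B6" sequence that was rejected in the host name is accepted in the path.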

comp.lang.ruby
