comp.lang.ruby

Ways to filter bad (Unicode) characters

Berlin Brown

4/19/2006 4:16:00 PM

I am parsing a collection of URLs, some of which seem to contain Chinese,
Indian, and other Unicode characters. My question: how can I filter those
out while still leaving room for alphanumerics and other characters that
are typical of a URL or title?

For example I might get a URL with:
http://????????????-????????
title = ??????

where each ? represents some Unicode character

I want to filter these out, but leave room for non-alphanumeric characters:

http://www...

--
Berlin Brown
(ramaza3 on freenode)
http://www.newspiritc...
http://www.newspiritc.../newforums
also checkout alpha version of botverse:
http://www.newspiritc...:8086/universe_home
4 Answers

baumanj

4/19/2006 4:39:00 PM

How about:

new_url = ''
url.each_byte {|b| new_url << b if b < 128 }

That should keep all the ASCII bytes and drop all the non-ASCII ones.
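
The same byte filter can be wrapped in a small method. This is only a sketch: the name strip_non_ascii and the example.com URL are made up for illustration, and it assumes Ruby 1.9 or later, where strings carry an encoding.

```ruby
# Hypothetical helper (the name is not from the thread):
# keep only the 7-bit ASCII bytes of a string and drop everything else.
def strip_non_ascii(str)
  kept = str.bytes.select { |b| b < 128 }
  # pack('C*') rebuilds a binary string from the byte values;
  # every remaining byte is ASCII, so relabeling the encoding is safe.
  kept.pack('C*').force_encoding('US-ASCII')
end

url = "http://example.com/\u4e2d\u6587-page"
puts strip_non_ascii(url)   # prints http://example.com/-page
```

Note that this silently changes the URL, so the result may no longer point at the original resource.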


Berlin Brown

4/19/2006 6:16:00 PM

baumanj@gmail.com wrote:
> How about:
>
> new_url = ''
> url.each_byte {|b| new_url << b if b < 128 }
>
> That should keep all the ASCII bytes and drop all the non-ASCII ones.
>
I didn't want to modify the URL so much as check how valid it is
(US-valid, that is).




baumanj

4/19/2006 9:40:00 PM

In that case:

invalid = false
url.each_byte {|b| invalid = true if b > 127 }
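
A variant of the same check that stops at the first offending byte might look like the following. It is a sketch, assuming Ruby 1.8.7 or later (where each_byte without a block returns an enumerator); the method name ascii_url? and the example.com URLs are invented for illustration.

```ruby
# Hypothetical helper name (not from the thread):
# true when no byte in the string is above 127.
def ascii_url?(url)
  # none? stops at the first byte that matches the block,
  # so a long string is not scanned further than necessary.
  url.each_byte.none? { |b| b > 127 }
end

puts ascii_url?("http://www.example.com/index.html")   # prints true
puts ascii_url?("http://example.com/\u4e2d\u6587")      # prints false
```

On Ruby 1.9 and later, the built-in String#ascii_only? should give the same answer for UTF-8 strings without any helper.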


Dave Burt

4/19/2006 10:15:00 PM

baumanj@gmail.com wrote:
> In that case:
>
> invalid = false
> url.each_byte {|b| invalid = true if b > 127 }

Or:

require 'enumerator'  # only needed on older Ruby 1.8 releases
class String
  # True when every byte is 7-bit ASCII (0..127).
  def seven_bit_clean?
    enum_for(:each_byte).all? {|c| c <= 127 }
  end
end

Cheers,
Dave
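
The byte checks in this thread accept any ASCII character, including spaces and control characters. If the goal is to allow only characters that are typical of a URL, a stricter whitelist is one option. The following is a minimal sketch, assuming the reserved and unreserved character sets from RFC 3986 plus the percent sign; URL_CHARS and url_chars_only? are hypothetical names, and the example.com URLs are placeholders.

```ruby
# Assumption: "typical URL characters" means RFC 3986 unreserved characters
# (letters, digits, - . _ ~), reserved characters (: / ? # [ ] @ ! $ & ' ( ) * + , ; =)
# and % for percent-escapes. This checks characters only, not URL structure.
URL_CHARS = %r{\A[A-Za-z0-9\-._~:/?\#\[\]@!$&'()*+,;=%]*\z}

# Hypothetical helper name: true when the string uses only whitelisted characters.
def url_chars_only?(url)
  !!(url =~ URL_CHARS)
end

puts url_chars_only?("http://example.com/a?b=1&c=%20")   # prints true
puts url_chars_only?("http://example.com/\u4e2d\u6587")   # prints false
puts url_chars_only?("http://example.com/a b")            # prints false
```

A string can pass this test and still be a malformed URL, so it is best treated as a character filter rather than a validator.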