comp.lang.ruby
handling special characters
Sajal Kayan
8/4/2008 8:21:00 AM
Hi all.
I am very new to Ruby (5 days old), so my question might sound very
noobish. I am posting it only because I couldn't find a solution.
I am using Ruby to scrape the content of a site.
To be precise, I am having problems with the ’ character.
Sample source page :
http://thainews.prd.go.th/newsenglish/previewnews.php?news_id=25...
The page is encoded in TIS-620 and I use Iconv to convert it to UTF-8;
however, the special quote character makes iconv raise the following error:
/home/....../main.rb:37:in `iconv': "\222s announcement "...
(Iconv::IllegalSequence)
The affected code:
body = story.search("//font[@color=\"#333333\"]").inner_html
body = body.gsub(/<(.|\n)+?>/, "")
body = body.gsub(/’/, "\'")
puts body
body = Iconv.iconv("utf8", "tis-620", body) #<-- this is line 37
puts body
Or try the following in irb:
require 'rubygems'
require 'net/http'
require 'open-uri'
require 'iconv'
story = Hpricot(open('http://thainews.prd.go.th/newsenglish/previewnews.php?news_id=25...'))
body = story.search("//font[@color=\"#333333\"]").inner_html
body = body.gsub(/<(.|\n)+?>/, "")
body = body.gsub(/’/, "\'")
puts body
No matter what I put in place of the "’", it doesn't replace anything, and
iconv still gives errors.
I am looking for pointers on one of the following:
1) How do I replace "’" with "'"?
2) How can I make iconv ignore the "’"?
At first I thought this was an I18n issue, but I guess getting rid of
the special character should be a simple string manipulation that I don't
get.
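On question 1, here is a minimal sketch of why the gsub never matches. It assumes Ruby 2.0 or later (for String#b) and uses a made-up sample string in place of the scraped page. The error message shows the quote arriving as the single byte \222 (0x92), whereas a curly quote pasted into a UTF-8 source file is three bytes, so the pattern and the data never line up.

```ruby
# Hypothetical stand-in for the scraped text: raw single-byte data in which
# the curly apostrophe is the lone byte 0x92 (octal \222), as in the error.
body = "Minister\x92s announcement".b

# A curly quote typed into a UTF-8 source file is three bytes (E2 80 99),
# so it cannot match the single byte 0x92 in the scraped data.
utf8_quote = "\u2019".b
puts body.include?(utf8_quote)

# Matching the raw byte itself does work.
cleaned = body.gsub("\x92".b, "'")
puts cleaned
```

Replacing the byte before conversion loses the typographic quote, so it is a workaround rather than a fix for the underlying charset mismatch.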
--
Posted via
http://www.ruby-...
.
2 Answers
Sajal Kayan
8/4/2008 8:23:00 AM
Oh, and you would also need to
require 'mechanize'
in irb to reproduce the issue:
require 'rubygems'
require 'net/http'
require 'open-uri'
require 'mechanize'
require 'iconv'
story = Hpricot(open('http://thainews.prd.go.th/newsenglish/previewnews.php?news_id=2551080...'))
body = story.search("//font[@color=\"#333333\"]").inner_html
body = body.gsub(/<(.|\n)+?>/, "")
body = body.gsub(/’/, "\'")
puts body
Sajal Kayan
8/4/2008 9:56:00 AM
Heesob Park wrote:
> The ' character (0x92) is not in tis-620 but in windows-874 character
> set.
>
> Refer to
>
> http://www.langbox.com/codeset/t...
>
> http://www.microsoft.com/globaldev/reference/sbc...
>
> Try
> body = Iconv.iconv("utf-8", "windows-874", body).join
>
> Regards,
>
> Park Heesob
Awesome, it works like a charm now. Thanks for the prompt response.
It seems the source site was sending the wrong HTML headers.
You saved me from going bald :D
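For readers on current Ruby, where Iconv is no longer part of the standard library (it was removed in 2.0), the same fix can be sketched with String#encode. The sample bytes below are made up for illustration; only the encoding names come from the thread. The second conversion is one possible take on question 2 (skipping bytes the declared charset cannot represent), not something suggested in the replies.

```ruby
# Hypothetical sample bytes: 0xA1 is a Thai letter in both TIS-620 and
# Windows-874; 0x92 is the curly apostrophe, which only Windows-874 defines.
raw = "\xA1\x92s announcement".b

# Label the bytes as Windows-874, then transcode them to UTF-8.
utf8 = raw.dup.force_encoding("Windows-874").encode("UTF-8")
puts utf8

# Alternative: keep the declared TIS-620 label and replace anything that
# cannot be converted with an empty string instead of raising an error.
lossy = raw.dup.force_encoding("TIS-620")
           .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
puts lossy
```

The first form keeps the apostrophe; the second silently discards data, so it only makes sense when the correct charset really is unknown.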