Asp Forum - Ruby 1.8.* convert string to utf-8

Pavel Drobushevich

8/18/2008 12:30:00 PM

Hi all,
I have a problem converting a string or file to UTF-8. The input file
can be in different encodings. Ruby 1.9 has String#force_encoding, but
in 1.8.* I only found Iconv.conv('UTF-8', from, contents), and I have
not found a way to determine the input encoding. I found two external
libraries:
1. rchardet - but this library detects UCS-2LE incorrectly :(
2. libcharguess - but I can't build the source into a library for Ruby
using extconf.rb.
Please help me find another way, or maybe somebody has a ready-made
libcharguess build for Ruby.
Many thanks in advance.
--
Posted via http://www.ruby-....

7 Answers

Axel Etzold

8/18/2008 1:39:00 PM

-------- Original Message --------
> Date: Mon, 18 Aug 2008 21:30:10 +0900
> From: Pavel Drobushevich <p.drobushevich@gmail.com>
> To: ruby-talk@ruby-lang.org
> Subject: Ruby 1.8.* convert string to utf-8

Dear Pavel,

You can use the "from" argument to specify the input encoding, like so:

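require 'iconv'

# Read the source file (here cp1251-encoded) as raw bytes.
s = IO.read('kknta10.txt')

# Iconv.iconv(to, from, *strs) returns an array of converted strings.
ic = Iconv.iconv('utf-8', 'cp1251', s)
# The block form closes the file even if writing fails.
File.open('t.txt', 'w') { |f| f.puts ic }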

Using this, I was able to convert a text by Pushkin from Project Gutenberg
(http://www.gutenberg.org/...), encoded in Windows cp1251, to UTF-8, and it opened
in OpenOffice as readable text.
You might also look here:

http://markmail.org/message/btgxcrle666bgiiq#query:iconv%20ruby%20encodings%20cyrillic+page:1+mid:bursga6zqh6kvjl3+sta...

Best regards,

Axel

Pavel Drobushevich

8/18/2008 3:23:00 PM

Dear Axel,
Thank you for your answer.
But maybe I didn't explain my problem well; I have some trouble with
English.

> require 'iconv'
>
> s =IO.read('kknta10.txt')
>
> ic = Iconv.iconv('utf-8', 'cp1251',s)
> f=File.new("t.txt","w")
> f.puts ic
> f.close
>

It's a good idea, and I used it. But I have many files in different
encodings (utf8, cp1251, ucs-2le, ...) and I need to convert all of them
to utf-8 with the same code. I need to identify each file's encoding at
run time rather than hard-code a constant per file, because the files
are generated by another system.
Thanks
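
For the Unicode members of that list, a cheap first check is possible before any statistical guessing: look for a byte-order mark, then test whether the bytes are well-formed UTF-8. The sketch below is only an illustration under assumptions, not something proposed in the thread. The helper name sniff_unicode is hypothetical, and it relies on String#unpack('U*') raising ArgumentError on malformed UTF-8. It returns nil for anything it cannot decide, such as cp1251 or UCS-2LE written without a BOM, and it ignores UTF-32.

```ruby
# Guess 'UTF-8', 'UTF-16LE' or 'UTF-16BE' from raw file contents,
# or return nil when the data looks like none of them.
def sniff_unicode(data)
  head = data.unpack('C3') # numeric values of the first (up to) three bytes

  # 1. Byte-order marks.
  return 'UTF-8'    if head == [0xEF, 0xBB, 0xBF]
  return 'UTF-16LE' if head[0, 2] == [0xFF, 0xFE]
  return 'UTF-16BE' if head[0, 2] == [0xFE, 0xFF]

  # 2. No BOM: decode as UTF-8; malformed input raises ArgumentError.
  begin
    codepoints = data.unpack('U*')
  rescue ArgumentError
    return nil
  end

  # NUL characters suggest BOM-less UCS-2/UTF-16 rather than UTF-8 text.
  codepoints.include?(0) ? nil : 'UTF-8'
end
```

A caller could pass a non-nil result as the "from" argument of Iconv and fall back to a single-byte guess when the result is nil.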

Axel Etzold

8/18/2008 7:05:00 PM

-------- Original Message --------
> Date: Tue, 19 Aug 2008 00:23:17 +0900
> From: Pavel Drobushevich <p.drobushevich@gmail.com>
> To: ruby-talk@ruby-lang.org
> Subject: Re: Ruby 1.8.* convert string to utf-8

Dear Pavel,

Maybe it's a good idea to ask this question on a specialised Russian-language
Ruby forum. Otherwise, I'd say that across (sufficiently long) Russian documents
the most frequent letters will (most often) be the same.

You could count the frequencies of the letters in your documents like so:

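# Tally how often each element occurs. Named "frequencies" rather than
# "count" so that it does not replace the built-in Array#count (Ruby 1.8.7+).
class Array
  def frequencies
    k = Hash.new(0)
    each { |x| k[x] += 1 }
    k
  end
end

s = IO.read(input_file_name).split(//).frequencies
p s
# Sort by descending count so the most frequent letters come first.
freq = s.sort { |x, y| y[1] <=> x[1] }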

I'd then convert, say, the five most frequent
letters into each of the possible encodings.

For each of your many files, count the frequencies in the same way
and select the encoding whose converted letters share the greatest number
of keys with that file's "five most frequent letters" hash.
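
A minimal sketch of this heuristic, under some loud assumptions: the candidates are limited to cp1251 and KOI8-R, the five most frequent Russian letters are taken to be о, е, а, и, н, and their byte values are hard-coded instead of being produced with Iconv. The names candidate_letter_bytes, top_high_bytes and guess_cyrillic_encoding are made up for this illustration.

```ruby
# Byte values of the letters о, е, а, и, н in each candidate encoding
# (assumed to be the five most frequent letters in Russian text).
def candidate_letter_bytes
  {
    'cp1251' => [0xEE, 0xE5, 0xE0, 0xE8, 0xED],
    'koi8-r' => [0xCF, 0xC5, 0xC1, 0xC9, 0xCE]
  }
end

# The n most frequent non-ASCII byte values in data, most frequent first.
def top_high_bytes(data, n = 5)
  counts = Hash.new(0)
  data.each_byte { |b| counts[b] += 1 if b >= 0x80 }
  counts.sort { |x, y| y[1] <=> x[1] }.first(n).map { |pair| pair[0] }
end

# Pick the candidate sharing the most bytes with the file's top bytes.
# Returns nil when nothing overlaps (for example, pure ASCII input).
def guess_cyrillic_encoding(data)
  top = top_high_bytes(data)
  best = candidate_letter_bytes.map { |name, bytes| [(bytes & top).size, name] }.max
  best[0] > 0 ? best[1] : nil
end
```

The result could then be passed as the "from" argument to Iconv. Short or atypical texts can easily fool a guess like this, so it is worth checking the converted output.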

Best regards,

Axel

Michal Suchanek

8/19/2008 4:00:00 AM

Have you tried enca?

http://dl.cihar.com/MI...

The functionality of this kind of software is necessarily limited, but if
you know you will only handle documents in Russian it should give pretty
good results.

HTH

Michal

Pavel Drobushevich

8/19/2008 12:02:00 PM

Dear Axel,
Thank you for the advice.
Dear Michal,
Thank you. It's a good utility.

But I prefer to use a simpler utility, tellenc (http://wyw...).

Michal Suchanek

8/19/2008 7:10:00 PM

On 19/08/2008, Pavel Drobushevich <p.drobushevich@gmail.com> wrote:
> But I prefer to use a simpler utility, tellenc (http://wyw...).

tellenc is limited to English and Chinese (and multibyte encodings for
other languages). At least it would seem so from the list of supported
encodings on the web page.

However, enca tries to leverage the fact that a language does not use
all positions in a single-byte encoding, so if you tell it the language it
can tell apart different single-byte encodings with good accuracy.
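
To illustrate the idea only (this is not enca's actual algorithm), here is a rough sketch that scores two candidate encodings by how many of a file's non-ASCII bytes land on positions where that encoding has Russian letters. The layouts are stated as assumptions: cp1251 letters at 0xC0-0xFF plus 0xA8/0xB8, and cp866 letters at 0x80-0xAF and 0xE0-0xF1. The names letter_positions, letter_fit and best_fit are hypothetical.

```ruby
# Byte positions assumed to hold Russian letters in each candidate encoding.
def letter_positions
  {
    'cp1251' => (0xC0..0xFF).to_a + [0xA8, 0xB8],
    'cp866'  => (0x80..0xAF).to_a + (0xE0..0xF1).to_a
  }
end

# Fraction of the non-ASCII bytes in data that fall on letter positions.
def letter_fit(data, positions)
  high = data.unpack('C*').select { |b| b >= 0x80 }
  return 0.0 if high.empty?
  high.select { |b| positions.include?(b) }.size.to_f / high.size
end

# Pick the encoding whose letter positions cover the most bytes, or nil
# when the data contains no non-ASCII bytes at all.
def best_fit(data)
  best = letter_positions.map { |enc, pos| [letter_fit(data, pos), enc] }.max
  best[0] > 0 ? best[1] : nil
end
```

Bytes that fall on box-drawing or otherwise unused positions pull an encoding's score down, which is what lets the two candidates be told apart once the language is known.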

Thanks

Michal

Pavel Drobushevich

8/20/2008 6:45:00 AM

Dear Michal,

Thanks for the explanation. For my task English alone is enough, but in
the future I will take a look at enca.

Regards,
Pavel.

comp.lang.ruby
