comp.lang.ruby

Unicode illegal characters problem

Axel Etzold

11/3/2007 3:38:00 PM

Dear all,

When using Iconv, I keep running into
problems.
I tried to run this bit of code:

#!/usr/bin/env ruby
$KCODE = 'u'
require 'iconv'

s = 'caffè'

ic_ignore = Iconv.new('US-ASCII//IGNORE', 'UTF-8')
puts ic_ignore.iconv(s) # => caff

ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s) # => caff`e

(from here:
http://www.ruby-forum.com/t...),
but instead of the promised result in the comments above,
I am getting:

corr_ebook.rb:29:in `iconv': "\351" (Iconv::InvalidCharacter)
from corr_ebook.rb:29

Why?
I am using ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux] on OpenSuse 10.2.

Thank you very much!

Best regards,

Axel


15 Answers

jh+ruby-lang

11/3/2007 3:49:00 PM

Your data is not UTF-8. Judging from the error, it is most likely ISO-8859-1 (or -15).

man iso_8859-1 shows octal 351 as expected:

Oct Dec Hex Char Description
351 233 E9 é LATIN SMALL LETTER E WITH ACUTE

-jh
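
Jonathan's diagnosis can be checked without Iconv. What follows is only a sketch, and it assumes Ruby 1.9 or later rather than the 1.8.6 used in this thread (String#force_encoding, #valid_encoding? and #encode do not exist on 1.8). The helper name latin1_to_utf8 is made up for illustration.

```ruby
# Sketch only: assumes Ruby 1.9+; latin1_to_utf8 is a hypothetical helper name.

# Reinterpret raw bytes as ISO-8859-1 and transcode them to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

raw = "caf\xE9" # "caf" followed by the single byte 0xE9 (octal 351)

# Taken as UTF-8, the lone byte 0xE9 is an incomplete sequence.
puts raw.dup.force_encoding(Encoding::UTF_8).valid_encoding? # prints false

fixed = latin1_to_utf8(raw)
puts fixed.valid_encoding?                                   # prints true
puts fixed.bytes.map { |b| format('%o', b) }.join(' ')       # prints 143 141 146 303 251
```

The last line shows the two-byte UTF-8 form (octal 303 251) that Iconv expects when it is told the input is UTF-8.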

Axel Etzold

11/3/2007 4:21:00 PM

Dear Jonathan,

Thanks for the hint, you are right. I corrected the encoding
of the file I read the text from:

$KCODE = 'u'
require 'iconv'
s = IO.read("/home/axel/text.txt") # whole file as one string
p s # => 'caffè'

ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s) # => caff`e

However, I now get "caff?" instead of the promised "caff`e".

I have several novel-length texts to convert, with many
different accents.

Thanks for helping me again!

Best regards

Axel

jh+ruby-lang

11/3/2007 4:44:00 PM

I believe that's a "feature" of ruby iconv.

$ echo café | iconv -f UTF-8 -t ASCII//TRANSLIT
cafe

while

s="café"
ic_translit = Iconv.new('ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s)
=> caf?

-jonathan

Axel Etzold

11/3/2007 5:06:00 PM

Dear Jonathan,

> I believe that's a "feature" of ruby iconv.

thanks for your clarifications!

Best regards,

Axel

jh+ruby-lang

11/3/2007 5:40:00 PM

Further, it's a *feature* of iconv on **Linux**. On my FreeBSD box I
get the expected results, both from iconv in a shell and from Ruby => caf'e.

Since, on Linux, the iconv application produces better results than Ruby's
iconv, I tend to pipe data through iconv; at least I get a semblance
of usability that way.

-jonathan
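
Piping through the iconv program, as Jonathan describes, might look roughly like this from Ruby. This is a sketch under assumptions: it needs Ruby 1.9.2 or later for Open3.capture2, an iconv executable on PATH, and the helper name iconv_translit is hypothetical. Only the -f and -t options already shown in this thread are used.

```ruby
require 'open3'

# Sketch only: iconv_translit is a hypothetical helper name. It assumes an
# iconv executable on PATH that accepts the -f/-t options used in the thread.
def iconv_translit(text, from = 'UTF-8', to = 'ASCII//TRANSLIT')
  output, status = Open3.capture2('iconv', '-f', from, '-t', to, stdin_data: text)
  raise "iconv exited with status #{status.exitstatus}" unless status.success?
  output
end

# The result depends on the platform and locale, so no expected output is shown.
begin
  puts iconv_translit("caf\u00E9")
rescue SystemCallError, RuntimeError => e
  warn "iconv not usable here: #{e.message}"
end
```

Spawning one process per string is slow for novel-length texts, so it is probably better to pass each whole file through in a single call.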

Carlos

11/3/2007 9:01:00 PM

It's not iconv, it's your locale data (which iconv uses). In German, "ü" is
probably transliterated to ASCII as "ue"; in Spanish, as "u". There isn't a
single way to do it, and the rules are encoded in the system locale files.

Now, why Ruby's iconv gives a different result from the iconv program... I
don't know. Maybe Ruby hides some LC_* environment variables from the
library (a wild and probably incorrect guess)...

Summing up: don't use iconv to transliterate to ASCII; build your own table
instead. (It's easy: the descriptions of all Latin letters with diacritics
follow the same pattern.)

Good luck.

--
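
A hand-built table of the kind Carlos suggests could be sketched as follows. The table is deliberately tiny, the replacements are just one possible choice (German-style "ue" for "ü"), and the names ASCII_FALLBACKS and to_ascii are hypothetical. It assumes Ruby 1.9 or later with UTF-8 strings.

```ruby
# Sketch only: a deliberately small table; ASCII_FALLBACKS and to_ascii are
# hypothetical names. Assumes Ruby 1.9+ and UTF-8 input.
ASCII_FALLBACKS = {
  "\u00E0" => 'a',  # LATIN SMALL LETTER A WITH GRAVE
  "\u00E8" => 'e',  # LATIN SMALL LETTER E WITH GRAVE
  "\u00E9" => 'e',  # LATIN SMALL LETTER E WITH ACUTE
  "\u00F1" => 'n',  # LATIN SMALL LETTER N WITH TILDE
  "\u00FC" => 'ue'  # LATIN SMALL LETTER U WITH DIAERESIS (German convention)
}.freeze

# Replace every non-ASCII character using the table; unknown ones become "?".
def to_ascii(text)
  text.gsub(/[^[:ascii:]]/) { |char| ASCII_FALLBACKS.fetch(char, '?') }
end

puts to_ascii("caff\u00E8")   # prints caffe
puts to_ascii("M\u00FCnchen") # prints Muenchen
puts to_ascii("\u00E7a")      # prints ?a
```

Because the table is explicit, the output no longer depends on the system locale.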

7stud --

11/3/2007 10:39:00 PM

Axel Etzold wrote:
>>Jonathan Hudson wrote:
>>
>> I believe that's a "feature" of ruby iconv.
>>
>> $ echo café | iconv -f UTF-8 -t ASCII//TRANSLIT
>> cafe
>>
>> while
>>
>> s="café"
>> ic_translit = Iconv.new('ASCII//TRANSLIT', 'UTF-8')
>> puts ic_translit.iconv(s)
>> => caf?

> Dear Jonathan,
>
>> I believe that's a "feature" of ruby iconv.
>
> thanks for your clarifications!
>

How does that clarify things for you? I read the other thread, and that
doesn't clarify anything for me. Are you simply interpreting Jonathan
Hudson's statement to mean the other thread is wrong?

Also, I don't think it is very helpful to include every possible Unicode
statement you can think of in an attempt to solve Unicode problems. For
instance, this line:

$KCODE = 'u'

Why are you including that line in your program? According to The Ruby
Way (2nd ed.), p. 141,

"...$KCODE...determines the behavior of many core methods that
manipulate strings."

However, in the code you posted, as far as I can tell, you aren't
calling any methods whose behavior $KCODE changes. Do you
just include that line any time you are dealing with Unicode, or did you
include it for some specific reason?

Thanks.


--
Posted via http://www.ruby-....

7stud --

11/3/2007 10:57:00 PM

Another data point:

require 'iconv'

s = "caf\_x_c3\_x_a9"
#The last char is the utf-8 encoding in hex format for 'e' with acute
#I added the underscores so that the encoding won't be rendered
#into the actual character

puts s

#I see cafe where the 'e' is an 'e' with acute, which means my
#display device understands utf-8.

ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s)

#I see: caf'e



7stud --

11/3/2007 11:06:00 PM

Try running this code:

require 'iconv'

s = "caf\_x_c3\_x_a9" #remove underscores
p s
#I see: caf\_303\_251 (without the underscores)
#\_303\_251 (without the underscores) is the utf-8
#encoding in octal format. I really hate that ruby
#displays octal format instead of hex format!


ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
new_s = ic_translit.iconv(s)

p new_s #I see: caf'e
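
For readers who share the dislike of octal escapes, the bytes can be printed in hex by hand. This is a small sketch; the helper name hex_bytes is hypothetical.

```ruby
# Sketch only: hex_bytes is a hypothetical helper name.
# String#unpack with 'C*' yields one unsigned integer per byte.
def hex_bytes(string)
  string.unpack('C*').map { |byte| format('%02x', byte) }.join(' ')
end

puts hex_bytes("caf\xc3\xa9") # prints 63 61 66 c3 a9
```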

Axel Etzold

11/3/2007 11:58:00 PM

Dear 7stud,

thanks for the effort you are putting into
solving this problem.
When I thanked Jonathan for his clarifications,
I meant that the solution I had hoped to get from
the thread the code originally came from isn't
going to work for me as easily as I thought.
I do indeed get different behaviour from the system
iconv and Ruby's Iconv, as Jonathan said.
With the code you sent me, I get:

require 'iconv'

s = "caf\xc3\xa9" # (underscores removed)
p s # => "caf\303\251"
ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
new_s = ic_translit.iconv(s)
p new_s # => "caf?"


What system are you on?
Mine is Linux (OpenSuse 10.2, 64-bit) with Ruby 1.8.6.
If I had your behaviour on my system, this transliteration
would provide a nice conversion of the accents to LaTeX,
but now I think I'll write maybe two dozen gsub lines ...
unless there already is some script that does
a Unicode-name-to-LaTeX-accent conversion, something like

small latin letter <lettername> with acute => \'{<lettername>} ?

Best regards,

Axel
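
The conversion Axel asks about could be sketched without a long list of gsub lines. Note the swapped-in technique: instead of parsing Unicode character names, this uses Unicode canonical decomposition (NFD), which splits an accented letter into a base letter plus a combining mark. It assumes Ruby 2.2 or later for String#unicode_normalize; LATEX_ACCENTS and latex_accents are hypothetical names, and the table covers only a handful of marks.

```ruby
# Sketch only: assumes Ruby 2.2+ (String#unicode_normalize) and UTF-8 input.
# LATEX_ACCENTS and latex_accents are hypothetical names.
LATEX_ACCENTS = {
  "\u0300" => '`', # COMBINING GRAVE ACCENT
  "\u0301" => "'", # COMBINING ACUTE ACCENT
  "\u0302" => '^', # COMBINING CIRCUMFLEX ACCENT
  "\u0303" => '~', # COMBINING TILDE
  "\u0308" => '"', # COMBINING DIAERESIS
  "\u0327" => 'c'  # COMBINING CEDILLA
}.freeze

# Decompose each accented letter, then rewrite "letter + mark" as \mark{letter}.
# Marks missing from the table are left in place (in decomposed form), and
# only one mark per letter is handled.
def latex_accents(text)
  text.unicode_normalize(:nfd).gsub(/([A-Za-z])(\p{Mn})/) do
    base = Regexp.last_match(1)
    mark = Regexp.last_match(2)
    command = LATEX_ACCENTS[mark]
    command ? "\\#{command}{#{base}}" : base + mark
  end
end

puts latex_accents("caff\u00E8")   # prints caff\`{e}
puts latex_accents("caf\u00E9")    # prints caf\'{e}
puts latex_accents("M\u00FCnchen") # prints M\"{u}nchen
```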
