Asp Forum - Reliable character encodings conversion

Hubert Lepicki

9/30/2008 12:30:00 PM

Hi,

I am looking for a reliable and error-resistant way to convert character
encodings to UTF8. Input encodings vary, and I have quite good input
encoding detection in place.

I am using Iconv library wrapper to convert texts to UTF8, but it's
throwing "Iconv::IllegalSequence" exception. The problem is that input
texts are user-generated and sometimes have mixed character
encodings.

Does anyone have any experience with this kind of situation, or can
suggest alternative libraries?

Thanks,
Hubert

-- 
Regards,
Hubert Łępicki
 -----------------------------------------------
[ http://hubertlepicki.com ]

3 Answers

James Gray

9/30/2008 1:04:00 PM

On Sep 30, 2008, at 7:30 AM, Hubert Łępicki wrote:

> I am using Iconv library wrapper to convert texts to UTF8, but it's
> throwing "Iconv::IllegalSequence" exception.

You can add a //TRANSLIT to the end of the "to" encoding to have Iconv
attempt to convert characters to reasonable equivalents in that
encoding. This is usually more helpful when your input is all one
encoding and just has some characters that won't translate well (like
a UTF-8 … going to ISO-8859-1).

Your case of mixed encodings is probably best handled with //IGNORE
instead, which asks Iconv to skip over any characters that cannot be
converted. You will lose some data with this, but it will convert
what it can.

You can also use //TRANSLIT//IGNORE to convert what can be converted
and skip the rest.
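
For readers on Ruby 1.9 or later, where String#encode is built in, the same two ideas can be sketched without Iconv. This is a stand-in technique, not Iconv itself: the method names and the LOOKALIKES table are hypothetical, and the table is a tiny hand-picked sample rather than a real transliteration map.

```ruby
# Assumption: Ruby 1.9+ (String#encode). This imitates Iconv's suffixes;
# it does not call Iconv.

# Hypothetical sample of look-alike substitutions.
LOOKALIKES = { "\u2026" => "...", "\u2019" => "'" }.freeze

# Roughly what //IGNORE asks for: drop whatever the target encoding lacks.
def convert_dropping(text, to)
  text.encode(to, invalid: :replace, undef: :replace, replace: "")
end

# Roughly what //TRANSLIT asks for: use a look-alike where one is known,
# and fall back to "?" for anything else the target encoding lacks.
def convert_with_lookalikes(text, to)
  text.encode(to, fallback: ->(char) { LOOKALIKES.fetch(char, "?") })
end

puts convert_dropping("Wait\u2026 what?", "ISO-8859-1")        # Wait what?
puts convert_with_lookalikes("Wait\u2026 what?", "ISO-8859-1") # Wait... what?
```

Dropping loses data silently, so the look-alike variant is usually easier to audit afterwards, since every "?" marks a character that did not survive.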

Hope that helps.

James Edward Gray II

James Gray

9/30/2008 1:59:00 PM

On Sep 30, 2008, at 8:20 AM, Hubert Łępicki wrote:

> 2008/9/30 James Gray <james@grayproductions.net>:
>> On Sep 30, 2008, at 7:30 AM, Hubert Łępicki wrote:
>>
>>> I am using Iconv library wrapper to convert texts to UTF8, but it's
>>> throwing "Iconv::IllegalSequence" exception.
>>
>> You can add a //TRANSLIT to the end of the "to" encoding to have Iconv
>> attempt to convert characters to reasonable equivalents in that encoding.
>> This is usually more helpful when your input is all one encoding and just
>> has some characters that won't translate well (like a UTF-8 … going to
>> ISO-8859-1).
>>
>> Your case of mixed encodings is probably best handled with //IGNORE instead,
>> which asks Iconv to skip over any characters that cannot be converted. You
>> will lose some data with this, but it will convert what it can.
>>
>> You can also use //TRANSLIT//IGNORE to convert what can be converted and
>> skip the rest.
>>
>
> Thanks, //IGNORE//TRANSLIT seems to help a bit - but it's not perfect.

You listed those backwards. Is that really what you tried? Does
reversing them make any difference?

James Edward Gray II

Marcin Raczkowski

9/30/2008 2:34:00 PM

You can use the RChardet library.

Here's what I use:

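require 'iconv'
require 'rchardet'

# Written for Ruby 1.8. From Ruby 1.9 on, String#encoding is built in,
# so these method names would clash with the core class.
class String
  # Encoding name of this string, guessed on first use.
  def encoding
    @encoding ||= guess_encoding
  end

  def encoding=(new)
    @encoding = new
  end

  # Convert in place from the current (or guessed) encoding to +new+.
  def convert_to(new)
    self.replace(Iconv.iconv(new, encoding, self)[0])
    @encoding = new
  end

  # Ask rchardet for the most likely encoding name.
  def guess_encoding
    @encoding = CharDet.guess(self)["encoding"]
  end

  # This enables "foo".convert 'us-ascii' => 'utf-8'
  def convert(hash)
    from = hash.keys[0]
    to = hash[from]
    self.replace(Iconv.iconv(to, from, self)[0])
    @encoding = to # keep the cached name in sync with the new bytes
  end
end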

It handles translating pretty well :)
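
The helper above converts a whole string from a single guessed encoding. For input where the encoding changes from line to line, one possible approach is to decide per line instead. The sketch below is only that, a sketch: it assumes Ruby 2.1 or later (for String#scrub), it swaps detection for a fixed list of candidate encodings tried in order, and the names to_utf8 and lines_to_utf8 and the default candidate list are hypothetical.

```ruby
# Assumption: Ruby 2.1+. Candidate encodings are tried in order; this is a
# swapped-in heuristic, not rchardet's detection.
DEFAULT_CANDIDATES = ["UTF-8", "ISO-8859-2"].freeze

# Return a valid UTF-8 copy of +bytes+, using the first candidate encoding
# under which the bytes form a valid string.
def to_utf8(bytes, candidates = DEFAULT_CANDIDATES)
  candidates.each do |name|
    attempt = bytes.dup.force_encoding(name)
    next unless attempt.valid_encoding?
    return attempt.encode("UTF-8", invalid: :replace, undef: :replace)
  end
  # No candidate fit: keep what is valid UTF-8 and mark the rest with "?".
  bytes.dup.force_encoding("UTF-8").scrub("?")
end

# Apply the same decision to each line separately, so a UTF-8 line and an
# ISO-8859-2 line in the same text are both converted.
def lines_to_utf8(text, candidates = DEFAULT_CANDIDATES)
  text.b.each_line.map { |line| to_utf8(line, candidates) }.join
end
```

Trying UTF-8 first works because text in a single-byte encoding rarely happens to be valid UTF-8, but it can happen, and a line that mixes encodings internally will still be misread. Treat the result as best effort.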

comp.lang.ruby
