comp.lang.ruby

Re: How are people making use of Iconv?

Andreas S.

12/21/2005 10:57:00 AM

Wilson Bilkovich wrote:
> Since Iconv jumped out of the pond and chewed on my leg the other
> week, I've been toying with the idea of a character-set conversion
> library implemented totally in Ruby, with identical behavior on every
> platform.
> However, I'm only using Iconv for simple things, like converting my
> music tags from Shift-JIS to UTF-8.

Well, that's all that Iconv is supposed to be used for.

> What 'serious' things are people using this for? Are there any unit
> tests? Any gems on RubyForge I can download containing projects that
> make use of Iconv?

Rails uses Iconv, at least in ActionMailer.

> What do you hate about Iconv?

I dislike that Iconv raises an exception when it finds characters it can
not convert. I would prefer if it could be made to ignore invalid
characters and just try to make the best of the text.

--
Posted via http://www.ruby-....


6 Answers

Paul Duncan

12/21/2005 2:55:00 PM


* Andreas S. (f@andreas-s.net) wrote:
[snipped]
> I dislike that Iconv raises an exception when it finds characters it can
> not convert. I would prefer if it could be made to ignore invalid
> characters and just try to make the best of the text.

Seconded, Thirded, and Quadrupled.

Iconv needs an "as close as I could get with transliteration and ignoring
invalid characters" mode.

We're doing something comparable in Raggle by trapping the exception and
stripping out the invalid character. Obviously this doesn't work
properly for multibyte characters, and you won't be able to use a lookup
table for arbitrary source encodings, but it's a start.

begin
  # convert element_text to native charset (note: in this case we're
  # converting from utf-8 to the native charset, but the only thing
  # about the code that's utf-8 specific is the assumption about
  # character width and the unicode lookup table below)
  ret = $iconv.iconv(element_text) << $iconv.iconv(nil)
rescue Iconv::IllegalSequence => e
  # save the portion of the string that was successful, the
  # invalid character, and the remaining (pending) string
  success_str = e.success
  ch, pending_str = e.failed.split(//, 2)

  # code point of the failed character (String#to_i would parse decimal
  # digits and return 0 for anything else); this assumes split(//)
  # yielded a whole utf-8 character, e.g. with $KCODE = 'u'
  ch_int = ch.unpack('U').first

  # see if we have a map for that character
  if String::UNICODE_LUT.has_key?(ch_int)
    # we have a mapping for this character, so convert it and
    # re-process the string

    # log status
    err_str = _('converting unicode')
    $log.warn(meth) { "#{err_str} ##{ch_int}" }

    # create new string, with the bad character mapped
    element_text = success_str + String::UNICODE_LUT[ch_int] + pending_str
  else
    if $config['iconv_munge_illegal']
      # munge the illegal character with a safe string

      # log status
      err_str = _('munging unicode')
      $log.warn(meth) { "#{err_str} ##{ch_int}" }

      # create new string, with the bad character munged
      munge_str = $config['unicode_munge_str']
      element_text = success_str + munge_str + pending_str
    else
      # just drop the character altogether

      # log status
      err_str = _('dropping unicode')
      $log.warn(meth) { "#{err_str} ##{ch_int}" }

      # create new string, sans the bad character
      element_text = success_str + pending_str
    end
  end

  # run the begin block again with the patched string
  retry
end

Not a perfect solution, but it helps a bit.

--
Paul Duncan <pabs@pablotron.org> pabs in #ruby-lang (OPN IRC)
http://www.pabl... OpenPGP Key ID: 0x82C29562

Wilson Bilkovich

12/21/2005 3:55:00 PM


On 12/21/05, Paul Duncan <pabs@pablotron.org> wrote:
> * Andreas S. (f@andreas-s.net) wrote:
> [snipped]
> > I dislike that Iconv raises an exception when it finds characters it can
> > not convert. I would prefer if it could be made to ignore invalid
> > characters and just try to make the best of the text.
>
> Seconded, Thirded, and Quadrupled.
>
> Iconv needs a "as close as I could get with transliteration and ignoring
> invalid characters" mode.
>
> We're doing something comparable in Raggle by trapping the exception and
> stripping out the invalid character. Obviously this doesn't work
> properly for multibyte characters, and you won't be able to use a lookup
> table for arbitrary source encodings, but it's a start.
>
<snip interesting code>

What if String just had a couple of new methods on it:
  String#transcode(from_encoding, to_encoding)
..and
  String#transcode!(from_encoding, to_encoding)
..and the "modifies receiver" version returned true or false,
depending on whether it managed to convert every character?
Then you could do:
  unless some_string.transcode!('Shift-JIS', 'UTF-8')
    puts "Some characters got mangle-fied!"
  end

Is that a mess? I kinda like it, at first glance.
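
For illustration, the proposal above could be sketched roughly as follows. This is only a sketch under stated assumptions: neither method exists in Ruby, the names are taken from the proposal, and it leans on String#encode (Ruby 1.9 and later) rather than Iconv. Dropping unconvertible characters in the lossy path is also an assumption; the proposal does not say what should happen to them.

```ruby
# Hypothetical sketch of the proposed API, built on String#encode.
# Neither method is part of Ruby; the names come from the proposal.
class String
  # Returns a new string converted from from_encoding to to_encoding.
  # Characters that cannot be converted are dropped (an assumption).
  def transcode(from_encoding, to_encoding)
    encode(to_encoding, from_encoding,
           invalid: :replace, undef: :replace, replace: '')
  end

  # Converts the receiver in place. Returns true if every character
  # converted cleanly, false if anything had to be dropped.
  def transcode!(from_encoding, to_encoding)
    clean = true
    converted = begin
      # strict conversion: raises on the first problem character
      encode(to_encoding, from_encoding)
    rescue Encoding::UndefinedConversionError,
           Encoding::InvalidByteSequenceError
      clean = false
      transcode(from_encoding, to_encoding)
    end
    replace(converted)
    clean
  end
end

tag = "caf\xE9".b                      # Latin-1 bytes for "café"
unless tag.transcode!('ISO-8859-1', 'UTF-8')
  puts "Some characters got mangle-fied!"
end
```

Returning a boolean keeps the calling code short, at the cost of not saying which characters were lost.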


Paul Duncan

12/21/2005 5:33:00 PM


* Wilson Bilkovich (wilsonb@gmail.com) wrote:
> On 12/21/05, Paul Duncan <pabs@pablotron.org> wrote:
> > * Andreas S. (f@andreas-s.net) wrote:
> > [snipped]
> > > I dislike that Iconv raises an exception when it finds characters it can
> > > not convert. I would prefer if it could be made to ignore invalid
> > > characters and just try to make the best of the text.
> >
> > Seconded, Thirded, and Quadrupled.
> >
> > Iconv needs a "as close as I could get with transliteration and ignoring
> > invalid characters" mode.
> >
> > We're doing something comparable in Raggle by trapping the exception and
> > stripping out the invalid character. Obviously this doesn't work
> > properly for multibyte characters, and you won't be able to use a lookup
> > table for arbitrary source encodings, but it's a start.
> >
> <snip interesting code>
>
> What if String just had a couple of new methods on it:
> String#transcode(from_encoding, to_encoding)
> ..and
> String#transcode!(from_encoding, to_encoding)
> ..and the "modifies receiver" version returned true or false,
> depending on whether it managed to convert every character?
> Then you could do:
> unless some_string.transcode!('Shift-JIS', 'UTF-8')
> puts "Some characters got mangle-fied!"
> end
>
> Is that a mess? I kinda like it, at first glance.

I know a future version of Ruby (2.0?) will make a distinction between
strings as arrays of bytes and strings as sets of characters with an
encoding (with the former being an obvious superset of the latter), so
I'm not sure how well that method would work with the new way of
handling strings.

That said, I like the idea, although I'd like an optional block to
handle unknown characters. I'd also add a hash as an optional third
argument that lets you toggle transliteration, munging, and
exception behavior.
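
A rough sketch of what the block and options hash might look like, again assuming String#encode (Ruby 1.9 and later) as the engine. The module name Transcode, the option keys :munge and :raise, and the block protocol are all made up for illustration. Transliteration is left out because String#encode has no general equivalent for it.

```ruby
# Hypothetical helper: the module name, option keys, and block protocol
# are assumptions made for illustration, built on String#encode.
module Transcode
  # options[:munge] - replacement string for unconvertible characters
  # options[:raise] - when false, unconvertible characters are dropped
  # A block, if given, takes priority and receives each character that
  # has no equivalent in the target encoding; its return value is
  # inserted in that character's place.
  def self.convert(text, from, to, options = {}, &handler)
    enc_opts = {}
    if handler
      enc_opts[:fallback] = handler
    elsif options.key?(:munge)
      enc_opts.update(undef: :replace, replace: options[:munge])
    elsif options[:raise] == false
      enc_opts.update(undef: :replace, replace: '')
    end
    # with no options and no block, String#encode raises as usual
    text.encode(to, from, **enc_opts)
  end
end

# block form: describe each unknown character instead of failing
Transcode.convert("a\u65E5b", 'UTF-8', 'US-ASCII') { |ch| "[U+%04X]" % ch.ord }

# hash form: munge unknown characters with a safe string
Transcode.convert("a\u65E5b", 'UTF-8', 'US-ASCII', munge: '?')
```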

--
Paul Duncan <pabs@pablotron.org> pabs in #ruby-lang (OPN IRC)
http://www.pabl... OpenPGP Key ID: 0x82C29562

Christian Neukirchen

12/21/2005 7:25:00 PM


Paul Duncan <pabs@pablotron.org> writes:

> * Andreas S. (f@andreas-s.net) wrote:
> [snipped]
>> I dislike that Iconv raises an exception when it finds characters it can
>> not convert. I would prefer if it could be made to ignore invalid
>> characters and just try to make the best of the text.
>
> Seconded, Thirded, and Quadrupled.
>
> Iconv needs a "as close as I could get with transliteration and ignoring
> invalid characters" mode.

Can't you just use //IGNORE?
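
A minimal sketch of the //IGNORE suffix in use. The helper name to_ascii_ignoring and the UTF-8 to US-ASCII pairing are assumptions for illustration. Iconv shipped with Ruby through 1.9 and has been a separate gem since 2.0, so the sketch falls back to String#encode when Iconv cannot be loaded. How strictly //IGNORE is honored depends on the underlying iconv library, so treat this as a starting point rather than a guarantee.

```ruby
begin
  require 'iconv'   # bundled through Ruby 1.9; a separate gem since 2.0
rescue LoadError
  # Iconv not installed; the String#encode branch below is used instead
end

# Hypothetical helper: convert UTF-8 text to US-ASCII, silently dropping
# anything that has no ASCII equivalent instead of raising.
def to_ascii_ignoring(text)
  if defined?(Iconv)
    # the //IGNORE suffix on the target charset asks iconv to skip
    # characters it cannot represent
    Iconv.conv('US-ASCII//IGNORE', 'UTF-8', text)
  else
    text.encode('US-ASCII', 'UTF-8',
                invalid: :replace, undef: :replace, replace: '')
  end
end

puts to_ascii_ignoring("a\u65E5b")
```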

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneuk...


Paul Duncan

12/21/2005 7:31:00 PM


* Christian Neukirchen (chneukirchen@gmail.com) wrote:
> Paul Duncan <pabs@pablotron.org> writes:
>
> > * Andreas S. (f@andreas-s.net) wrote:
> > [snipped]
> >> I dislike that Iconv raises an exception when it finds characters it can
> >> not convert. I would prefer if it could be made to ignore invalid
> >> characters and just try to make the best of the text.
> >
> > Seconded, Thirded, and Quadrupled.
> >
> > Iconv needs a "as close as I could get with transliteration and ignoring
> > invalid characters" mode.
>
> Can't you just use //IGNORE?

I wasn't aware of "//IGNORE". I'll check it out. Thanks!

--
Paul Duncan <pabs@pablotron.org> pabs in #ruby-lang (OPN IRC)
http://www.pabl... OpenPGP Key ID: 0x82C29562

Paul Duncan

12/21/2005 7:35:00 PM


* Christian Neukirchen (chneukirchen@gmail.com) wrote:
> Paul Duncan <pabs@pablotron.org> writes:
>
> > * Andreas S. (f@andreas-s.net) wrote:
> > [snipped]
> >> I dislike that Iconv raises an exception when it finds characters it can
> >> not convert. I would prefer if it could be made to ignore invalid
> >> characters and just try to make the best of the text.
> >
> > Seconded, Thirded, and Quadrupled.
> >
> > Iconv needs a "as close as I could get with transliteration and ignoring
> > invalid characters" mode.
>
> Can't you just use //IGNORE?

You, sir, are a genius. That works great here.

--
Paul Duncan <pabs@pablotron.org> pabs in #ruby-lang (OPN IRC)
http://www.pabl... OpenPGP Key ID: 0x82C29562