Michal Suchanek
2/26/2008 2:37:00 AM
On 25/02/2008, Simone Carletti <weppos@gmail.com> wrote:
> > I use it as a fallback mechanism when I can't reliably get the original
> > charset from feeds.
>
>
> That's a great example, thank you.
> Unfortunately I don't have a real charset header to check. :( I must
> rely only on input string.
You can ask a crystal ball as well.
Multibyte encodings can often be distinguished by their structure
- UTF-8, perhaps UTF-16, the Asian encodings. If something passes as
a valid string in a multibyte encoding, it is very likely a string in
that encoding.
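As a rough sketch of that structural check (Python purely for illustration; the helper names are made up for this example), a strict decode attempt is usually enough for UTF-8. Note that pure ASCII also passes, so the result only tells you something when bytes with the 8th bit set are present.

```python
def has_high_bytes(data: bytes) -> bool:
    """True if any byte has the 8th bit set (i.e. the input is not plain ASCII)."""
    return any(b >= 0x80 for b in data)


def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes decode as strict UTF-8.

    Random 8-bit text rarely forms valid UTF-8 multibyte sequences,
    so a successful decode of non-ASCII input is strong evidence.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_sample = "café".encode("utf-8")      # é becomes a two-byte sequence
latin_sample = "café".encode("latin-1")   # é becomes the single byte 0xE9

print(looks_like_utf8(utf8_sample), has_high_bytes(utf8_sample))
print(looks_like_utf8(latin_sample), has_high_bytes(latin_sample))
```

The Latin-1 sample fails because 0xE9 announces a multibyte sequence that never follows.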
However, the Latin 8-bit encodings all look the same - 7-bit ASCII with
some mess attached in the upper 128 characters. Converting from any
of these gives you perfectly valid UTF-8, but different gibberish each
time. You can sometimes tell the ISO variant from the Windows variant
because the ISO encodings keep control characters at 0x80-0x9F, where
the Windows code pages put printable characters - and control characters
should not appear in text. But that only gets you so far - you
still don't know which of the Latin encodings you got.
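To make this concrete, here is a small sketch (Python again, with made-up helper names): the same byte decodes without error under several Latin encodings but gives a different character each time, and a check for bytes in the 0x80-0x9F range is about the only structural hint available. The list of candidate encodings is just an assumption for the example.

```python
CANDIDATES = ("iso-8859-1", "iso-8859-2", "cp1250", "cp1252")


def decode_all(data: bytes, encodings=CANDIDATES) -> dict:
    """Decode the same bytes under each encoding; skip any that reject them."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            pass  # a few bytes are unassigned in some Windows code pages
    return results


def has_c1_range_bytes(data: bytes) -> bool:
    """True if any byte falls in 0x80-0x9F.

    ISO 8859-x reserves that range for control characters, which should
    not occur in text, so a hit suggests a Windows code page instead.
    """
    return any(0x80 <= b <= 0x9F for b in data)


# One high byte: every candidate accepts it, but they disagree on what it is.
for enc, text in decode_all(b"\xa1").items():
    print(enc, repr(text))

# Bytes 0x93/0x94 are curly quotes in cp1252 but control codes in ISO 8859-1.
print(has_c1_range_bytes(b"\x93quoted\x94"))
```

This only separates the ISO family from the Windows family; it says nothing about which Latin variant within a family was used.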
If you know the language (and it's one of the few supported) you can
use enca. If the language is not supported you can build the filter
yourself - basically you collect the set of accented characters (those
with the 8th bit set) in your language, and encode them in the different
encodings (the DOS and Windows code pages, the ISO encoding, any other
legacy encodings). You get sets of bytes that will usually overlap
but will also contain some unique bytes. When you see one of those bytes
you know which encoding to use.
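A minimal sketch of such a filter, assuming Czech as the language and only two candidate encodings (Python for illustration; the function names and the letter list are made up for this example and are not how enca does it):

```python
def build_signatures(accented: str, encodings) -> dict:
    """Map each encoding to the high bytes that only it uses for these letters."""
    byte_sets = {}
    for enc in encodings:
        seen = set()
        for ch in accented:
            try:
                encoded = ch.encode(enc)
            except UnicodeEncodeError:
                continue  # this encoding lacks the letter
            seen.update(b for b in encoded if b >= 0x80)
        byte_sets[enc] = seen

    signatures = {}
    for enc, own in byte_sets.items():
        # Everything the other encodings use for the same letters.
        others = set().union(*(s for e, s in byte_sets.items() if e != enc))
        signatures[enc] = own - others
    return signatures


def guess_encoding(data: bytes, signatures: dict):
    """Return the one encoding whose unique bytes occur in data, else None."""
    hits = {enc for enc, sig in signatures.items() if any(b in sig for b in data)}
    return hits.pop() if len(hits) == 1 else None


# Assumed input: the accented letters of Czech, lower and upper case.
CZECH_ACCENTED = "áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ"
SIGNATURES = build_signatures(CZECH_ACCENTED, ["iso-8859-2", "cp1250"])

print(guess_encoding("šašek".encode("cp1250"), SIGNATURES))
print(guess_encoding("šašek".encode("iso-8859-2"), SIGNATURES))
print(guess_encoding("čáp".encode("cp1250"), SIGNATURES))
```

The last call cannot decide: both encodings put those particular letters at the same bytes, so no unique byte shows up, and the function returns None rather than guessing.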
Good luck :-)
Michal