comp.lang.python

Sniffing encoding type by looking at file BOM header

python

3/24/2010 2:53:00 PM

I assume there's no standard library function that wraps
codecs.open() to sniff a file's BOM header and open the file with
the appropriate encoding?

My reading of the docs leads me to believe that there are 5
possible types of BOM header, with multiple names (synonyms?)
for the same BOM encoding type.

BOM = '\xff\xfe'
BOM_LE = '\xff\xfe'
BOM_UTF16 = '\xff\xfe'
BOM_UTF16_LE = '\xff\xfe'

BOM_BE = '\xfe\xff'
BOM32_BE = '\xfe\xff'
BOM_UTF16_BE = '\xfe\xff'

BOM64_BE = '\x00\x00\xfe\xff'
BOM_UTF32_BE = '\x00\x00\xfe\xff'

BOM64_LE = '\xff\xfe\x00\x00'
BOM_UTF32 = '\xff\xfe\x00\x00'
BOM_UTF32_LE = '\xff\xfe\x00\x00'

BOM_UTF8 = '\xef\xbb\xbf'

Is the process of writing a BOM sniffer really as simple
as detecting one of these 5 header types and then calling
codecs.open() with the appropriate encoding= parameter?

Note: I'm only interested in Unicode encodings. I am not
interested in any of the non-Unicode encodings supported
by the codecs module.

Thank you,
Malcolm
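
A minimal sketch of the sniffer described in the question, written for Python 3 (the thread itself predates it), so the built-in open() stands in for codecs.open(). The names sniff_bom and open_sniffed, and the fall-back to UTF-8 when no BOM is found, are assumptions made for illustration and are not part of the standard library.

```python
import codecs

# Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones, because the
# UTF-32 LE BOM begins with the same bytes as the UTF-16 LE BOM.
# Caveat: a UTF-16 LE file whose first character is U+0000 is
# indistinguishable from UTF-32 LE by its first four bytes.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(path, default="utf-8"):
    """Return a codec name based on the file's BOM, or `default` if none."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is 4 bytes
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding
    return default  # assumption: treat BOM-less files as UTF-8


def open_sniffed(path, default="utf-8"):
    """Open `path` for reading as text, using the sniffed encoding."""
    return open(path, "r", encoding=sniff_bom(path, default))
```

Returning the generic "utf-16", "utf-32" and "utf-8-sig" codec names rather than the endian-specific ones means the decoder consumes the BOM itself; with "utf-16-le", for example, the BOM would show up as a leading U+FEFF character in the decoded text.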
2 Answers

Lawrence D'Oliveiro

3/25/2010 11:17:00 PM

In message <mailman.1139.1269442366.23598.python-list@python.org>,
python@bdurham.com wrote:

> BOM_UTF8 = '\xef\xbb\xbf'

Since when does UTF-8 need a BOM?

Irmen de Jong

3/25/2010 11:22:00 PM

On 26-3-2010 0:16, Lawrence D'Oliveiro wrote:
> In message <mailman.1139.1269442366.23598.python-list@python.org>,
> python@bdurham.com wrote:
>
>> BOM_UTF8 = '\xef\xbb\xbf'
>
> Since when does UTF-8 need a BOM?

It doesn't, but it is allowed. Not recommended though.
Unfortunately several tools, such as notepad.exe, have a tendency to
silently add it when saving files.

-irmen
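
For files saved by such tools, Python's "utf-8-sig" codec may be the simplest way to cope: it drops a leading UTF-8 BOM when decoding and otherwise behaves like plain "utf-8". A short sketch, with the sample bytes made up for illustration:

```python
import codecs

# Bytes as a BOM-adding editor might write them (sample data, assumed).
raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

kept = raw.decode("utf-8")             # BOM survives as a leading U+FEFF
stripped = raw.decode("utf-8-sig")     # BOM is removed
plain = b"hello".decode("utf-8-sig")   # no BOM present: decoded unchanged
```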