Dirk Krause
6/9/2011 1:20:00 PM
"felix.leg" <felix.leg@vp.pl> wrote in news:isnn5i$cuu$1@news.onet.pl:
> Hi,
>
> I wrote a simple program to test Unicode reading in C:
>
> [code]
> #include <stdlib.h>
> #include <stdio.h>
> #include <wctype.h>
>
> int main (int argc, char *argv[])
> {
> FILE* f;
> wint_t chrk;
>
> f = fopen("characters","r");
>
> while( (chrk = fgetwc(f)) != WEOF ) {
> wprintf(L"%C ,", chrk );
> }
> printf("\b.\n"); //#1
> fclose(f);
> return 0;
> }
> [/code]
>
> The file "characters" contains a list of characters from different
> Unicode blocks, each divided by a space. The first block contains
> characters valid in both ASCII and Unicode.
>
> But the output is just:
>
> a ,b ,c ,d ,e ,f ,: ,; ,' ," ,[ ,] ,{ ,} , ,
> (it looks like the #1 line wasn't executed too)
>
> It stops after fgetwc reaches the first non-ASCII character in the file.
> I thought this function was *supposed to* read them...WTF?
Hello,
I don't know which system you are using.
When I experimented with reading UTF-16 encoded files on Windows,
I found I had to call
_setmode(_fileno(f), _O_U16TEXT);
after the fopen() and before the first read. _setmode() is declared
in <io.h> and the _O_* constants in <fcntl.h>.
Depending on the encoding of your file you will probably
need a different constant, e.g. _O_U8TEXT if your text is UTF-8 encoded.
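Here is a minimal sketch of how that call could be wrapped. The helper name open_wide_text is made up for this example, and the Windows branch assumes the file is UTF-8 (swap in _O_U16TEXT for UTF-16). The non-Windows branch is my own addition, not something I tested for your case: on POSIX systems fgetwc() decodes according to the current locale, and a program that never calls setlocale() runs in the "C" locale, where non-ASCII bytes typically make fgetwc() fail.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

#ifdef _WIN32
#include <fcntl.h>  /* _O_U8TEXT, _O_U16TEXT */
#include <io.h>     /* _setmode */
#endif

/* Hypothetical helper: open a text file and prepare it for fgetwc().
   Returns NULL if the file cannot be opened or the mode cannot be set. */
FILE *open_wide_text(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return NULL;

#ifdef _WIN32
    /* Assumption: the file is UTF-8. Use _O_U16TEXT for UTF-16. */
    if (_setmode(_fileno(f), _O_U8TEXT) == -1) {
        fclose(f);
        return NULL;
    }
#else
    /* Adopt the user's locale (for example a UTF-8 one) so that
       fgetwc() can decode multibyte characters. Normally this is
       done once at program start; it is here to keep the sketch short. */
    setlocale(LC_ALL, "");
#endif
    return f;
}
```

With something like this in place, the loop from the original program can stay as it is: call open_wide_text("characters") instead of fopen() and keep reading with fgetwc() until it returns WEOF.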
Before processing the file it is a good idea to inspect its first
few bytes for a byte order mark (BOM). This is the Unicode character
U+FEFF. If the file begins with
  0x00 0x00 0xFE 0xFF   you have UTF-32 (32-bit Unicode characters), MSB first.
  0xFF 0xFE 0x00 0x00   you have UTF-32, LSB first.
  0xFE 0xFF             you have UTF-16, MSB first.
  0xFF 0xFE             you have UTF-16, LSB first.
  0xEF 0xBB 0xBF        you have UTF-8 encoded text.
  none of the above     you must guess.
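The BOM check could be sketched roughly like this. The names bom_kind and detect_bom are invented for the example. Note that the 4-byte patterns are tested before the 2-byte ones, because 0xFF 0xFE 0x00 0x00 also starts with the UTF-16 LSB-first mark. Treat the result as a heuristic: those four bytes could in principle also be a UTF-16 file whose first character after the BOM is U+0000.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    BOM_NONE,       /* no BOM found, the encoding must be guessed */
    BOM_UTF8,
    BOM_UTF16_BE,   /* MSB first */
    BOM_UTF16_LE,   /* LSB first */
    BOM_UTF32_BE,   /* MSB first */
    BOM_UTF32_LE    /* LSB first */
} bom_kind;

/* Classify the first n bytes of a file by their byte order mark.
   Longer patterns are checked first so that FF FE 00 00 is not
   mistaken for the shorter UTF-16 mark FF FE. */
bom_kind detect_bom(const unsigned char *p, size_t n)
{
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return BOM_UTF32_BE;
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return BOM_UTF32_LE;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return BOM_UTF8;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return BOM_UTF16_BE;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```

To use it, open the file in binary mode ("rb"), fread() up to four bytes into an unsigned char buffer, pass the buffer and the number of bytes actually read to detect_bom(), then reopen or rewind the file and select the matching text mode.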
Hope this helps,
Dirk