Asp Forum - fgetwc doesn't read Unicode

felix.leg

6/8/2011 11:42:00 AM

Hi,

I wrote a simple program to test Unicode reading in C:

[code]
#include <stdlib.h>
#include <stdio.h>
#include <wctype.h>

int main (int argc, char *argv[])
{
    FILE* f;
    wint_t chrk;

    f = fopen("characters","r");

    while( (chrk = fgetwc(f)) != WEOF ) {
        wprintf(L"%C ,", chrk );
    }
    printf("\b.\n"); //#1
    fclose(f);
    return 0;
}
[/code]

The file "characters" contains a list of characters from different
Unicode blocks, separated by spaces. The first block contains
characters valid in both ASCII and Unicode.

But the output is just:

a ,b ,c ,d ,e ,f ,: ,; ,' ," ,[ ,] ,{ ,} , ,
(it looks like line #1 wasn't executed either)

It stops as soon as fgetwc reaches the first non-ASCII character in the
file. I thought this function was *supposed to* read them...WTF?

6 Answers

jt

6/8/2011 8:19:00 PM

felix.leg <felix.leg@vp.pl> wrote:
> I wrote a simple code to test unicode reading in C:

> [code]
> #include <stdlib.h>
> #include <stdio.h>
> #include <wctype.h>

> int main (int argc, char *argv[])
> {
> FILE* f;
> wint_t chrk;
>
> f = fopen("characters","r");
>
> while( (chrk = fgetwc(f)) != WEOF ) {
> wprintf(L"%C ,", chrk );
> }
> printf("\b.\n"); //#1
> fclose(f);
> return 0;
> }
> [/code]

> The file "characters" contains a list of characters from different
> Unicode blocks, each divided by a space. The first block contains
> characters valid in both ASCII and Unicode.

> But the output is just:

> a ,b ,c ,d ,e ,f ,: ,; ,' ," ,[ ,] ,{ ,} , ,
> (it looks like the #1 line wasn't executed too)

> It stops after fgetwc reach the first non-ASCII character from file.
> I thought this function was *supposed to* read them...WTF?

Note: I am not a Unicode expert, but since you haven't got an
answer yet I'll give it a try.

I guess that what's in your file is UTF-8. And UTF-8 is not a
wide char encoding: a wide char has a fixed width (16, 32 or even
64 bits, depending on the system you're trying this on), while
UTF-8 is a different encoding in which the characters differ in
length (and, of the Unicode encodings, only UTF-8 stores the ASCII
characters as the same single bytes, which is why they still
work). One thing you could try, to find out whether this is the
problem, is to check the value of errno after the read has failed
(as indicated by ferror()). Chances are that it is EILSEQ,
indicating that what fgetwc() tried to read was not recognized as
a valid character in the current encoding.

I have no idea if this is possible on all systems, but on mine
(Linux) you can coax fgetwc() into dealing with UTF-8 characters
as expected (i.e. automatically converting them to wide chars)
by setting the LC_CTYPE category of the locale to something
with UTF-8. I.e. if I have an input file with UTF-8 characters
and add to the start of your program

setlocale( LC_CTYPE, "en_US.UTF8" );

(after including <locale.h>) it starts to work, in that fgetwc()
stops refusing to read in the characters. And if I check the
numerical values of the resulting wide chars, they definitely
aren't the UTF-8 byte values from the input file anymore.

Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://t...

John Doe

6/9/2011 1:40:00 AM

On Wed, 08 Jun 2011 13:41:37 +0200, felix.leg wrote:

> I wrote a simple code to test unicode reading in C:

> while( (chrk = fgetwc(f)) != WEOF ) {

fgetwc() assumes that the file is in the locale's encoding. If the file
is in e.g. UTF-8, then you need to use a UTF-8 locale. This means
calling setlocale(LC_CTYPE, ...); if you use "" as the locale, then the
appropriate environment variables need to indicate a UTF-8 locale.

> wprintf(L"%C ,", chrk );
> }
> printf("\b.\n"); //#1

> (it looks like the #1 line wasn't executed too)

You can't mix wide and byte I/O on a stream. A stream starts out
unoriented; the first operation on a stream sets the orientation (so
the wprintf() call will make stdout wide-oriented). Once a stream has an
orientation, the orientation can't be changed[1], and operations which
require a different orientation (e.g. printf on a wide-oriented stream)
aren't permitted.

[1] You can, however, create a new stream with freopen(), which returns
an unoriented stream.

> It stops after fgetwc reach the first non-ASCII character from file. I
> thought this function was *supposed to* read them...WTF?

At startup, all locale categories default to "C" (which has ASCII as its
encoding). You have to call setlocale() to change it.

Dirk Krause

6/9/2011 1:20:00 PM

"felix.leg" <felix.leg@vp.pl> wrote in news:isnn5i$cuu$1@news.onet.pl:

> Hi,
>
> I wrote a simple code to test unicode reading in C:
>
> [code]
> #include <stdlib.h>
> #include <stdio.h>
> #include <wctype.h>
>
> int main (int argc, char *argv[])
> {
> FILE* f;
> wint_t chrk;
>
> f = fopen("characters","r");
>
> while( (chrk = fgetwc(f)) != WEOF ) {
> wprintf(L"%C ,", chrk );
> }
> printf("\b.\n"); //#1
> fclose(f);
> return 0;
> }
> [/code]
>
> The file "characters" contains a list of characters from different
> Unicode blocks, each divided by a space. The first block contains
> characters valid in both ASCII and Unicode.
>
> But the output is just:
>
> a ,b ,c ,d ,e ,f ,: ,; ,' ," ,[ ,] ,{ ,} , ,
> (it looks like the #1 line wasn't executed too)
>
> It stops after fgetwc reach the first non-ASCII character from file.
> I thought this function was *supposed to* read them...WTF?

Hello,
I don't know which system you are using.
When doing some experiments reading UTF-16 encoded files on Windows
I found I had to use

_setmode(_fileno(f), _O_U16TEXT);

after the fopen() and before starting to read (_setmode() is
declared in <io.h>, the mode constants in <fcntl.h>).
Depending on the encoding of your file you probably
need another constant, e.g. _O_U8TEXT if your text is UTF-8 encoded.
Before processing the file it is a good idea to check whether it
starts with a byte order mark (BOM), i.e. the Unicode character
U+FEFF. The first bytes of the file then tell you the encoding:

0x00 0x00 0xFE 0xFF   UTF-32, MSB first (big-endian)
0xFF 0xFE 0x00 0x00   UTF-32, LSB first (little-endian)
0xFE 0xFF             UTF-16, MSB first (big-endian)
0xFF 0xFE             UTF-16, LSB first (little-endian)
0xEF 0xBB 0xBF        UTF-8
none of the above     no BOM, you must guess

Hope this helps,

Dirk

felix.leg

6/12/2011 9:30:00 AM

On 09.06.2011 03:39, Nobody wrote:
> On Wed, 08 Jun 2011 13:41:37 +0200, felix.leg wrote:
>
> fgetwc() assumes that the file is in the locale's encoding. If the file
> is in e.g. UTF-8, then you need to use a UTF-8 locale. This means
> calling setlocale(LC_CTYPE, ...); if you use "" as the locale, then the
> appropriate environment variables need to indicate a UTF-8 locale.
[..cut..]
> At startup, all locale categories default to "C" (which has ASCII as its
> encoding). You have to call setlocale() to change it.
>

So dealing with Unicode in C is inextricably linked with Locale?

John Doe

6/13/2011 1:09:00 AM

On Sun, 12 Jun 2011 11:30:19 +0200, felix.leg wrote:

>> fgetwc() assumes that the file is in the locale's encoding. If the file
>> is in e.g. UTF-8, then you need to use a UTF-8 locale. This means
>> calling setlocale(LC_CTYPE, ...); if you use "" as the locale, then the
>> appropriate environment variables need to indicate a UTF-8 locale.
> [..cut..]
>> At startup, all locale categories default to "C" (which has ASCII as its
>> encoding). You have to call setlocale() to change it.
>
> So dealing with Unicode in C is inextricably linked with Locale?

Dealing with "characters" in C is inextricably linked with the locale. You
can avoid using the locale for encoding by using e.g. iconv to convert
between wchar_t[] and char[] then reading/writing char[] with the standard
library. But there are still issues such as collating order, numeric
formatting, etc which are locale-dependent even with wchar_t[].

jt

6/13/2011 9:39:00 PM

felix.leg <felix.leg@vp.pl> wrote:
> On 09.06.2011 03:39, Nobody wrote:
> > On Wed, 08 Jun 2011 13:41:37 +0200, felix.leg wrote:
> >
> > fgetwc() assumes that the file is in the locale's encoding. If the file
> > is in e.g. UTF-8, then you need to use a UTF-8 locale. This means
> > calling setlocale(LC_CTYPE, ...); if you use "" as the locale, then the
> > appropriate environment variables need to indicate a UTF-8 locale.
> [..cut..]
> > At startup, all locale categories default to "C" (which has ASCII as its
> > encoding). You have to call setlocale() to change it.

> So dealing with Unicode in C is inextricably linked with Locale?

It's not really about C. The fundamental problem is that you have
a file in some encoding. That could be plain ASCII, Latin-1, one
of the Cyrillic or Chinese encodings, UTF-8, UTF-16, UTF-32 etc.
But which encoding it is can't be determined reliably from the
file (at best, one could make an educated guess by reading the
whole file, ruling out the encodings in which some of the data
would be impossible, and perhaps doing a frequency analysis). So
what encoding is fgetwc() supposed to assume when it reads from
the file? It relies on you to tell it which encoding to use when
reading and converting to the encoding used for wide chars. And
that's what you do by setting the LC_CTYPE category of the
locale. That's actually no different from telling e.g. scanf()
whether to expect a dot or a comma as the decimal point in
floating point numbers. By default it will expect a dot, but by
setting LC_NUMERIC you can make it expect a comma, i.e. to
understand that 0,75 is meant to represent three quarters.
Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://t...

comp.lang.c
