Asp Forum - Ruby UTF-8 - comp.lang.ruby

pkchau

3/15/2005 5:20:00 PM

I'm working with Japanese character sets in Windows. I can save my
*.rb files with notepad using UTF-8 but I can't run them with Ruby.
This is what happens when I try to run it.

c:\> ruby -Ku myFile.rb
jpn.rb:1: undefined method `∩╗┐' for main:Object
(NoMethodError)

Am I doing something wrong?

My goal is to read/write strings (containing Japanese characters)
from a web browser. Is there a recommended way of doing this?

Peter

12 Answers

WoNáDo

3/15/2005 6:03:00 PM

--
Wolfgang Nádasi-Donner
wonado@donnerweb.de
"Peter C" <pkchau@gmail.com> wrote in message
news:2c9220c5.0503150920.6eedfbc9@posting.google.com...
> I'm working with Japanese character sets in Windows. I can save my
> *.rb files with notepad using UTF-8 but I can't run them with Ruby.
> This is what happens when I try to run it.
>
> c:\> ruby -Ku myFile.rb
> jpn.rb:1: undefined method `∩╗┐' for main:Object
> (NoMethodError)
>
>
> Am I doing something wrong?
>
> My goal is the read/write strings (containing Japanese characters)
> from a web browser. Is there a recommend way of doing this?
>
> Peter

The Windows editor (Notepad) always writes a "Byte Order Mark" (BOM) at the beginning
of UTF-8/16LE/16BE encoded files. In this case a UTF-8 encoded file begins with
"EF BB BF" (hex). These bytes are a signature, not text, and should usually be ignored (for more
information see http://www.un...).
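
The odd method name in the error message is most likely just these three bytes shown in the console's code page. A minimal sketch, assuming the console used code page 437 (the original post does not say which one was active); the helper name is hypothetical:

```ruby
# Hypothetical helper: shows how the UTF-8 BOM bytes look when a console
# interprets them as code page 437 (an assumption about the poster's setup).
def bom_as_cp437
  bom = "\xEF\xBB\xBF".b                 # the three BOM bytes, as raw binary
  bom.force_encoding("IBM437").encode("UTF-8")
end

puts bom_as_cp437
```

Under that assumption, Ruby 1.8 read the BOM as the start of an identifier on line 1, which would explain the NoMethodError.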

One possibility is to remove the first three bytes of the UTF-8 encoded file
using a filter program or a hex editor. Do this on a copy,
because the Windows editor cannot work correctly with the changed data :-((
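
Such a filter could itself be a few lines of Ruby. This is only a sketch: the constant and method names are made up for illustration, and it relies on String#b, so it needs a much newer Ruby than the 1.8 discussed in this thread.

```ruby
# The three bytes Notepad puts in front of UTF-8 files.
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: returns the bytes without a leading UTF-8 BOM.
# Input that does not start with a BOM is returned unchanged (as binary).
def strip_utf8_bom(bytes)
  data = bytes.b                          # binary copy, so we compare raw bytes
  data.start_with?(UTF8_BOM) ? data.byteslice(3..-1) : data
end

# Hypothetical helper: writes a BOM-free copy and leaves the original alone,
# so the original can still be opened in Notepad.
def write_bom_free_copy(src, dest)
  File.binwrite(dest, strip_utf8_bom(File.binread(src)))
end
```

Usage would be something like write_bom_free_copy("myFile.rb", "myFile_nobom.rb"), then running the copy with ruby -Ku.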

Florian Gross

3/15/2005 8:35:00 PM

Wolfgang Nádasi-Donner wrote:

> "Peter C" <pkchau@gmail.com> wrote in message
> news:2c9220c5.0503150920.6eedfbc9@posting.google.com...
>
>>I'm working with Japanese character sets in Windows. I can save my
>>*.rb files with notepad using UTF-8 but I can't run them with Ruby.
>>This is what happens when I try to run it.
>>
>>c:\> ruby -Ku myFile.rb
>>jpn.rb:1: undefined method `∩╗┐' for main:Object
>>(NoMethodError)
>>
> The Windows-Editor writes always a "Byte Order Mark" (BOM) at the beginning
> of UTF-8/16LE/16BE coded files. In this case a UTF-8 coded file begins with
> "EF BB BF" (hex). These non-characters should usually be ignored (for more
> information see http://www.un...).
>
> One possibility is to remove the first three bytes of the UTF-8 encoded file
> using some filter program or a hex editor. You should do this into a copy,
> because the Windows editor cannot work correctly on this changed data :-((

Another is to put an assignment to a scratch variable at the very
beginning of the script. Ruby will parse the BOM as part of the
variable name and thus not complain about it.

I posted this to ruby-core a few months ago and would like to see it
fixed...

Nikolai Weibull

3/20/2005 2:25:00 PM

* Wolfgang Nádasi-Donner (Mar 15, 2005 19:10):
> > I'm working with Japanese character sets in Windows. I can save my
> > *.rb files with notepad using UTF-8 but I can't run them with Ruby.

> The Windows-Editor writes always a "Byte Order Mark" (BOM) at the
> beginning of UTF-8/16LE/16BE coded files. In this case a UTF-8 coded
> file begins with "EF BB BF" (hex). These non-characters should usually
> be ignored (for more information see http://www.un...).

Why does it write a BOM for UTF-8 encoded files? It's utterly
meaningless to discuss byte order for UTF-8 encoded text,
nikolai

--
::: name: Nikolai Weibull :: aliases: pcp / lone-star / aka :::
::: born: Chicago, IL USA :: loc atm: Gothenburg, Sweden :::
::: page: minimalistic.org :: fun atm: gf,lps,ruby,lisp,war3 :::
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}

WoNáDo

3/20/2005 2:33:00 PM

--
Wolfgang Nádasi-Donner
wonado@donnerweb.de
"Nikolai Weibull" <mailing-lists.ruby-talk@rawuncut.elitemail.org> wrote
in message news:20050320142500.GB6070@puritan.pcp.ath.cx...
> * Wolfgang Nádasi-Donner (Mar 15, 2005 19:10):
> > > I'm working with Japanese character sets in Windows. I can save my
> > > *.rb files with notepad using UTF-8 but I can't run them with Ruby.
>
> > The Windows-Editor writes always a "Byte Order Mark" (BOM) at the
> > beginning of UTF-8/16LE/16BE coded files. In this case a UTF-8 coded
> > file begins with "EF BB BF" (hex). These non-characters should usually
> > be ignored (for more information see http://www.un...).
>
> Why does it write a BOM for UTF-8 encoded files? It's utterly
> meaningless to discuss byte order for UTF-8 encoded text,
> nikolai

Simply said, because it is allowed by the Unicode Standard.

I assume that Microsoft uses it so that Notepad can determine which
encoding existing data uses. This means that one cannot edit UTF-8
encoded data with Notepad if there is no appropriate BOM.

Florian Gross

3/20/2005 3:18:00 PM

Nikolai Weibull wrote:

>>>I'm working with Japanese character sets in Windows. I can save my
>>>*.rb files with notepad using UTF-8 but I can't run them with Ruby.
>
>>The Windows-Editor writes always a "Byte Order Mark" (BOM) at the
>>beginning of UTF-8/16LE/16BE coded files. In this case a UTF-8 coded
>>file begins with "EF BB BF" (hex). These non-characters should usually
>>be ignored (for more information see http://www.un...).
>
> Why does it write a BOM for UTF-8 encoded files? It's utterly
> meaningless to discuss byte order for UTF-8 encoded text,

So that it can identify the file as UTF-8 encoded in the future without
having to guess based on byte count, I assume.

I think that that behavior makes sense and would like to see it
supported in Ruby.
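
To illustrate what "identify without guessing" amounts to, here is a rough sketch of BOM sniffing for the three encodings mentioned in this thread. The table and method names are hypothetical, and this is not a claim about how Notepad actually does it:

```ruby
# Hypothetical lookup table: BOM byte sequences and the encoding each signals.
# Only the encodings named in this thread are covered; UTF-32 is not handled.
KNOWN_BOMS = {
  "\xEF\xBB\xBF".b => "UTF-8",
  "\xFF\xFE".b     => "UTF-16LE",
  "\xFE\xFF".b     => "UTF-16BE",
}.freeze

# Returns the encoding name signalled by a leading BOM, or nil if there is
# none (in which case the reader is back to guessing).
def encoding_from_bom(bytes)
  head = bytes.b
  KNOWN_BOMS.each { |bom, name| return name if head.start_with?(bom) }
  nil
end
```

A file without any of these prefixes yields nil, which is exactly the case where an editor has to fall back on heuristics.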

Lothar Scholz

3/20/2005 3:34:00 PM

Hello Florian,

FG> Nikolai Weibull wrote:

>>>>I'm working with Japanese character sets in Windows. I can save my
>>>>*.rb files with notepad using UTF-8 but I can't run them with Ruby.
>>
>>>The Windows-Editor writes always a "Byte Order Mark" (BOM) at the
>>>beginning of UTF-8/16LE/16BE coded files. In this case a UTF-8 coded
>>>file begins with "EF BB BF" (hex). These non-characters should usually
>>>be ignored (for more information see http://www.un...).
>>
>> Why does it write a BOM for UTF-8 encoded files? It's utterly
>> meaningless to discuss byte order for UTF-8 encoded text,

FG> So that it can identify the file as UTF-8 encoded in the future without
FG> having to guess based on byte count, I assume.

FG> I think that that behavior makes sense and would like to see it
FG> supported in Ruby.

Doesn't Ruby CVS already do this?
I thought I read something about this on the ruby-core list.

If there is no BOM, I would really recommend the way Python
specifies source file encodings; it's simple and excellent.

--
Best regards, emailto: scholz at scriptolutions dot com
Lothar Scholz http://www.ru...
CTO Scriptolutions Ruby, PHP, Python IDE 's

Florian Gross

3/20/2005 4:09:00 PM

Lothar Scholz wrote:

> FG> So that it can identify the file as UTF-8 encoded in the future without
> FG> having to guess based on byte count, I assume.
>
> FG> I think that that behavior makes sense and would like to see it
> FG> supported in Ruby.
>
> Doesn't ruby CVS already do this ?
> I thought i read something about this on the ruby core list.

No idea, I'm on 1.8.2.

Sam Roberts

3/20/2005 4:13:00 PM

Quoting mailing-lists.ruby-talk@rawuncut.elitemail.org, on Sun, Mar 20, 2005 at 11:25:27PM +0900:
> * Wolfgang Nádasi-Donner (Mar 15, 2005 19:10):
> > > I'm working with Japanese character sets in Windows. I can save my
> > > *.rb files with notepad using UTF-8 but I can't run them with Ruby.
>
> > The Windows-Editor writes always a "Byte Order Mark" (BOM) at the
> > beginning of UTF-8/16LE/16BE coded files. In this case a UTF-8 coded
> > file begins with "EF BB BF" (hex). These non-characters should usually
> > be ignored (for more information see http://www.un...).
>
> Why does it write a BOM for UTF-8 encoded files? It's utterly
> meaningless to discuss byte order for UTF-8 encoded text,

It's a tag to indicate the data is UTF-8. If it wasn't there it could be
anything, iso-8859-*, koi8, ...
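
A small sketch of why the tag matters: the same two bytes give different text depending on which 8-bit encoding the reader assumes. The helper name is hypothetical, and ISO-8859-1 is picked only as one example of a legacy encoding:

```ruby
# Hypothetical helper: interprets raw bytes in the given encoding and
# returns the result as UTF-8 so it can be printed and compared.
def decode_as(bytes, encoding_name)
  bytes.dup.force_encoding(encoding_name).encode("UTF-8")
end

raw = "\xC3\xA1".b                 # the UTF-8 bytes for "á"
puts decode_as(raw, "UTF-8")       # read as UTF-8: one character
puts decode_as(raw, "ISO-8859-1")  # read as Latin-1: two characters
```

The second reading is the familiar kind of garbling you get when untagged UTF-8 is taken for a legacy 8-bit encoding.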

Apple does this in NSString when I convert UCS-2 data to UTF-8, too. It
confused me. I still don't like it, but I have a suspicion that it's
partly because these OSes have a legacy "assumed encoding" for 8-bit
text, and that it isn't UTF-8.

In a way it means that there are two "dialects" of UTF-8. This is
unfortunate.

Sam

Christian Neukirchen

3/20/2005 4:46:00 PM

Florian Gross <flgr@ccan.de> writes:

> Nikolai Weibull wrote:
>
>>>>I'm working with Japanese character sets in Windows. I can save my
>>>>*.rb files with notepad using UTF-8 but I can't run them with Ruby.
>>
>>>The Windows-Editor writes always a "Byte Order Mark" (BOM) at the
>>>beginning of UTF-8/16LE/16BE coded files. In this case a UTF-8 coded
>>>file begins with "EF BB BF" (hex). These non-characters should usually
>>>be ignored (for more information see http://www.un...).
>> Why does it write a BOM for UTF-8 encoded files? It's utterly
>> meaningless to discuss byte order for UTF-8 encoded text,
>
> So that it can identify the file as UTF-8 encoded in the future
> without having to guess based on byte count, I assume.
>
> I think that that behavior makes sense and would like to see it
> supported in Ruby.

To what extent do BOMs interfere with shebang-lines?

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneuk...

Nikolai Weibull

3/20/2005 5:04:00 PM

* Christian Neukirchen (Mar 20, 2005 17:50):
> To what extent do BOMs interfere with shebang-lines?

They can't coexist, unless the operating-system deals with a BOM
appropriately. Linux doesn't, for one,
nikolai
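
The reason is that the kernel only treats a file as a script when its very first two bytes are "#!", and a BOM pushes them to offset 3. A minimal sketch of a check for this situation; both method names are hypothetical:

```ruby
BOM_BYTES = "\xEF\xBB\xBF".b

# True only when the very first two bytes are "#!", which is what the
# kernel looks for when executing a script directly.
def shebang_first?(bytes)
  bytes.b.start_with?("#!")
end

# True when a shebang line is present but sits behind a UTF-8 BOM, so the
# kernel will not see it.
def bom_hides_shebang?(bytes)
  data = bytes.b
  data.start_with?(BOM_BYTES) && data.byteslice(3..-1).start_with?("#!")
end
```

Running a file through bom_hides_shebang?(File.binread(path)) would flag scripts saved by an editor that adds a BOM.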

--
::: name: Nikolai Weibull :: aliases: pcp / lone-star / aka :::
::: born: Chicago, IL USA :: loc atm: Gothenburg, Sweden :::
::: page: minimalistic.org :: fun atm: gf,lps,ruby,lisp,war3 :::
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}
