comp.lang.ruby

Dir.entries and UTF-8

Timo Hoepfner

1/12/2006 3:21:00 PM

Hi,

What's going on here? This is on Mac OS X 10.4.4. Looks like
Dir#entries returns strings encoded with some encoding I didn't
expect. How can I convert the string to UTF-8?

$KCODE='UTF8'
require 'jcode'
s="äöüßÄÖÜ"
puts s.split(//).inspect
# => ["ä", "ö", "ü", "ß", "Ä", "Ö", "Ü"]
test_dir="/tmp/test"
`mkdir #{test_dir}`
`touch #{test_dir}/#{s}`
f=Dir.entries(test_dir).last
puts f.split(//).inspect
# => ["a", "̈", "o", "̈", "u", "̈", "ß", "A", "̈", "O", "̈", "U", "̈"]

Timo



5 Answers

Yukihiro Matsumoto

1/12/2006 4:18:00 PM

Hi,

On 1/13/06, Timo Hoepfner <th-dev@onlinehome.de> wrote:

> What's going on here? This is on Mac OS X 10.4.4. Looks like
> Dir#entries returns strings encoded with some encoding I didn't
> expect. How can I convert the string to UTF-8?

You have got a correct UTF-8 string. Unlike Windows XP, Mac OS X
decomposes characters into their components as much as possible (sorry,
I forgot the correct term for this policy). So what you got:

> # => ["a", "̈", "o", "̈", "u", "̈", "ß", "A", "̈", "O", "̈", "U", "̈"]

is the decomposed form of your string: a+umlaut, o+umlaut, etc.

matz.
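
The decomposition described above is easier to see in the code points than in the printed characters. The following is a minimal sketch for illustration, not code from the thread. The helper name code_points is made up here, and the two strings are built by hand instead of being read from a directory.

```ruby
# Build both forms by hand so the example does not depend on a file system.
decomposed  = [0x61, 0x308].pack('U*')  # "a" followed by U+0308 COMBINING DIAERESIS
precomposed = [0xE4].pack('U*')         # U+00E4, a-umlaut as a single code point

# Hypothetical helper: list the code points of a UTF-8 string.
def code_points(str)
  str.unpack('U*').map { |cp| format('U+%04X', cp) }
end

puts code_points(decomposed).inspect   # => ["U+0061", "U+0308"]
puts code_points(precomposed).inspect  # => ["U+00E4"]
puts decomposed == precomposed         # => false
```

Both strings are valid UTF-8 and look the same on screen, but they compare as different because their byte sequences differ.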


Alex LeDonne

1/12/2006 4:31:00 PM

On 1/12/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Hi,
>
> On 1/13/06, Timo Hoepfner <th-dev@onlinehome.de> wrote:
>
> > What's going on here? This is on Mac OS X 10.4.4. Looks like
> > Dir#entries returns strings encoded with some encoding I didn't
> > expect. How can I convert the string to UTF-8?
>
> You have got a correct UTF-8 string. Unlike Windows XP, Mac OS X
> decomposes characters into their components as much as possible (sorry,
> I forgot the correct term for this policy). So what you got:
>
> > # => ["a", "̈", "o", "̈", "u", "̈", "ß", "A", "̈", "O", "̈", "U", "̈"]
>
> is the decomposed form of your string: a+umlaut, o+umlaut, etc.
>
> matz.
>

Matz refers to Unicode Normalization Form D (NFD). According to
http://developer.apple.com/technotes/tn/t... (HFS Plus Volume
Format):

"HFS Plus stores strings fully decomposed and in canonical order. HFS
Plus compares strings in a case-insensitive fashion. Strings may
contain Unicode characters that must be ignored by this comparison.
For more details on these subtleties, see Unicode Subtleties."

-A
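
For readers on a current Ruby, the two normalization forms can be compared directly. This is a sketch under one assumption: Ruby 2.2 or later, where String#unicode_normalize is available (it did not exist when this thread was written). HFS Plus uses its own variant of the decomposition, so treat the NFD string here as an approximation of what the file system returns.

```ruby
# Requires Ruby 2.2 or later for String#unicode_normalize.
composed   = "\u00E4\u00F6\u00FC\u00DF"          # a-, o-, u-umlaut and sharp s, precomposed
decomposed = composed.unicode_normalize(:nfd)    # canonical decomposition (NFD)

puts composed.length    # => 4
puts decomposed.length  # => 7 (sharp s has no decomposition; the umlauts each gain U+0308)
puts decomposed.unicode_normalize(:nfc) == composed  # => true
```

Recomposing with :nfc returns the original string, which is the conversion the original question was after.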


Austin Ziegler

1/12/2006 4:32:00 PM

On 12/01/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> On 1/13/06, Timo Hoepfner <th-dev@onlinehome.de> wrote:
>> What's going on here? This is on Mac OS X 10.4.4. Looks like
>> Dir#entries returns strings encoded with some encoding I didn't
>> expect. How can I convert the string to UTF-8?
> You have got a correct UTF-8 string. Unlike Windows XP, Mac OS X
> decomposes characters into their components as much as possible (sorry,
> I forgot the correct term for this policy). So what you got:

IIRC, that's the correct term. (Decomposed.)

-austin
--
Austin Ziegler * halostatue@gmail.com
* Alternate: austin@halostatue.ca


Timo Hoepfner

1/13/2006 9:40:00 AM

>> How can I convert the string to UTF-8?
>
> You have got a correct UTF-8 string. Unlike Windows XP, Mac OS X
> decomposes characters into their components as much as possible (sorry,
> I forgot the correct term for this policy). So what you got:
>
>> # => ["a", "̈", "o", "̈", "u", "̈", "ß", "A", "̈", "O", "̈", "U", "̈"]
>
> is the decomposed form of your string: a+umlaut, o+umlaut, etc.

Hi Matz, Austin and A.

Thanks for the clarification. Unicode is more complex than it seems at
first...

Nevertheless, that doesn't solve my current problem. What I'm trying
to do is to organize files within a directory into subfolders based
on the first N characters of the file name. Here's my code (w/o error
handling), which works fine for 8-bit characters but doesn't work for
e.g. umlauts:

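$KCODE='UTF8'
require 'jcode'
require 'pathname'
require 'fileutils'
# arguments: the directory to organize and the prefix length N
wd, len = Pathname.new(ARGV[0]), ARGV[1].to_i
# only plain files are moved; existing subfolders are left alone
files=wd.children.reject{|f| f.directory?}
files.each do |f|
  # target subfolder is named after the first len characters of the file name
  dir = wd + Pathname.new(f.basename.to_s.split(//)[0..len-1].join)
  dir.mkdir unless dir.exist?
  FileUtils.mv f, dir
end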

I guess I have to recompose the decomposed filename somehow. Are
there any tools for that in the standard library or somewhere else?

Thanks for your help,

Timo



Timo Hoepfner

1/17/2006 12:56:00 PM

Hi,

To answer my own question, here's a solution: use the 'unicode' gem
and change the line

> dir = wd + Pathname.new(f.basename.to_s.split(//)[0..len-1].join)

to

dir = wd + Pathname.new(Unicode::compose(f.basename.to_s).split(//)[0..len-1].join)

Then it works.

Timo
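
On Ruby 2.2 or later, the same recomposition step can be sketched without the gem by using String#unicode_normalize. This swaps in a different technique from the Unicode::compose call used above, and the helper name prefix_for is hypothetical. It shows only the prefix calculation, not the directory handling.

```ruby
# Hypothetical helper, not from the thread: the first `len` characters of a
# file name after recomposing it to NFC (needs Ruby 2.2+).
def prefix_for(name, len)
  name.unicode_normalize(:nfc)[0, len]
end

# A decomposed name, similar to what Dir.entries returned in the first post.
decomposed_name = "a\u0308o\u0308u\u0308.txt"

puts prefix_for(decomposed_name, 2) == "\u00E4\u00F6"  # => true, two precomposed umlauts
puts decomposed_name[0, 2] == "a\u0308"                # => true, only one visible character
```

Without the normalization step, a prefix of length 2 covers just the base letter and its combining mark, which is why the original script produced the wrong folder names.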