Asp Forum - Detect file encoding utf-8

Rebhan, Gilbert

8/29/2007 12:14:00 PM

Hi,

I want to check the file encoding of files in a directory.
So far I have tried:

# found in an older thread in comp.lang.ruby
class String
  # true if the bytes decode as UTF-8; unpack('U*') raises ArgumentError otherwise
  def utf8?
    unpack('U*') rescue return false
    true
  end
end

utf = Array.new
others = Array.new
Dir["Y:/test/**/*.xml"].each do |path|
  open(path) { |f|
    (f.read.utf8?) ? utf << path : others << path
  }
end

I also tried the chardet library (which ships without Ruby documentation),
like this:

require 'UniversalDetector'

utf = Array.new
others = Array.new
Dir["Y:/test/**/*.xml"].each do |path|
  open(path) { |f|
    UniversalDetector.chardet(f.read) =~ /utf-8/ ?
      utf << path : others << path
  }
end
puts utf.join(",")
puts others.join(",")

Are there better or simpler ways?

Regards, Gilbert

3 Answers

Richard Conroy

8/29/2007 1:12:00 PM

You could use regular expressions to search the file's contents for
byte sequences that are not legal in UTF-8.

Basically, you assume the file is UTF-8 and then reject it if it contains
illegal or unknown sequences.
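
A minimal sketch of that assume-then-reject idea. Note that it swaps the hand-written regular expression for the interpreter's built-in validity check, so it assumes Ruby 1.9 or later (String#force_encoding, String#valid_encoding?, File.binread). The names utf8_bytes? and partition_by_utf8 are made up for illustration and are not from any library.

```ruby
# Sketch only: assumes Ruby 1.9+; helper names are hypothetical.

# Treat the raw bytes as UTF-8 and report whether every byte sequence
# is well formed. dup keeps the caller's string untouched.
def utf8_bytes?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Split a list of file paths into [utf8_paths, other_paths].
# File.binread reads the bytes without any transcoding.
def partition_by_utf8(paths)
  paths.partition { |path| utf8_bytes?(File.binread(path)) }
end
```

With the directory from the original post this could be called as utf, others = partition_by_utf8(Dir["Y:/test/**/*.xml"]). Keep in mind that passing the check only means the bytes are legal UTF-8: a pure ASCII file passes too, and a file in another encoding can pass by coincidence.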

Xavier Noria

8/29/2007 2:07:00 PM

On Aug 29, 2007, at 2:14 PM, Rebhan, Gilbert wrote:

> I want to check the file encoding of files in a directory.

Have you tried charguess?

http://raa.ruby-lang.org/project...

-- fxn

Gilbert Rebhan

8/29/2007 6:45:00 PM

Xavier Noria wrote:
> On Aug 29, 2007, at 2:14 PM, Rebhan, Gilbert wrote:
>
>> I want to check the file encoding of files in a directory.
>
> Have you tried charguess?
>
> http://raa.ruby-lang.org/project...

No. How do I install it?

The tarball contains only these files:

charguess.c
extconf.rb
MANIFEST
sample.rb

Regards, Gilbert
