Richard Conroy
8/29/2007 1:12:00 PM
You could use a regular expression to search your source string for
byte sequences that are not legal in UTF-8.
Basically, you assume the data is UTF-8 and then reject it if it contains
illegal or unknown sequences.
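
To make that concrete, here is a rough sketch of what I mean. The constant
and method names (WELL_FORMED_UTF8, looks_like_utf8?, partition_by_utf8) are
made up for illustration, and it assumes a Ruby that has
String#force_encoding and File.binread (1.9 or later). The byte ranges
follow the table of well-formed UTF-8 byte sequences in the Unicode
standard, so overlong forms and surrogates are rejected too.

```ruby
# Hypothetical pattern: zero or more well-formed UTF-8 characters, nothing else.
# The /n flag makes the regexp operate on raw bytes.
WELL_FORMED_UTF8 = /\A(?:
    [\x00-\x7F]                         # 1 byte: ASCII
  | [\xC2-\xDF][\x80-\xBF]              # 2 bytes, no overlong forms
  | \xE0[\xA0-\xBF][\x80-\xBF]          # 3 bytes, no overlong forms
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # 3 bytes, general case
  | \xED[\x80-\x9F][\x80-\xBF]          # 3 bytes, no surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}       # 4 bytes, no overlong forms
  | [\xF1-\xF3][\x80-\xBF]{3}           # 4 bytes, general case
  | \xF4[\x80-\x8F][\x80-\xBF]{2}       # 4 bytes, up to U+10FFFF
)*\z/nx

# Hypothetical helper: true if every byte of the string belongs to a
# well-formed UTF-8 sequence. Works on a binary copy, so the caller's
# string is left untouched.
def looks_like_utf8?(bytes)
  raw = bytes.dup.force_encoding('BINARY')
  !!(raw =~ WELL_FORMED_UTF8)
end

# Hypothetical helper: split the files matched by a glob into
# [utf8_paths, other_paths].
def partition_by_utf8(glob)
  Dir[glob].partition { |path| looks_like_utf8?(File.binread(path)) }
end
```

Something like utf, others = partition_by_utf8("Y:/test/**/*.xml") would
then replace the loop in your script. Keep in mind that this can only reject
a file: a pure ASCII file passes as well, and so would a file in some other
encoding that happens to contain only byte sequences that are legal in UTF-8.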
On 8/29/07, Rebhan, Gilbert <Gilbert.Rebhan@huk-coburg.de> wrote:
>
> Hi,
>
> I want to check the file encoding of files in a directory.
> So far I have tried:
>
> # found in an older thread in comp.lang.ruby
> class String
> def utf8?
> unpack('U*') rescue return false
> true
> end
> end
> # found in an older thread in comp.lang.ruby
>
> utf=Array.new
> others=Array.new
> Dir["Y:/test/**/*.xml"].each do |path|
> open(path) { |f|
> (f.read.utf8?) ? utf<<path : others<<path
> }
> end
>
> I also tried the chardet library (no Ruby documentation included),
> like this:
>
> require 'UniversalDetector'
>
> utf=Array.new
> others=Array.new
> Dir["Y:/test/**/*.xml"].each do |path|
> open(path) { |f|
> UniversalDetector.chardet(f.read) =~ /utf-8/ ?
> utf<<path : others<<path
> }
> end
> puts utf.join(",")
> puts others.join(",")
>
>
> Are there better or simpler ways?
>
> Regards, Gilbert