Ben Crowell
5/11/2008 10:15:00 PM
Alex Fenton wrote:
> Ben Crowell wrote:
>> I have some existing ruby 1.9 code that broke recently with a new build
>> of ruby. It looks like the problem was that my preexisting text input
>> files, which I'd been reading from stdin, contained some characters that
>> were not valid UTF-8 or US-ASCII.
>
> ...
>
>> I'm happy to change the input files, because it is an error that they
>> aren't properly encoded. However, I'd also like to find some way to test
>> for this type of error more gracefully, and I can't seem to figure out
>> how to do it.
>
> I use Iconv in the standard library to convert from UTF-8 to UTF-8 to test
> whether files being imported by a user are in fact in the right
> encoding. This otherwise redundant recoding will raise an
> Iconv::IllegalSequence if there's a problem. This can be caught and reported.
Thanks for the suggestion. However, I already have an error that I can
catch and report. The problem is that it's not very helpful to the user
to say, "hey, somewhere in your 100-page text file, there are illegal
characters." That's why I was trying to do this:
if t =~ /([^\n]*[^\000-\177][^\n]*)/
  $stderr.print "Bad ASCII character detected in this line:\n#{$1}\n"
end
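(One way to sidestep the regex entirely, sketched below under the assumption of Ruby 1.9+: ask each line directly whether its bytes are valid in its claimed encoding via String#valid_encoding?, which pinpoints the offending line number without ever pattern-matching the suspect bytes. The method name first_invalid_line is made up for illustration.)

```ruby
# Sketch, assuming Ruby 1.9+: read line by line and test each line's
# bytes against its claimed encoding, reporting the first bad line.
def first_invalid_line(io)
  io.each_line.with_index(1) do |line, n|
    return [n, line] unless line.valid_encoding?
  end
  nil
end
```

Calling first_invalid_line($stdin) would return the 1-based line number and the raw line, or nil if every line decodes cleanly.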
It seems to me that I need some way to convince Ruby that the string t
is in an encoding where all characters are a single byte, and it's ok
to have the high bit set. Then I could go ahead and use regexes to test
whether it contains any characters with the high bit set, and report
them properly. It just seems like the string, once I read it in, is
like the Medusa -- my program doesn't even dare take a peek at it for
fear of being turned to stone.
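For what it's worth, 1.9 does seem to offer exactly that relabeling via String#force_encoding. A sketch (the helper name bad_line is made up for illustration): reinterpreting the bytes as ASCII-8BIT makes every byte a single character, so the regex can legally look at bytes with the high bit set and report the offending line.

```ruby
# Sketch, assuming Ruby 1.9+: relabel the bytes as ASCII-8BIT (binary),
# one byte per character, so the high-bit regex no longer raises.
def bad_line(t)
  raw = t.dup.force_encoding("ASCII-8BIT") # relabels bytes; no transcoding
  raw =~ /([^\n]*[^\x00-\x7f][^\n]*)/n ? $1 : nil # /n: binary-encoded regex
end
```

Note that force_encoding only changes the tag on the string; the bytes are untouched, which is why the dup is there to avoid mutating the caller's string.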