Nikolai Weibull
3/11/2005 1:05:00 AM
* Ian Macdonald (Mar 11, 2005 01:30):
> irb(main):001:0> "\032p\210\004n\306\271\310gY\002".unpack("U*")
> ArgumentError: malformed UTF-8 character
>         from (irb):1:in `unpack'
>         from (irb):1
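The failure quoted above is raised by unpack("U*") itself, so one way to detect bad input without crashing is to rescue the exception. A minimal sketch; the helper name utf8_codepoints is hypothetical and not from this thread:

```ruby
# Hypothetical helper: returns the code points of str, or nil when
# unpack("U*") rejects the bytes as malformed UTF-8.
def utf8_codepoints(str)
  str.unpack("U*")
rescue ArgumentError
  nil
end

p utf8_codepoints("abc")
p utf8_codepoints("\032p\210\004n\306\271\310gY\002")
```

This only tells you that the data is bad, not where it goes wrong.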
utf8validate.rb:
--- cut here ---
#! /usr/bin/ruby -w
ARGV[0] =~ /^(
    [\x00-\x7F]                        # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*/x
if $~.end(0) != ARGV[0].length
  printf("malformed UTF-8 character starting at position %d in the input\n", $~.end(0))
  exit 1
end
--- cut here ---
and from zsh:
% utf8validate.rb $'\032p\210\004n\306\271\310gY\002'
malformed UTF-8 character starting at position 2 in the input
%
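The script above was written for the Ruby of 2005, where strings are plain byte arrays. On Ruby 2.0 or later the same idea needs a binary regex (the n flag) and byte-based lengths. A rough sketch under those assumptions; the names UTF8_PREFIX and first_malformed_offset are made up for illustration:

```ruby
# Longest well-formed UTF-8 prefix, matched byte by byte (n = binary regex).
# UTF8_PREFIX and first_malformed_offset are hypothetical names.
UTF8_PREFIX = /\A(?:
    [\x00-\x7F]                        # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*/nx

# Returns the zero-based byte offset of the first malformed sequence,
# or nil when the whole string is well-formed UTF-8.
def first_malformed_offset(str)
  data = str.b                               # binary copy, so offsets count bytes
  valid_len = UTF8_PREFIX.match(data).end(0) # the pattern always matches, possibly empty
  valid_len == data.bytesize ? nil : valid_len
end

p first_malformed_offset("\032p\210\004n\306\271\310gY\002")  # => 2
```

Anchoring with \A instead of ^ keeps the match from restarting after a newline in the data.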
For your input, the \210 at position 2 (counting from 0) is the problem:
it is a continuation byte (0x88) with no lead byte in front of it, so
the regex won't allow it. I haven't verified the regular expression
myself, so I'm not 100% sure it is correct, but I'm guessing it is.
Anyway, now you can tell where in the data things blow up,
nikolai
--
::: name: Nikolai Weibull :: aliases: pcp / lone-star / aka :::
::: born: Chicago, IL USA :: loc atm: Gothenburg, Sweden :::
::: page: www.pcppopper.org :: fun atm: gf,lps,ruby,lisp,war3 :::
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}