comp.lang.ruby

Malformed UTF-8?

Ian Macdonald

3/11/2005 12:05:00 AM

Hello,

We have a commercial calendaring application at work that conveniently
offers a C API. I have wrapped this API in the form of
Ruby/CorporateTime.

Recently, we've started to see ArgumentError exceptions being thrown by
the library, as it discovers calendar events that it believes to contain
malformed UTF-8.

One such allegedly bad string is the following:

irb(main):001:0> "\032p\210\004n\306\271\310gY\002".unpack("U*")
ArgumentError: malformed UTF-8 character
from (irb):1:in `unpack'
from (irb):1

This is supposed to be Japanese. Can a Japanese reader please confirm
that this is, indeed, malformed UTF-8? I need to be sure that the bug
does not lie with Ruby before I get back to our calendar admin and tell
him to go and pester Oracle.

Thanks,

Ian
--
Ian Macdonald | He who has the courage to laugh is almost
System Administrator | as much a master of the world as he who is
ian@caliban.org | ready to die. -- Giacomo Leopardi
http://www.c... |
|


4 Answers

Simon Strandgaard

3/11/2005 12:33:00 AM

On Fri, 11 Mar 2005 09:05:11 +0900, Ian Macdonald <ian@caliban.org> wrote:
> One such allegedly bad string is the following:
>
> irb(main):001:0> "\032p\210\004n\306\271\310gY\002".unpack("U*")
> ArgumentError: malformed UTF-8 character
> from (irb):1:in `unpack'
> from (irb):1
>
> This is supposed to be Japanese. Can a Japanese reader please confirm
> that this is, indeed, malformed UTF-8? I need to be sure that the bug
> does not lie with Ruby before I get back to our calendar admin and tell
> him to go and pester Oracle.


The substring "\210\004" is invalid UTF-8.
In hex it's [0x88, 0x04].

0x88 has its uppermost bit set, so this is a dual-byte sequence.
0x04 is not a valid continuation byte (upper bit should have been 1).

--
Simon Strandgaard


Simon Strandgaard

3/11/2005 12:37:00 AM

On Fri, 11 Mar 2005 01:33:15 +0100, Simon Strandgaard <neoneye@gmail.com> wrote:
> On Fri, 11 Mar 2005 09:05:11 +0900, Ian Macdonald <ian@caliban.org> wrote:
> > One such allegedly bad string is the following:
> >
> > irb(main):001:0> "\032p\210\004n\306\271\310gY\002".unpack("U*")
> > ArgumentError: malformed UTF-8 character
> > from (irb):1:in `unpack'
> > from (irb):1
> >
> > This is supposed to be Japanese. Can a Japanese reader please confirm
> > that this is, indeed, malformed UTF-8? I need to be sure that the bug
> > does not lie with Ruby before I get back to our calendar admin and tell
> > him to go and pester Oracle.
>
> the substring "\210\004" is invalid UTF8.
> in hex its [0x88, 0x04].
>
> 0x88 has its uppermost bit set, so this is a dual byte sequence.
> 0x04 is not a valid continuation byte (upper bit should have been 1).

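Forget this explanation, it's wrong. (I misread my test case.)


0x88 is not a valid first byte for a sequence.
To be a valid first byte of a multi-byte sequence, the two uppermost bits
must both be set (11xxxxxx).
0x88 is 10001000: only the top bit of the two is set, which marks a
continuation byte, and here no lead byte precedes it.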
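
The lead-byte versus continuation-byte distinction can be sketched in a few lines of Ruby. This is only an illustration, assuming a Ruby with String#b (2.0 or later); the helper name utf8_byte_kind is made up for this example and is not part of any library. It looks at the high bits alone, so it does not catch overlong encodings or surrogates.

```ruby
# Classify a single byte by its high bits (structure only, not full validity).
def utf8_byte_kind(byte)
  case byte
  when 0x00..0x7F then :ascii         # 0xxxxxxx
  when 0x80..0xBF then :continuation  # 10xxxxxx
  when 0xC0..0xDF then :lead2         # 110xxxxx
  when 0xE0..0xEF then :lead3         # 1110xxxx
  when 0xF0..0xF7 then :lead4         # 11110xxx
  else                 :invalid       # 11111xxx never appears in UTF-8
  end
end

# The string from the original post, taken as raw bytes.
bytes = "\032p\210\004n\306\271\310gY\002".b.bytes

# Print index, hex value, bit pattern and classification for each byte.
bytes.each_with_index do |b, i|
  printf("%2d  0x%02X  %08b  %s\n", i, b, b, utf8_byte_kind(b))
end
```

In the listing, index 2 (0x88) comes out as a continuation byte even though the byte before it is plain ASCII, which is the problem described above.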

--
Simon Strandgaard


Nikolai Weibull

3/11/2005 1:05:00 AM

* Ian Macdonald (Mar 11, 2005 01:30):
> irb(main):001:0> "\032p\210\004n\306\271\310gY\002".unpack("U*")
> ArgumentError: malformed UTF-8 character
> from (irb):1:in `unpack'
> from (irb):1

utf8validate.rb:

--- cut here ---
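#! /usr/bin/ruby -w

# Compare raw bytes: String#b returns a binary (ASCII-8BIT) copy.
input = ARGV[0].b

# \A anchors the match at the start of the input; the n flag makes the
# \xNN escapes match single bytes on Rubies with per-string encodings.
input =~ /\A(
    [\x00-\x7F]                        # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*/nx

# The match always succeeds (possibly empty) and stops at the first bad byte.
if $~.end(0) != input.bytesize
  printf("malformed UTF-8 character starting at position %d in the input\n", $~.end(0))
  exit 1
end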
--- cut here ---

and from zsh:

% utf8validate.rb $'\032p\210\004n\306\271\310gY\002'
malformed UTF-8 character starting at position 2 in the input
%

For your input, the \210 byte is the problem: the regex won't accept it
at that position. I haven't verified the regular expression
independently, so I'm not 100% sure it's correct, but I'm guessing it
is. Either way, you can now tell where in the data things blow up.
nikolai
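
On a Ruby with per-string encodings (1.9 or later), the same position can be found without a hand-written regex. The sketch below is one possible approach, not part of the script above: it tags the bytes as UTF-8 and walks the characters until one fails String#valid_encoding?. The method name first_invalid_utf8_offset is invented for this example.

```ruby
# Return the byte offset of the first malformed UTF-8 sequence, or nil if
# the whole string is well-formed.
def first_invalid_utf8_offset(data)
  str = data.dup.force_encoding(Encoding::UTF_8)
  return nil if str.valid_encoding?

  offset = 0
  str.each_char do |ch|
    # Malformed bytes are yielded as characters that fail the validity check.
    return offset unless ch.valid_encoding?
    offset += ch.bytesize
  end
  nil
end

sample = "\032p\210\004n\306\271\310gY\002"
offset = first_invalid_utf8_offset(sample)

if offset
  printf("malformed UTF-8 character starting at position %d in the input\n", offset)
else
  puts "input is well-formed UTF-8"
end
```

For the string from the original post this reports position 2, the \210 byte.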

--
::: name: Nikolai Weibull :: aliases: pcp / lone-star / aka :::
::: born: Chicago, IL USA :: loc atm: Gothenburg, Sweden :::
::: page: www.pcppopper.org :: fun atm: gf,lps,ruby,lisp,war3 :::
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}


Ian Macdonald

3/11/2005 8:37:00 AM

On Fri 11 Mar 2005 at 10:05:26 +0900, Nikolai Weibull wrote:

> * Ian Macdonald (Mar 11, 2005 01:30):
> > irb(main):001:0> "\032p\210\004n\306\271\310gY\002".unpack("U*")
> > ArgumentError: malformed UTF-8 character
> > from (irb):1:in `unpack'
> > from (irb):1
>
> utf8validate.rb:
>
> --- cut here ---
> #! /usr/bin/ruby -w
>
> ARGV[0] =~ /^(
> [\x00-\x7F] # ASCII
> | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
> | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
> | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
> | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
> | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
> | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
> | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
> )*/x
>
> if $~.end(0) != ARGV[0].length
> printf("malformed UTF-8 character starting at position %d in the input\n", $~.end(0))
> exit 1
> end
> --- cut here ---
>
> and from zsh:
>
> % utf8validate.rb $'p\210\004n\306\271\310gY\002'
> malformed UTF-8 character starting at position 2 in the input
> %
>
> For your input, the \210 is wrong, as this regex won't allow it. I'm
> not 100% sure that this is actually correct, as I haven't verified that
> the regular expression is correct, but I'm guessing it is. Anyway, now
> you can tell where in the data things blow up,
> nikolai

My thanks to you and Simon. It's especially nice to see a formal
definition of UTF-8 encapsulated in your regex. I wasn't aware of the
formal definition until someone at work pointed me at this excellent
resource:

http://en.wikipedia.org/...

Ian
--
Ian Macdonald | Arrakis teaches the attitude of the knife -
System Administrator | chopping off what's incomplete and saying:
ian@caliban.org | "Now it's complete because it's ended
http://www.c... | here." -- Muad'dib, "Dune"
|