Asp Forum - Some questions about decode/encode

rong.xian

1/24/2008 3:49:00 AM

I use Chinese characters as an example here.

>>>s1='你好吗'
>>>repr(s1)
"'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
>>>b1=s1.decode('GBK')

My first question is: what strategy does 'decode' use to decide where
to separate the words? Since s1 is a multi-byte character string, how
does it determine whether to split the string every 2 bytes or 1 byte?

My second question is: has anyone tested decoding a very long mbcs
string? Yesterday I tried to decode a long (20+ MB) XML file, and the
result was very strange and caused SAX to fail to parse the decoded
string. However, when I used a text editor to convert the file to
UTF-8, SAX parsed the content successfully.

I'm not sure whether some special byte sequence or the length of the
text caused this problem. Or maybe that's a bug in Python 2.5?

23 Answers

Ben Finney

1/24/2008 5:03:00 AM

glacier <rong.xian@gmail.com> writes:

> I use chinese charactors as an example here.
>
> >>>s1='你好吗'
> >>>repr(s1)
> "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> >>>b1=s1.decode('GBK')
>
> My first question is : what strategy does 'decode' use to tell the way
> to seperate the words. I mean since s1 is an multi-bytes-char string,
> how did it determine to seperate the string every 2bytes or 1byte?

The codec you specified ("GBK") is, like any character-encoding codec,
a precise mapping between characters and bytes. It's almost certainly
not aware of "words", only character-to-byte mappings.

--
\ "When I get new information, I change my position. What, sir, |
`\ do you do with new information?" -- John Maynard Keynes |
_o__) |
Ben Finney

Ben Finney

1/24/2008 5:41:00 AM

Ben Finney <bignose+hates-spam@benfinney.id.au> writes:

> glacier <rong.xian@gmail.com> writes:
>
> > I use chinese charactors as an example here.
> >
> > >>>s1='你好吗'
> > >>>repr(s1)
> > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> > >>>b1=s1.decode('GBK')
> >
> > My first question is : what strategy does 'decode' use to tell the
> > way to seperate the words. I mean since s1 is an multi-bytes-char
> > string, how did it determine to seperate the string every 2bytes
> > or 1byte?
>
> The codec you specified ("GBK") is, like any character-encoding
> codec, a precise mapping between characters and bytes. It's almost
> certainly not aware of "words", only character-to-byte mappings.

To be clear, I should point out that I didn't mean to imply static
tabular mappings only. The mappings in a character encoding are often
more complex and algorithmic.

That doesn't make them any less precise, of course; and the core point
is that a character-mapping codec is *only* about getting between
characters and bytes, nothing else.

--
\ "He who laughs last, thinks slowest." -- Anonymous |
`\ |
_o__) |
Ben Finney

bbtestingbb

1/24/2008 5:49:00 AM

On Jan 23, 8:49 pm, glacier <rong.x...@gmail.com> wrote:
> I use chinese charactors as an example here.
>
> >>>s1='你好吗'
> >>>repr(s1)
>
> "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
>
> >>>b1=s1.decode('GBK')
>
> My first question is : what strategy does 'decode' use to tell the way
> to seperate the words.

decode() uses the GBK strategy you specified to determine what
constitutes a character in your string.

> My second question is: is there any one who has tested very long mbcs
> decode? I tried to decode a long(20+MB) xml yesterday, which turns out
> to be very strange and cause SAX fail to parse the decoded string.
> However, I use another text editor to convert the file to utf-8 and
> SAX will parse the content successfully.
>
> I'm not sure if some special byte array or too long text caused this
> problem. Or maybe thats a BUG of python 2.5?

That's probably too vague a description to determine why SAX isn't
doing what you expect it to.

rong.xian

1/24/2008 6:52:00 AM

On Jan 24, 1:41 pm, Ben Finney <bignose+hates-s....@benfinney.id.au>
wrote:
> Ben Finney <bignose+hates-s...@benfinney.id.au> writes:
> > glacier <rong.x...@gmail.com> writes:
>
> > > I use chinese charactors as an example here.
>
> > > >>>s1='你好吗'
> > > >>>repr(s1)
> > > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> > > >>>b1=s1.decode('GBK')
>
> > > My first question is : what strategy does 'decode' use to tell the
> > > way to seperate the words. I mean since s1 is an multi-bytes-char
> > > string, how did it determine to seperate the string every 2bytes
> > > or 1byte?
>
> > The codec you specified ("GBK") is, like any character-encoding
> > codec, a precise mapping between characters and bytes. It's almost
> > certainly not aware of "words", only character-to-byte mappings.
>
> To be clear, I should point out that I didn't mean to imply static
> tabular mappings only. The mappings in a character encoding are often
> more complex and algorithmic.
>
> That doesn't make them any less precise, of course; and the core point
> is that a character-mapping codec is *only* about getting between
> characters and bytes, nothing else.
>
> --
> \ "He who laughs last, thinks slowest." -- Anonymous |
> `\ |
> _o__) |
> Ben Finney

Thanks for your response :)

When I mentioned 'word' in the previous post, I meant character.
According to your reply, what will happen if I try to decode a long
string in separate pieces?
I mean:
######################################
a='你好吗'*100000
s1 = u''
cur = 0
while cur < len(a):
    d = min(len(a)-cur,1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################

Could the code above produce any bogus characters in s1?

Thanks :)

rong.xian

1/24/2008 7:29:00 AM

On Jan 24, 1:49 pm, bbtestin...@gmail.com wrote:
> On Jan 23, 8:49 pm, glacier <rong.x...@gmail.com> wrote:
>
> > I use chinese charactors as an example here.
>
> > >>>s1='你好吗'
> > >>>repr(s1)
>
> > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
>
> > >>>b1=s1.decode('GBK')
>
> > My first question is : what strategy does 'decode' use to tell the way
> > to seperate the words.
>
> decode() uses the GBK strategy you specified to determine what
> constitutes a character in your string.
>
> > My second question is: is there any one who has tested very long mbcs
> > decode? I tried to decode a long(20+MB) xml yesterday, which turns out
> > to be very strange and cause SAX fail to parse the decoded string.
> > However, I use another text editor to convert the file to utf-8 and
> > SAX will parse the content successfully.
>
> > I'm not sure if some special byte array or too long text caused this
> > problem. Or maybe thats a BUG of python 2.5?
>
> That's probably to vague of a description to determine why SAX isn't
> doing what you expect it to.

Do you mean I should post a copy of the XML document?

Gabriel Genellina

1/24/2008 7:30:00 AM

On Thu, 24 Jan 2008 04:52:22 -0200, glacier <rong.xian@gmail.com> wrote:

> According to your reply, what will happen if I try to decode a long
> string seperately.
> I mean:
> ######################################
> a='你好吗'*100000
> s1 = u''
> cur = 0
> while cur < len(a):
>     d = min(len(a)-cur,1023)
>     s1 += a[cur:cur+d].decode('mbcs')
>     cur += d
> ######################################
>
> May the code above produce any bogus characters in s1?

Don't do that. You might be splitting the input string at a point that is
not a character boundary. You won't get bogus output, decode will raise a
UnicodeDecodeError instead.
You can control how errors are handled, see
http://docs.python.org/lib/string-methods.ht...
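
To illustrate the point about character boundaries, here is a minimal sketch. It is written in Python 3 syntax (bytes literals) rather than the Python 2.5 used in this thread, and it uses the 'gbk' codec instead of 'mbcs' so that it runs on any platform; both choices are assumptions made for the example. It shows a fixed-size slice failing to decode, and the standard library's incremental decoder handling the same chunks by carrying the dangling byte over to the next call.

```python
import codecs

# GBK bytes for a three-character string, repeated (6000 bytes in total).
data = b'\xc4\xe3\xba\xc3\xc2\xf0' * 1000

# A 1023-byte slice ends in the middle of a two-byte character,
# so decoding that slice on its own raises UnicodeDecodeError.
try:
    data[:1023].decode('gbk')
    split_failed = False
except UnicodeDecodeError:
    split_failed = True

# An incremental decoder keeps the incomplete trailing byte between calls,
# so arbitrary chunk sizes are safe.
decoder = codecs.getincrementaldecoder('gbk')()
pieces = []
for start in range(0, len(data), 1023):
    pieces.append(decoder.decode(data[start:start + 1023]))
pieces.append(decoder.decode(b'', final=True))  # flush; raises if bytes are left over
chunked = ''.join(pieces)
```

If chunking is needed at all (for example when reading a large file piece by piece), the incremental decoder is one way to do it without having to work out the boundaries by hand.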

--
Gabriel Genellina

Marc 'BlackJack' Rintsch

1/24/2008 8:44:00 AM

On Wed, 23 Jan 2008 19:49:01 -0800, glacier wrote:

> My second question is: is there any one who has tested very long mbcs
> decode? I tried to decode a long(20+MB) xml yesterday, which turns out
> to be very strange and cause SAX fail to parse the decoded string.

That's because SAX wants bytes, not a decoded string. Don't decode it
yourself.

> However, I use another text editor to convert the file to utf-8 and
> SAX will parse the content successfully.

Because now you feed SAX with bytes instead of a unicode string.
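
A minimal sketch of feeding SAX encoded bytes, mirroring the UTF-8 conversion that worked for the original poster. It uses Python 3 syntax and a hypothetical handler class named TextCollector; the tiny document is made up for the example.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so collect them all.
        self.chunks.append(content)


# Encoded bytes with a matching encoding declaration; the parser decodes them.
xml_bytes = ('<?xml version="1.0" encoding="UTF-8"?>'
             '<greeting>你好吗</greeting>').encode('utf-8')

handler = TextCollector()
xml.sax.parseString(xml_bytes, handler)
text = ''.join(handler.chunks)
```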

Ciao,
Marc 'BlackJack' Rintsch

John Machin

1/24/2008 9:52:00 AM

On Jan 24, 2:49 pm, glacier <rong.x...@gmail.com> wrote:
> I use chinese charactors as an example here.
>
> >>>s1='你好吗'
> >>>repr(s1)
>
> "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
>
> >>>b1=s1.decode('GBK')
>
> My first question is : what strategy does 'decode' use to tell the way
> to seperate the words. I mean since s1 is an multi-bytes-char string,
> how did it determine to seperate the string every 2bytes or 1byte?
>

The usual strategy for encodings like GBK is:
1. If the current byte is less than 0x80, then it's a 1-byte
character.
2. Current byte 0x81 to 0xFE inclusive: current byte and the next byte
make up a two-byte character.
3. Current byte 0x80: undefined (or used e.g. in cp936 for the 1-byte
euro character)
4. Current byte 0xFF: undefined
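
The rules above can be sketched as a small scanner. This is only an illustration in Python 3 syntax, not how the real codec is implemented: the function name split_gbk_like is made up, rule 3 is treated as an error for simplicity, and the second byte of each pair is not validated the way an actual GBK decoder would.

```python
def split_gbk_like(data):
    """Split bytes into per-character byte groups using the lead-byte rules."""
    groups = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            width = 1              # rule 1: single-byte character
        elif 0x81 <= lead <= 0xFE:
            width = 2              # rule 2: lead byte plus one following byte
        else:
            # rules 3 and 4: 0x80 and 0xFF are treated as undefined here
            raise ValueError('undefined lead byte 0x%02X at offset %d' % (lead, i))
        if i + width > len(data):
            raise ValueError('truncated character at offset %d' % i)
        groups.append(data[i:i + width])
        i += width
    return groups


groups = split_gbk_like(b'A\xc4\xe3\xba\xc3\xc2\xf0')
```

Each group can then be decoded on its own, because every cut falls on a character boundary.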

Cheers,
John

7stud --

1/24/2008 4:13:00 PM

On Jan 24, 1:44 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> On Wed, 23 Jan 2008 19:49:01 -0800, glacier wrote:
> > My second question is: is there any one who has tested very long mbcs
> > decode? I tried to decode a long(20+MB) xml yesterday, which turns out
> > to be very strange and cause SAX fail to parse the decoded string.
>
> That's because SAX wants bytes, not a decoded string. Don't decode it
> yourself.
>

encode() converts a unicode string to a regular string. decode()
converts a regular string to a unicode string. So I think what Marc
is saying is that SAX needs a regular string (i.e. bytes), not a decoded
string (i.e. a unicode string).
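
A short round trip showing the two directions. Note the assumption: this is Python 3 syntax, where the text type is called str and the byte type is called bytes; in Python 2.5 the same roles are played by unicode and str respectively.

```python
text = '你好吗'                  # text (Python 2: a unicode string)
raw = text.encode('gbk')        # encode: text -> bytes
back = raw.decode('gbk')        # decode: bytes -> text
```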

rong.xian

1/27/2008 10:17:00 AM

On Jan 24, 3:29 pm, "Gabriel Genellina" <gagsl-....@yahoo.com.ar> wrote:
> On Thu, 24 Jan 2008 04:52:22 -0200, glacier <rong.x...@gmail.com> wrote:
>
> > According to your reply, what will happen if I try to decode a long
> > string seperately.
> > I mean:
> > ######################################
> > a='你好吗'*100000
> > s1 = u''
> > cur = 0
> > while cur < len(a):
> >     d = min(len(a)-cur,1023)
> >     s1 += a[cur:cur+d].decode('mbcs')
> >     cur += d
> > ######################################
>
> > May the code above produce any bogus characters in s1?
>
> Don't do that. You might be splitting the input string at a point that is
> not a character boundary. You won't get bogus output, decode will raise a
> UnicodeDecodeError instead.
> You can control how errors are handled, see http://docs.python.org/lib/string-methods.ht...
>
> --
> Gabriel Genellina

Thanks Gabriel,

I think I understand what will happen if I don't split the string at
a character boundary.
What I'm not sure about is whether the decode method itself might
split at the wrong boundary.
Can you tell me?

Thanks a lot.

comp.lang.python
