rong.xian
1/24/2008 6:52:00 AM
On Jan 24, 1:41, Ben Finney <bignose+hates-s....@benfinney.id.au>
wrote:
> Ben Finney <bignose+hates-s...@benfinney.id.au> writes:
> > glacier <rong.x...@gmail.com> writes:
>
> > > I use Chinese characters as an example here.
>
> > > >>>s1='你好吗'
> > > >>>repr(s1)
> > > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> > > >>>b1=s1.decode('GBK')
>
> > > My first question is: what strategy does 'decode' use to decide
> > > how to separate the words? I mean, since s1 is a multi-byte
> > > character string, how does it decide whether to split the string
> > > every 2 bytes or every 1 byte?
>
> > The codec you specified ("GBK") is, like any character-encoding
> > codec, a precise mapping between characters and bytes. It's almost
> > certainly not aware of "words", only character-to-byte mappings.
>
> To be clear, I should point out that I didn't mean to imply static
> tabular mappings only. The mappings in a character encoding are often
> more complex and algorithmic.
>
> That doesn't make them any less precise, of course; and the core point
> is that a character-mapping codec is *only* about getting between
> characters and bytes, nothing else.
>
> --
> \ "He who laughs last, thinks slowest." -- Anonymous |
> `\ |
> _o__) |
> Ben Finney
Thanks for your response :)
When I mentioned 'word' in the previous post, I meant 'character'.
Going by your reply, what will happen if I decode a long
string in separate pieces?
I mean:
######################################
a = '你好吗' * 100000        # byte string, 2 bytes per character
s1 = u''
cur = 0
while cur < len(a):
    # take at most 1023 bytes per piece
    d = min(len(a) - cur, 1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################
Could the code above produce any bogus characters in s1?
Thanks :)
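
To make the question concrete, here is a minimal sketch of the two ways the pieces could be decoded. It rests on a few assumptions that are not in the loop above: it is written for Python 3 (bytes literals and str results), it uses 'gbk' as a stand-in for 'mbcs' because 'mbcs' exists only on Windows, and the helper names naive_decode_in_chunks and decode_in_chunks are made up for illustration. The naive version decodes every slice on its own, so a slice boundary that falls between the two bytes of a character damages the text; the second version uses codecs.getincrementaldecoder, which holds back an incomplete trailing byte until the next piece arrives.

```python
import codecs

# Assumption: GBK stands in for 'mbcs', which is only available on Windows.
data = '你好吗'.encode('gbk') * 1000   # 6000 bytes, 2 bytes per character


def naive_decode_in_chunks(data, chunk_size, encoding='gbk'):
    """Decode each slice independently (hypothetical helper).

    errors='replace' is used so that a character cut in half shows up
    as damaged output instead of raising UnicodeDecodeError.
    """
    parts = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        parts.append(chunk.decode(encoding, 'replace'))
    return ''.join(parts)


def decode_in_chunks(data, chunk_size, encoding='gbk'):
    """Decode slice by slice with one stateful decoder (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for start in range(0, len(data), chunk_size):
        # An incomplete trailing byte is buffered until the next call.
        parts.append(decoder.decode(data[start:start + chunk_size]))
    # final=True flushes the decoder and reports any leftover partial character.
    parts.append(decoder.decode(b'', final=True))
    return ''.join(parts)


expected = data.decode('gbk')
print(naive_decode_in_chunks(data, 1023) == expected)
print(decode_in_chunks(data, 1023) == expected)
```

With an odd piece size such as 1023 and two-byte characters, the first slice already ends in the middle of a character, which is exactly the case I am worried about. An even piece size happens to stay aligned for this particular string, but that would not hold for text that mixes one-byte and two-byte characters.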