comp.lang.python
Unicode/UTF-8 confusion
Tom Stambaugh
3/15/2008 4:09:00 PM
I'm still confused about this, even after days of hacking at it. It's time I
asked for help. I understand that each of you knows more about Python,
Javascript, unicode, and programming than me, and I understand that each of
you has a higher SAT score than me. So please try and be gentle with your
responses.
I use simplejson to serialize html strings that the server is delivering to
a browser. Since the apostrophe is a string terminator in javascript, I need
to escape any apostrophe embedded in the html.
Just to be clear, the specific unicode character I'm struggling with is
described in Python as:
u'\N{APOSTROPHE}'. It has a standardized utf-8 value (according to, for
example, http://www.fileformat.info/info/unicode/char/0027...) of 0x27.
This can be expressed in several common ways:
hex: 0x27
Python literal: u"\u0027"
Suppose I start with some test string that contains an embedded
apostrophe -- for example: u" ' ". I believe that the appropriate json
serialization of this is (presented as a list to eliminate notation
ambiguities):
['"', ' ', ' ', ' ', '\\', '\\', '0', '0', '2', '7', ' ', ' ', ' ', '"']
This is a 14-character utf-8 serialization of the above test string.
I know I can brute-force this, using something like the following:
import simplejson

def encode(aRawString):
    aReplacement = ''.join(['\\', '0', '0', '2', '7'])
    aCookedString = aRawString.replace("'", aReplacement)
    answer = simplejson.dumps(aCookedString)
    return answer
I can't even make mailers let me *TYPE* a string literal for the replacement
string without trying to turn it into an HTML link!
Anyway, I know that my "encode" function works, but it pains me to add that
"replace" call before *EVERY* invocation of the simplejson.dumps() method.
The reason I upgraded to 1.7.4 was to get the c-level speedup routine now
offered by simplejson -- yet the need to do this apostrophe escaping seems
to negate this advantage! Is there perhaps some combination of dumps keyword
arguments, python encode()/str() magic, or something similar that
accomplishes this same result?
What is the highest-performance way to get simplejson to emit the desired
serialization of the given test string?
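One way to check the brute-force approach described above is to decode its output again. The sketch below is an assumption-laden stand-in, not the original setup: it uses the standard-library json module in place of simplejson 1.7.4 and runs under Python 3. Because the backslash is inserted before serialization, the serializer escapes it a second time, which matches the two backslashes shown in the character list above.

```python
import json

def encode(raw_string):
    # Same steps as the function in the question: replace first, then dump.
    # json (standard library) stands in for simplejson here; that is an assumption.
    replacement = "".join(["\\", "0", "0", "2", "7"])
    cooked = raw_string.replace("'", replacement)
    return json.dumps(cooked)

encoded = encode(" ' ")
decoded = json.loads(encoded)

print(list(encoded))   # the serialized characters, one per list element
print(repr(decoded))   # what a JSON parser recovers from them
print(decoded == " ' ")
```

Under these assumptions the decoded value contains a literal backslash followed by 0027 instead of an apostrophe, so the round trip does not return the original string.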
1 Answer
Marc 'BlackJack' Rintsch
3/15/2008 4:57:00 PM
Somehow I don't get what you are after. The ' doesn't have to be escaped
at all if " is used to delimit the string. If ' is used as the delimiter,
then \' is a correct escape. What is the problem with that!?
Ciao,
Marc 'BlackJack' Rintsch
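The point made in the reply can be illustrated with a short sketch. It assumes the standard-library json module as a stand-in for simplejson, and the helper name dumps_js_safe is hypothetical. JSON strings are delimited by double quotes, so the serializer leaves an apostrophe alone. If the output must still be pasted inside a single-quoted JavaScript literal, one option is a single replace on the serialized text, using the JSON escape for U+0027.

```python
import json

def dumps_js_safe(value):
    # Hypothetical helper: serialize first, then rewrite apostrophes as the
    # six-character JSON escape (backslash, u, 0, 0, 2, 7). In JSON output an
    # apostrophe can only occur inside a string, so the replace is safe.
    return json.dumps(value).replace("'", "\\u0027")

test_string = " ' "
plain = json.dumps(test_string)
safe = dumps_js_safe(test_string)

print(plain)  # " ' "
print(safe)   # " \u0027 "
print(json.loads(plain) == json.loads(safe) == test_string)  # True
```

Doing the replace after serialization keeps the backslash from being escaped a second time, and it runs once per dumps call on the finished text.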