Asp Forum - Embedding a literal "\u" in a unicode raw string.

Romano Giannetti

2/25/2008 11:39:00 AM

Hi,

while writing some LaTeX preprocessing code, I stumbled into this problem: (I
have a -*- coding: utf-8 -*- line, obviously)

s = ur"añado $\uparrow$"

Which gave an error because the \u escape is interpreted in raw unicode strings,
too. So I found that the only way to solve this is to write:

s = unicode(r"añado $\uparrow$", "utf-8")

or

s = ur"añado $\u005cuparrow$"

The second one is too ugly to live, while the first is at least acceptable; but
looking around the Python 3.0 doc, I saw that the first one will fail, too.

Am I doing something wrong here, or is there another solution for this?

Romano
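
The error can be reproduced on Python 3, assuming the raw_unicode_escape codec mirrors how Python 2 read the body of a ur"..." literal (backslashes left alone, \uXXXX still expanded). This is only a sketch, and the helper name py2_raw_unicode is made up for it:

```python
def py2_raw_unicode(body):
    """Approximate how Python 2 read the body of a ur"..." literal."""
    # raw_unicode_escape leaves backslashes alone but still expands
    # \uXXXX and \UXXXXXXXX, which is the behaviour the post ran into.
    return body.encode("raw_unicode_escape").decode("raw_unicode_escape")


# The second workaround from the post: \u005c expands to one backslash.
fixed = py2_raw_unicode(r"añado $\u005cuparrow$")
assert fixed == r"añado $\uparrow$"

# The failing literal: "\u" is followed by "parr", not four hex digits.
try:
    py2_raw_unicode(r"añado $\uparrow$")
except UnicodeDecodeError as exc:
    error = exc
else:
    error = None
```

On Python 3 the r"..." literals above are taken verbatim, so the expansion only happens inside the helper.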

7 Answers

Diez B. Roggisch

2/25/2008 12:47:00 PM

Romano Giannetti wrote:

> Hi,
>
> while writing some LaTeX preprocessing code, I stumbled into this problem:
> (I have a -*- coding: utf-8 -*- line, obviously)
>
> s = ur"añado $\uparrow$"
>
> Which gave an error because the \u escape is interpreted in raw unicode
> strings, too. So I found that the only way to solve this is to write:
>
> s = unicode(r"añado $\uparrow$", "utf-8")
>
> or
>
> s = ur"añado $\u005cuparrow$"
>
> The second one is too ugly to live, while the first is at least
> acceptable; but looking around the Python 3.0 doc, I saw that the first
> one will fail, too.
>
> Am I doing something wrong here, or is there another solution for this?

Why don't you rid yourself of the raw-string? Then you need to do

s = u"añado $\\uparrow$"

which is considerably easier to read than both other variants above.
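
As a quick check, written in Python 3 syntax (where every str literal is Unicode and the u prefix is optional), the doubled backslash yields exactly one backslash in the resulting string. The variable names are only illustrative:

```python
# Non-raw literal: the backslash is doubled so that \u is not read as an escape.
escaped = "añado $\\uparrow$"

# Raw literal: Python 3 leaves \u alone inside r"...".
raw = r"añado $\uparrow$"

# Both spellings produce the same text, with a single backslash in it.
assert escaped == raw
backslashes = escaped.count("\\")
```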

Diez

OKB (not okblacke)

2/25/2008 5:04:00 PM

Romano Giannetti wrote:

> Hi,
>
> while writing some LaTeX preprocessing code, I stumbled into this
> problem: (I have a -*- coding: utf-8 -*- line, obviously)
>
> s = ur"añado $\uparrow$"
>
> Which gave an error because the \u escape is interpreted in raw
> unicode strings, too. So I found that the only way to solve this is
> to write:
>
> s = unicode(r"añado $\uparrow$", "utf-8")
>
> or
>
> s = ur"añado $\u005cuparrow$"
>
> The second one is too ugly to live, while the first is at least
> acceptable; but looking around the Python 3.0 doc, I saw that the
> first one will fail, too.
>
> Am I doing something wrong here, or is there another solution for
> this?

I too encountered this problem, in the same situation (making
strings that contain LaTeX commands). One possibility is to separate
out just the bit that has the \u, and use string juxtaposition to attach
it to the others:

s = ur"añado " u"$\\uparrow$"

It's not ideal, but I think it's easier to read than your solution
#2.
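
A small sketch of the same juxtaposition in Python 3 syntax (plain r"..." and "..." instead of ur"..." and u"..."); the \textbf command is just an example of LaTeX markup that benefits from staying raw:

```python
# Adjacent string literals are joined at compile time, so each piece can
# use whichever prefix suits it: raw for the LaTeX markup, non-raw with a
# doubled backslash for the part that starts with \u.
s = r"\textbf{añado} " "$\\uparrow$"

# The joined result is the same as writing everything with doubled backslashes.
expected = "\\textbf{añado} $\\uparrow$"
assert s == expected
```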

--
--OKB (not okblacke)
Brendan Barnwell
"Do not follow where the path may lead. Go, instead, where there is
no path, and leave a trail."
--author unknown

Romano Giannetti

2/25/2008 5:24:00 PM

On Feb 25, 6:03 pm, "OKB (not okblacke)"
<brenNOSPAMb...@NObrenSPAMbarn.net> wrote:
>
> I too encountered this problem, in the same situation (making
> strings that contain LaTeX commands). One possibility is to separate
> out just the bit that has the \u, and use string juxtaposition to attach
> it to the others:
>
> s = ur"añado " u"$\\uparrow$"
>
> It's not ideal, but I think it's easier to read than your solution
> #2.
>

Yes, I think I will do something like that, although... I really do
not understand why \x5c is not interpreted in a raw string but \u005c
is interpreted in a unicode raw string. It is, well, not elegant. Raw
should be raw...

Thanks anyway

Martin v. Loewis

2/25/2008 10:28:00 PM

> Yes, I think I will do something like that, although... I really do
> not understand why \x5c is not interpreted in a raw string but \u005c
> is interpreted in a unicode raw string. It is, well, not elegant. Raw
> should be raw...

Right. IMO, this is just a plain design mistake in the Python Unicode
handling. Unfortunately, there was discussion about this specific issue
in the past, and the proponent of the status quo always defended it,
with the rationale (IIUC) that a) without that, you can't put arbitrary
Unicode characters into a string, and b) the semantics of \u in Java and
C are such that \u gets processed before tokenization even starts, and
it should be the same in Python.

Regards,
Martin

Romano Giannetti

2/25/2008 10:46:00 PM

On Feb 25, 11:27 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> > Raw
> > should be raw...
>
> Right. IMO, this is just a plain design mistake in the Python Unicode
> handling. Unfortunately, there was discussion about this specific issue
> in the past, and the proponent of the status quo always defended it,
> with the rationale (IIUC) that a) without that, you can't put arbitrary
> Unicode characters into a string, and b) the semantics of \u in Java and
> C are such that \u gets processed before tokenization even starts, and
> it should be the same in Python.

Well, I do not know Java, but C AFAIK has no raw strings, so you have
to use double backslashes there anyway. Raw strings are a handy
shorthand when you can generate the characters with your keyboard, and
this asymmetry quite defeats it.

Is it decided, or is it possible to lobby for it? :-)

Thanks,
Romano

BTW, 2to3.py should warn when it finds a raw string (not unicode) with
\u in it, I think. I tried it and it seems to ignore the problem...

Nick Coghlan

3/4/2008 12:00:00 PM

On Feb 26, 8:45 am, rmano <romano.gianne...@gmail.com> wrote:
> BTW, 2to3.py should warn when it finds a raw string (not unicode) with
> \u in it, I think. I tried it and it seems to ignore the problem...

Python 3.0a3+ (py3k:61229, Mar 4 2008, 21:38:15)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> r"\u"
'\\u'
>>> r"\uparrow"
'\\uparrow'
>>> r"\u005c"
'\\u005c'
>>> r"\N{REVERSE SOLIDUS}"
'\\N{REVERSE SOLIDUS}'
>>> "\u005c"
'\\'
>>> "\N{REVERSE SOLIDUS}"
'\\'

2to3.py may be ignoring a problem, but existing raw 8-bit string
literals containing a '\u' aren't going to be it. If anything is going
to have a problem with conversion to Py3k at this point, it is raw
Unicode literals that contain a Unicode escape.

Romano Giannetti

3/7/2008 3:18:00 PM

On Mar 4, 1:00 pm, NickC <ncogh...@gmail.com> wrote:
>
> Python 3.0a3+ (py3k:61229, Mar 4 2008, 21:38:15)
> [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> r"\u"
> '\\u'
> >>> r"\uparrow"
> '\\uparrow'

Nice to know... so it seems that the 3.0 doc was not updated. I think
this is the correct behaviour. Thanks

comp.lang.python
