Asp Forum - Is unicode.lower() locale-independent?

Robert Kern

1/12/2008 9:26:00 AM

The section on "String Methods"[1] in the Python documentation states that for
the case conversion methods like str.lower(), "For 8-bit strings, this method is
locale-dependent." Is there a guarantee that unicode.lower() is
locale-*in*dependent?

The section on "Case Conversion" in PEP 100[2] suggests this, but the code itself
looks like it may call the C function towlower() if it is available. On OS X
Leopard, the manpage for towlower(3) states that it "uses the current locale"
though it doesn't say exactly *how* it uses it.

This is the bug I'm trying to fix:

http://scipy.org/scipy/numpy/...
http://dev.laptop.org/t...

[1] http://docs.python.org/lib/string-me...
[2] http://www.python.org/dev/peps...

Thanks.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

14 Answers

John Machin

1/12/2008 10:46:00 AM

On Jan 12, 8:25 pm, Robert Kern <robert.k...@gmail.com> wrote:
> The section on "String Methods"[1] in the Python documentation states that for
> the case conversion methods like str.lower(), "For 8-bit strings, this method is
> locale-dependent." Is there a guarantee that unicode.lower() is
> locale-*in*dependent?
>
> The section on "Case Conversion" in PEP 100 suggests this, but the code itself
> looks like it may call the C function towlower() if it is available. On OS X
> Leopard, the manpage for towlower(3) states that it "uses the current locale"
> though it doesn't say exactly *how* it uses it.
>
> This is the bug I'm trying to fix:
>
> http://scipy.org/scipy/numpy/...
> http://dev.laptop.org/t...
>
> [1]http://docs.python.org/lib/string-me...
> [2]http://www.python.org/dev/peps...
>

The Unicode standard says that case mappings are language-dependent.
It gives the example of the Turkish dotted capital letter I and
dotless small letter i that "caused" the numpy problem. See
http://www.unicode.org/versions/Unicode4.0.0/ch05....

Here is what the Python 2.5.1 unicode implementation does in an
English-language locale:

>>> import unicodedata as ucd
>>> eyes = u"Ii\u0130\u0131"
>>> for eye in eyes:
...     print repr(eye), ucd.name(eye)
...
u'I' LATIN CAPITAL LETTER I
u'i' LATIN SMALL LETTER I
u'\u0130' LATIN CAPITAL LETTER I WITH DOT ABOVE
u'\u0131' LATIN SMALL LETTER DOTLESS I
>>> for eye in eyes:
...     print "%r %r %r %r" % (eye, eye.upper(), eye.lower(),
...                            eye.capitalize())
...
u'I' u'I' u'i' u'I'
u'i' u'I' u'i' u'I'
u'\u0130' u'\u0130' u'i' u'\u0130'
u'\u0131' u'I' u'\u0131' u'I'

The conversions for I and i are not correct for a Turkish locale.

I don't know how to repeat the above in a Turkish locale.

However it appears from your bug ticket that you have a much narrower
problem (case-shifting a small known list of English words like VOID)
and can work around it by writing your own locale-independent casing
functions. Do you still need to find out whether Python unicode
casings are locale-dependent?
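
A minimal sketch of such a locale-independent casing function, assuming the inputs are ASCII keywords such as VOID or INT. It is written for Python 3, where str is the Unicode type; the helper name ascii_lower is hypothetical and is not part of numpy or the standard library.

```python
import string

# Translation table covering only ASCII 'A'-'Z'; every other character is
# left untouched. string.ascii_uppercase is a fixed constant and does not
# change with the locale.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(text):
    """Lower-case ASCII letters only (hypothetical helper)."""
    return text.translate(_ASCII_LOWER)


print(ascii_lower("VOID"))
```

Because the table is built from a fixed alphabet, the result cannot vary with the user's locale, which is all the narrow keyword case needs.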

Cheers,
John

Robert Kern

1/12/2008 11:39:00 AM

John Machin wrote:
> On Jan 12, 8:25 pm, Robert Kern <robert.k...@gmail.com> wrote:
>> The section on "String Methods"[1] in the Python documentation states that for
>> the case conversion methods like str.lower(), "For 8-bit strings, this method is
>> locale-dependent." Is there a guarantee that unicode.lower() is
>> locale-*in*dependent?
>>
>> The section on "Case Conversion" in PEP 100 suggests this, but the code itself
>> looks like it may call the C function towlower() if it is available. On OS X
>> Leopard, the manpage for towlower(3) states that it "uses the current locale"
>> though it doesn't say exactly *how* it uses it.
>>
>> This is the bug I'm trying to fix:
>>
>> http://scipy.org/scipy/numpy/...
>> http://dev.laptop.org/t...
>>
>> [1]http://docs.python.org/lib/string-me...
>> [2]http://www.python.org/dev/peps...
>
> The Unicode standard says that case mappings are language-dependent.
> It gives the example of the Turkish dotted capital letter I and
> dotless small letter i that "caused" the numpy problem. See
> http://www.unicode.org/versions/Unicode4.0.0/ch05....

That doesn't determine the behavior of unicode.lower(), I don't think. That
specifies semantics for when one is dealing with a given language in the
abstract. That doesn't specify concrete behavior with respect to a given locale
setting on a real computer. For example, my strings 'VOID', 'INT', etc. are all
English, and I want English case behavior. The language of the data and the
transformations I want to apply to the data is English even though the user may
have set the locale to something else.

> Here is what the Python 2.5.1 unicode implementation does in an
> English-language locale:
>
> >>> import unicodedata as ucd
> >>> eyes = u"Ii\u0130\u0131"
> >>> for eye in eyes:
> ...     print repr(eye), ucd.name(eye)
> ...
> u'I' LATIN CAPITAL LETTER I
> u'i' LATIN SMALL LETTER I
> u'\u0130' LATIN CAPITAL LETTER I WITH DOT ABOVE
> u'\u0131' LATIN SMALL LETTER DOTLESS I
> >>> for eye in eyes:
> ...     print "%r %r %r %r" % (eye, eye.upper(), eye.lower(),
> ...                            eye.capitalize())
> ...
> u'I' u'I' u'i' u'I'
> u'i' u'I' u'i' u'I'
> u'\u0130' u'\u0130' u'i' u'\u0130'
> u'\u0131' u'I' u'\u0131' u'I'
>
> The conversions for I and i are not correct for a Turkish locale.
>
> I don't know how to repeat the above in a Turkish locale.

If you have the correct locale data in your operating system, this should be
sufficient, I believe:

$ LANG=tr_TR python
Python 2.4.3 (#1, Mar 14 2007, 19:01:42)
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, '')
'tr_TR'
>>> 'VOID'.lower()
'vo\xfdd'
>>> 'VOID'.lower().decode('iso-8859-9')
u'vo\u0131d'
>>> u'VOID'.lower()
u'void'
>>>
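
The same check can be sketched for Python 3, where str is the Unicode type. This is an illustration under assumptions, not a proof: the locale name "tr_TR.UTF-8" is a guess that may not exist on a given system, and the helper name lower_under_locale is hypothetical. The function reports whether the locale could actually be applied, so a missing locale is not mistaken for evidence.

```python
import locale


def lower_under_locale(text, locale_name):
    """Return (text.lower(), applied) with LC_CTYPE temporarily switched.

    'applied' is False when the requested locale is not installed.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        try:
            locale.setlocale(locale.LC_CTYPE, locale_name)
            applied = True
        except locale.Error:
            applied = False
        return text.lower(), applied
    finally:
        # Always restore the previous locale.
        locale.setlocale(locale.LC_CTYPE, saved)


result, applied = lower_under_locale("VOID", "tr_TR.UTF-8")
print(repr(result), "locale applied:", applied)
```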

> However it appears from your bug ticket that you have a much narrower
> problem (case-shifting a small known list of English words like VOID)
> and can work around it by writing your own locale-independent casing
> functions. Do you still need to find out whether Python unicode
> casings are locale-dependent?

I would still like to know. There are other places where .lower() is used in
numpy, not to mention the rest of my code.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Fredrik Lundh

1/12/2008 11:51:00 AM

Robert Kern wrote:

>> However it appears from your bug ticket that you have a much narrower
>> problem (case-shifting a small known list of English words like VOID)
>> and can work around it by writing your own locale-independent casing
>> functions. Do you still need to find out whether Python unicode
>> casings are locale-dependent?
>
> I would still like to know. There are other places where .lower() is used in
> numpy, not to mention the rest of my code.

"lower" uses the informative case mappings provided by the Unicode
character database; see

http://www.unicode.org/Public/4.1.0/uc...

afaik, changing the locale has no influence whatsoever on Python's
Unicode subsystem.

</F>

Torsten Bronger

1/12/2008 12:27:00 PM

Hallöchen!

Fredrik Lundh writes:

> Robert Kern wrote:
>
>>> However it appears from your bug ticket that you have a much
>>> narrower problem (case-shifting a small known list of English
>>> words like VOID) and can work around it by writing your own
>>> locale-independent casing functions. Do you still need to find
>>> out whether Python unicode casings are locale-dependent?
>>
>> I would still like to know. There are other places where .lower()
>> is used in numpy, not to mention the rest of my code.
>
> "lower" uses the informative case mappings provided by the Unicode
> character database; see
>
> http://www.unicode.org/Public/4.1.0/uc...
>
> afaik, changing the locale has no influence whatsoever on Python's
> Unicode subsystem.

Slightly off-topic because it's not part of the Unicode subsystem,
but I was once irritated that the non-breaking space (codepoint xa0,
I think) was included in string.whitespace. I cannot reproduce it
on my current system anymore, but I was pretty sure it occurred with
a fr_FR.UTF-8 locale. Is this possible? And who is to blame, or
must my program cope with such things?

Tschö,
Torsten.

--
Torsten Bronger, aquisgrana, europa vetus
Jabber ID: bronger@jabber.org
(See http://ime.... for further contact info.)

John Machin

1/12/2008 9:12:00 PM

On Jan 12, 10:51 pm, Fredrik Lundh <fred...@pythonware.com> wrote:
> Robert Kern wrote:
> >> However it appears from your bug ticket that you have a much narrower
> >> problem (case-shifting a small known list of English words like VOID)
> >> and can work around it by writing your own locale-independent casing
> >> functions. Do you still need to find out whether Python unicode
> >> casings are locale-dependent?
>
> > I would still like to know. There are other places where .lower() is used in
> > numpy, not to mention the rest of my code.
>
> "lower" uses the informative case mappings provided by the Unicode
> character database; see
>
> http://www.unicode.org/Public/4.1.0/uc...

of which the relevant part is
"""
Case Mappings

There are a number of complications to case mappings that occur once
the repertoire of characters is expanded beyond ASCII. For more
information, see Chapter 3 in Unicode 4.0.

For compatibility with existing parsers, UnicodeData.txt only contains
case mappings for characters where they are one-to-one mappings; it
also omits information about context-sensitive case mappings.
Information about these special cases can be found in a separate data
file, SpecialCasing.txt.
"""

It seems that Python doesn't use the SpecialCasing.txt file. Effects
include:
(a) one-to-many mappings don't happen e.g. LATIN SMALL LETTER SHARP S:
u'\xdf'.upper() produces u'\xdf' instead of u'SS'
(b) language-sensitive mappings (e.g. dotted/dotless I/i for Turkish
(and Azeri)) don't happen
(c) context-sensitive mappings don't happen e.g. lower case of GREEK
CAPITAL LETTER SIGMA depends on whether it is the last letter in a
word.
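
The three categories above can be probed directly. The sketch below is for Python 3, not the Python 2.5 discussed in this thread; to my understanding, current Python 3 releases apply the one-to-many and context-sensitive mappings but still no language-specific ones, so results differ from the 2.5 behaviour described here. The sample labels are made up for illustration.

```python
import unicodedata

samples = {
    # (a) one-to-many mapping: LATIN SMALL LETTER SHARP S
    "sharp s upper": "\u00df".upper(),
    # (b) language-sensitive mapping: Turkish would want a dotless i here
    "capital I lower": "I".lower(),
    # (c) context-sensitive mapping: word ending in GREEK CAPITAL LETTER SIGMA
    "final sigma lower": "\u039f\u0394\u039f\u03a3".lower(),
}

for label, result in samples.items():
    names = [unicodedata.name(ch) for ch in result]
    print(label, ascii(result), names)
```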

>
> afaik, changing the locale has no influence whatsoever on Python's
> Unicode subsystem.
>
> </F>

John Machin

1/12/2008 9:51:00 PM

On Jan 12, 11:26 pm, Torsten Bronger <bron...@physik.rwth-aachen.de>
wrote:
> Hallöchen!
>
>
>
> Fredrik Lundh writes:
> > Robert Kern wrote:
>
> >>> However it appears from your bug ticket that you have a much
> >>> narrower problem (case-shifting a small known list of English
> >>> words like VOID) and can work around it by writing your own
> >>> locale-independent casing functions. Do you still need to find
> >>> out whether Python unicode casings are locale-dependent?
>
> >> I would still like to know. There are other places where .lower()
> >> is used in numpy, not to mention the rest of my code.
>
> > "lower" uses the informative case mappings provided by the Unicode
> > character database; see
>
> > http://www.unicode.org/Public/4.1.0/uc...
>
> > afaik, changing the locale has no influence whatsoever on Python's
> > Unicode subsystem.
>
> Slightly off-topic because it's not part of the Unicode subsystem,
> but I was once irritated that the non-breaking space (codepoint xa0,
> I think) was included in string.whitespace. I cannot reproduce it
> on my current system anymore, but I was pretty sure it occurred with
> a fr_FR.UTF-8 locale. Is this possible? And who is to blame, or
> must my program cope with such things?

The NO-BREAK SPACE is treated as whitespace in the Python unicode
subsystem. As for str objects, the default "C" locale doesn't know it
exists; otherwise AFAIK if the character set for the locale has it, it
will be treated as whitespace.
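
A small sketch of that distinction, written for Python 3 (an assumption; the thread concerns Python 2, where str was the 8-bit type and its behaviour could follow the locale). Here str plays the role of the unicode type and bytes the role of the 8-bit string.

```python
import string

NBSP = "\u00a0"  # NO-BREAK SPACE

# Unicode text: NBSP counts as whitespace and split() breaks on it.
print(NBSP.isspace())
print(("a" + NBSP + "b").split())

# The string.whitespace constant is ASCII-only in Python 3.
print(NBSP in string.whitespace)

# The Latin-1 byte 0xA0 is not whitespace for bytes methods.
print(NBSP.encode("latin-1").isspace())
```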

You were irritated because non-break SPACE was included in
string.whiteSPACE? Surely not! It seems eminently logical to me.
Perhaps you were irritated because str.split() ignored the "no-break"?
If like me you had been faced with removing trailing spaces from text
columns in databases, you surely would have been delighted that
str.rstrip() removed the trailing-padding-for-nicer-layout no-break
spaces that the users had copy/pasted from some clown's website :-)

What was the *real* cause of your irritation?

Robert Kern

1/12/2008 10:44:00 PM

Fredrik Lundh wrote:
> Robert Kern wrote:
>
>>> However it appears from your bug ticket that you have a much narrower
>>> problem (case-shifting a small known list of English words like VOID)
>>> and can work around it by writing your own locale-independent casing
>>> functions. Do you still need to find out whether Python unicode
>>> casings are locale-dependent?
>> I would still like to know. There are other places where .lower() is used in
>> numpy, not to mention the rest of my code.
>
> "lower" uses the informative case mappings provided by the Unicode
> character database; see
>
> http://www.unicode.org/Public/4.1.0/uc...
>
> afaik, changing the locale has no influence whatsoever on Python's
> Unicode subsystem.

Even if towlower() gets used? I've found an explicit statement that the
conversion it does can be locale-specific:

http://msdn2.microsoft.com/en-us/library/8h1...

Thanks, Fredrik.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Carl Banks

1/12/2008 10:50:00 PM

On Sat, 12 Jan 2008 13:51:18 -0800, John Machin wrote:

> On Jan 12, 11:26 pm, Torsten Bronger <bron...@physik.rwth-aachen.de>
> wrote:
>> Hallöchen!
>>
>>
>>
>> Fredrik Lundh writes:
>> > Robert Kern wrote:
>>
>> >>> However it appears from your bug ticket that you have a much
>> >>> narrower problem (case-shifting a small known list of English words
>> >>> like VOID) and can work around it by writing your own
>> >>> locale-independent casing functions. Do you still need to find out
>> >>> whether Python unicode casings are locale-dependent?
>>
>> >> I would still like to know. There are other places where .lower() is
>> >> used in numpy, not to mention the rest of my code.
>>
>> > "lower" uses the informative case mappings provided by the Unicode
>> > character database; see
>>
>> > http://www.unicode.org/Public/4.1.0/uc...
>>
>> > afaik, changing the locale has no influence whatsoever on Python's
>> > Unicode subsystem.
>>
>> Slightly off-topic because it's not part of the Unicode subsystem, but
>> I was once irritated that the non-breaking space (codepoint xa0, I
>> think) was included in string.whitespace. I cannot reproduce it on
>> my current system anymore, but I was pretty sure it occurred with a
>> fr_FR.UTF-8 locale. Is this possible? And who is to blame, or must my
>> program cope with such things?
>
> The NO-BREAK SPACE is treated as whitespace in the Python unicode
> subsystem. As for str objects, the default "C" locale doesn't know it
> exists; otherwise AFAIK if the character set for the locale has it, it
> will be treated as whitespace.
>
> You were irritated because non-break SPACE was included in
> string.whiteSPACE? Surely not! It seems eminently logical to me.

To me it seems the point of a non-breaking space is to have something
that's printed as whitespace but not treated as it.

> Perhaps
> you were irritated because str.split() ignored the "no-break"? If like
> me you had been faced with removing trailing spaces from text columns in
> databases, you surely would have been delighted that str.rstrip()
> removed the trailing-padding-for-nicer-layout no-break spaces that the
> users had copy/pasted from some clown's website :-)
>
> What was the *real* cause of your irritation?

If you want to use str.split() to split words, you will foil the user who
wants to not break at a certain point.
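
If the no-break intent should be honoured, one possible workaround is to split on every whitespace character except NO-BREAK SPACE. This is a sketch for Python 3; the function name split_words_keep_nbsp is hypothetical and nothing like it is proposed elsewhere in this thread.

```python
import re

# A "word" is a run of characters that are either non-whitespace or
# NO-BREAK SPACE (U+00A0), so NBSP never acts as a separator.
_WORD = re.compile(r"[\S\xa0]+")


def split_words_keep_nbsp(text):
    """Split on whitespace but keep NBSP-joined words together."""
    return _WORD.findall(text)


print(split_words_keep_nbsp("10\u00a0km  per hour"))
```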

Your use of rstrip() is a lot more specialized, if you ask me.

Carl Banks

Martin v. Loewis

1/12/2008 11:32:00 PM

> The Unicode standard says that case mappings are language-dependent.

I think you are misreading it. Section 5.18 of the "Implementation
Guidelines" chapter says (talking about "most environments") "In such cases, the
language-specific mappings *must not* be used." (emphasis also
in the original spec).

Regards,
Martin

Martin v. Loewis

1/12/2008 11:32:00 PM

> Even if towlower() gets used? I've found an explicit statement that the
> conversion it does can be locale-specific:
>
> http://msdn2.microsoft.com/en-us/library/8h1...

Right. However, the build option of Python where that's the case is
deprecated.

Regards,
Martin

comp.lang.python
