Asp Forum - decode Numeric Character References to unicode

William Heymann

2/18/2008 10:20:00 AM

How do I decode a string back to useful unicode that has xml numeric character
references in it?

Things like &#21344;

5 Answers

Duncan Booth

2/18/2008 11:17:00 AM

William Heymann <kosh@aesaeion.com> wrote:

> How do I decode a string back to useful unicode that has xml numeric
> character references in it?
>
> Things like &#21344;
>
Try something like this:

import re
from htmlentitydefs import name2codepoint

name2codepoint = name2codepoint.copy()
name2codepoint['apos']=ord("'")

EntityPattern = re.compile(r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        return match.group(0)

    return EntityPattern.sub(unescape, s.decode(encoding))

Obviously if you really do only want numeric references you can take out
the lines using name2codepoint and simplify the regex.
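
For readers on Python 3, the simplification described above might look roughly like the sketch below. It is not Duncan's code: the names NUMERIC_REF and decode_numeric_refs are made up for illustration, chr() stands in for Python 2's unichr(), and the input is assumed to already be a str rather than encoded bytes.

```python
import re

# Matches decimal (&#21344;) and hexadecimal (&#x5360;) references only.
NUMERIC_REF = re.compile(r'&#(?:(\d+)|x([0-9a-fA-F]+));')

def decode_numeric_refs(text):
    """Replace numeric character references in text with their characters."""
    def unescape(match):
        decimal, hexadecimal = match.groups()
        try:
            if decimal:
                return chr(int(decimal, 10))
            return chr(int(hexadecimal, 16))
        except (ValueError, OverflowError):
            # Number is not a valid Unicode code point: leave it untouched.
            return match.group(0)
    return NUMERIC_REF.sub(unescape, text)
```

Named entities such as an 'o' with umlaut written by name pass through unchanged, since the pattern no longer has a branch for them.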

7stud --

2/18/2008 11:53:00 AM

On Feb 18, 3:20 am, William Heymann <k...@aesaeion.com> wrote:
> How do I decode a string back to useful unicode that has xml numeric character
> references in it?
>
> Things like &#21344;

BeautifulSoup can handle two of the three formats for html entities.
For instance, an 'o' with umlaut can be represented in three different
ways:

&ouml;
&#246;
&#xf6;

BeautifulSoup can convert the first two formats to unicode:

from BeautifulSoup import BeautifulStoneSoup as BSS

my_string = '&#21344;'
soup = BSS(my_string, convertEntities=BSS.XML_ENTITIES)
print soup.contents[0].encode('utf-8')
print soup.contents[0]

--output:---
<some asian looking character>

Traceback (most recent call last):
File "test1.py", line 6, in ?
print soup.contents[0]
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5360' in
position 0: ordinal not in range(128)

The error message shows you the unicode string that BeautifulSoup
produced: u'\u5360'
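
The difference between the two print statements can be reproduced without BeautifulSoup. The snippet below is a small Python 3 sketch (the variable names are illustrative) that assumes the second print failed because the unicode string was implicitly encoded with the ascii codec, as the traceback suggests.

```python
text = '\u5360'  # the character named by &#21344; (21344 == 0x5360)

# Explicitly encoding to UTF-8 works: UTF-8 can represent every code point.
utf8_bytes = text.encode('utf-8')

# Encoding to ASCII fails with the same error as in the traceback.
try:
    text.encode('ascii')
    ascii_failed = False
    reason = None
except UnicodeEncodeError as exc:
    ascii_failed = True
    reason = exc.reason

print(utf8_bytes, ascii_failed, reason)
```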

If that won't work for you, it's not hard to write your own conversion
function to handle all three formats:

1) Create a regex that will match any of the formats
2) Convert the first format using htmlentitydefs.name2codepoint
3) Convert the second format using int(match) and then unichr()
4) Convert the third format using int('0' + match, 16) and then
unichr()
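
The four steps above can be sketched as follows. This is a Python 3 port under stated assumptions rather than 7stud's own code: html.entities replaces the Python 2 htmlentitydefs module, chr() replaces unichr(), and the names ENTITY and decode_entities are hypothetical.

```python
import re
from html.entities import name2codepoint

# Step 1: one regex for the named, decimal and hexadecimal forms.
# The hex group keeps its leading 'x' so that step 4 can prepend '0'.
ENTITY = re.compile(r'&(?:([a-zA-Z][a-zA-Z0-9]*)|#(\d+)|#(x[0-9a-fA-F]+));')

def decode_entities(text):
    def unescape(match):
        name, decimal, hexadecimal = match.groups()
        try:
            if name:
                # Step 2: look the name up; unknown names fall through.
                if name in name2codepoint:
                    return chr(name2codepoint[name])
            elif decimal:
                # Step 3: decimal digits to an int, then to a character.
                return chr(int(decimal))
            else:
                # Step 4: 'xf6' becomes '0xf6', parsed as base 16.
                return chr(int('0' + hexadecimal, 16))
        except (ValueError, OverflowError):
            pass  # number is not a valid Unicode code point
        # Anything unrecognised is returned exactly as it appeared.
        return match.group(0)
    return ENTITY.sub(unescape, text)
```

On Python 3.4 and later, the standard library's html.unescape() covers the same three formats, so a hand-written function like this is mostly useful when you need control over which references get decoded.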

7stud --

2/18/2008 12:00:00 PM

On Feb 18, 4:53 am, 7stud <bbxx789_0...@yahoo.com> wrote:
> On Feb 18, 3:20 am, William Heymann <k...@aesaeion.com> wrote:
>
> > How do I decode a string back to useful unicode that has xml numeric character
> > references in it?
>
> > Things like 占 #which is: &_#21344_; (without the underscores)
>
> BeautifulSoup can handle two of the three formats for html entities.
> For instance, an 'o' with umlaut can be represented in three different
> ways:
>
> &_ouml_;
> ö
> ö
>

lol. It's hard to even make posts about this stuff because html
entities get converted by the forum software. Here are the three
different formats for an 'o with umlaut' with some underscores added
to keep the forum software from rendering the characters:

&_ouml_;
&_#246_;
&_#xf6_;

Duncan Booth

2/18/2008 12:10:00 PM

7stud <bbxx789_05ss@yahoo.com> wrote:

> On Feb 18, 4:53 am, 7stud <bbxx789_0...@yahoo.com> wrote:
>> On Feb 18, 3:20 am, William Heymann <k...@aesaeion.com> wrote:
>>
>> > How do I decode a string back to useful unicode that has xml
>> > numeric character
>> > references in it?
>>
>> > Things like 占 #which is: &_#21344_; (without the
>> > underscores)
>>
>> BeautifulSoup can handle two of the three formats for html entities.
>> For instance, an 'o' with umlaut can be represented in three
>> different ways:
>>
>> &_ouml_;
>> ö
>> ö
>>
>
> lol. It's hard to even make posts about this stuff because html
> entities get converted by the forum software. Here are the three
> different formats for an 'o with umlaut' with some underscores added
> to keep the forum software from rendering the characters:
>
> &_ouml_;
> &_#246_;
> &_#xf6_;

FWIW, your original post was fine, it was just the quoted text in your
followup that was wrong.

I guess that is yet another reason to use a real newsreader or the mailing
list rather than Google Groups.

Ben Finney

2/18/2008 12:31:00 PM

7stud <bbxx789_05ss@yahoo.com> writes:

> For instance, an 'o' with umlaut can be represented in three
> different ways:
>
> '&' followed by 'ouml;'
> '&' followed by '#246;'
> '&' followed by '#xf6;'

The fourth way, of course, is to simply have 'ö' appear directly as a
character in the document, and set the correct character encoding.
(Hint: UTF-8 is an excellent choice for "the correct character
encoding", if you get to choose.)
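
As a rough illustration of why the declared encoding matters, the Python 3 sketch below (variable names are illustrative) encodes the character as UTF-8 and then decodes the same bytes twice, once with the matching encoding and once with a mismatched one. Latin-1 is only an assumed example of a wrong choice.

```python
text = '\u00f6'  # 'o' with umlaut, written directly as a character

raw = text.encode('utf-8')      # two bytes in UTF-8
right = raw.decode('utf-8')     # reader uses the declared encoding
wrong = raw.decode('latin-1')   # reader assumes a different encoding

print(raw, right == text, len(wrong))
```

With the matching encoding the character round-trips intact; with the mismatched one each byte is read as a separate character, so one character turns into two.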

--
\ "With Lisp or Forth, a master programmer has unlimited power |
`\ and expressiveness. With Python, even a regular guy can reach |
_o__) for the stars." -- Raymond Hettinger |
Ben Finney

comp.lang.python
