7stud --
2/18/2008 11:53:00 AM
On Feb 18, 3:20 am, William Heymann <k...@aesaeion.com> wrote:
> How do I decode a string back to useful unicode that has xml numeric character
> references in it?
>
> Things like &#21344;
BeautifulSoup can handle two of the three formats for html entities.
For instance, an 'o' with umlaut can be represented in three different
ways:
&ouml;
&#246;
&#xf6;
BeautifulSoup can convert the first two formats to unicode:
from BeautifulSoup import BeautifulStoneSoup as BSS
my_string = '&#21344;'
soup = BSS(my_string, convertEntities=BSS.XML_ENTITIES)
print soup.contents[0].encode('utf-8')
print soup.contents[0]
--output:---
<some asian looking character>
Traceback (most recent call last):
  File "test1.py", line 6, in ?
    print soup.contents[0]
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5360' in position 0: ordinal not in range(128)
The error message shows you the unicode string that BeautifulSoup
produced: u'\u5360'
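The two print lines behave differently because only the first one encodes the string explicitly. Here is a minimal sketch of the same effect, assuming Python 3 (the post itself uses Python 2, where a bare print of a unicode string falls back to the ascii codec). The names s, ascii_error and utf8_bytes are only for illustration:

```python
s = '\u5360'  # the code point shown in the traceback

# The ascii codec only covers code points 0-127, so this raises.
try:
    s.encode('ascii')
except UnicodeEncodeError as err:
    ascii_error = str(err)

# UTF-8 can represent any code point; U+5360 takes three bytes.
utf8_bytes = s.encode('utf-8')

print(ascii_error)
print(utf8_bytes)
```

So the conversion itself succeeded. The traceback only means the result could not be written out as ascii.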
If BeautifulSoup won't work for you, it's not hard to write your own conversion
function to handle all three formats:
1) Create a regex that will match any of the formats
2) Convert the first format using htmlentitydefs.name2codepoint and then
unichr()
3) Convert the second format using int(match) and then unichr()
4) Convert the third format using int('0' + match, 16) and then
unichr()
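
The four steps above can be sketched roughly as follows. This is only one way to do it, and it assumes Python 3, where htmlentitydefs is named html.entities and chr() does the job of Python 2's unichr(). The names ENTITY_RE and decode_entities are made up for the example. The regex captures the hex digits without the leading 'x', so int(digits, 16) is used directly instead of prepending '0'.

```python
import re
from html.entities import name2codepoint  # Python 2: htmlentitydefs

# Step 1: one regex for all three formats: &#246;  &#xf6;  &ouml;
ENTITY_RE = re.compile(r'&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(\w+));')


def decode_entities(text):
    """Replace named, decimal and hex references with their characters."""
    def replace(match):
        decimal, hexadecimal, name = match.groups()
        if decimal is not None:
            # Decimal reference (the second format in the post).
            return chr(int(decimal))
        if hexadecimal is not None:
            # Hex reference (the third format in the post).
            return chr(int(hexadecimal, 16))
        if name in name2codepoint:
            # Named entity (the first format in the post).
            return chr(name2codepoint[name])
        # Unknown name: leave the original text untouched.
        return match.group(0)

    return ENTITY_RE.sub(replace, text)


print(decode_entities('&ouml; &#246; &#xf6;'))
```

There is no range checking here, so a numeric reference beyond the Unicode range would raise ValueError from chr().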