Asp Forum - ElementTree.fromstring(unicode_html

globophobe

1/26/2008 2:11:00 AM

This is likely an easy problem; however, I couldn't think of
appropriate keywords for Google.

Basically, I have some raw data that needs to be preprocessed before
it is saved to the database, e.g.

In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

I need to turn this into an ElementTree, but some of the data is
Japanese whereas the rest is HTML. This string contains a <br>.

In [2]: e = ET.fromstring('<data>%s</data>' % unicode_html)
In [3]: e.text
Out[3]: u'\u3055\u3080\u3044\uff0f\n\u3064\u3081\u305f\u3044\n'
In [4]: len(e)
Out[4]: 0

How can I decode the Unicode HTML into a string that
ElementTree can understand?

2 Answers

John Machin

1/26/2008 3:09:00 AM

On Jan 26, 1:11 pm, globophobe <globoph...@gmail.com> wrote:
> This is likely an easy problem; however, I couldn't think of
> appropriate keywords for Google.
>
> Basically, I have some raw data that needs to be preprocessed before
> it is saved to the database, e.g.
>
> In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'
>
> I need to turn this into an ElementTree, but some of the data is
> Japanese whereas the rest is HTML. This string contains a <br>.

>>> import unicodedata as ucd
>>> s = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'
>>> [ucd.name(c) if ord(c) >= 128 else c for c in s]
['HIRAGANA LETTER SA', 'HIRAGANA LETTER MU', 'HIRAGANA LETTER I',
 'FULLWIDTH SOLIDUS', u'\r', u'\n', 'HIRAGANA LETTER TU',
 'HIRAGANA LETTER ME', 'HIRAGANA LETTER TA', 'HIRAGANA LETTER I',
 u'\r', u'\n']
>>>

Where in there is the <br>?

Fredrik Lundh

1/27/2008 6:36:00 PM

globophobe wrote:

> In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'
>
> I need to turn this into an ElementTree, but some of the data is
> Japanese whereas the rest is HTML. This string contains a <br>.

Where? <br> is an element, not a character. "\r" and "\n" are
characters, not elements.
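
To make the distinction concrete, here is a minimal sketch comparing the two cases. The sample strings and the names `plain` and `marked` are made up for illustration and are not from the original post. Line endings inside the wrapper stay part of the text, while a literal <br /> in the markup becomes a child element:

```python
import xml.etree.ElementTree as ET

# Case 1: the payload contains only characters, including a CRLF pair.
plain = ET.fromstring('<data>a\r\nb</data>')
print(len(plain))          # count of child elements
print(repr(plain.text))    # the line ending is kept as text (normalized by the parser)

# Case 2: the payload contains an actual <br /> element.
marked = ET.fromstring('<data>a<br />b</data>')
print(len(marked))         # count of child elements
print(marked[0].tag)       # tag name of the first child
print(repr(marked[0].tail))  # text following the child element
```

This matches the session in the question, where len(e) is 0 and the "\r\n" pairs show up in e.text as "\n".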

If you want to build a tree where "\r\n" is replaced with a <br>
element, you can encode the string as UTF-8, use the replace method to
insert the element, and then call fromstring.
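
A rough sketch of that encode-and-replace route, reusing the string from the question. The names `utf8_html` and `markup` are illustrative, and the sketch assumes the data contains no "&" or "<" characters that would need escaping before being spliced into markup:

```python
import xml.etree.ElementTree as ET

unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

# Encode first so the parser receives bytes; with no XML declaration
# present, the parser treats the input as UTF-8.
utf8_html = unicode_html.encode('utf-8')

# Swap each CRLF pair for a literal <br /> element, then wrap and parse.
markup = b'<data>' + utf8_html.replace(b'\r\n', b'<br />') + b'</data>'
elem = ET.fromstring(markup)

print(len(elem))     # number of <br /> children that were inserted
print(elem[0].tag)   # tag name of the first child
```

Note that every "\r\n" becomes an element here, including the trailing one, so this string yields a final <br /> with no text after it. Strip the string first if that is not wanted.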

Alternatively, you can build the tree yourself:

import xml.etree.ElementTree as ET

unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

# one entry per line of text, with the line endings dropped
parts = unicode_html.splitlines()

elem = ET.Element("data")
elem.text = parts[0]
for part in parts[1:]:
    # the text following each <br /> is stored as that element's tail
    ET.SubElement(elem, "br").tail = part

print(ET.tostring(elem))

</F>

comp.lang.python
