Excess whitespace in my soup

John Machin

1/19/2008 11:39:00 AM

I'm trying to recover the original data from some HTML written by a
well-known application.

Here are three original data items, in Python repr() format, with
spaces changed to tildes for clarity:

u'Saturday,~19~January~2008'
u'Line1\nLine2\nLine3'
u'foonly~frabjous\xa0farnarklingliness'

Here is the HTML, with spaces changed to tildes, angle brackets
changed to square brackets, \r\n omitted from the end of each line,
and a large number of attributes stripped from the [td] tags.

~~[td]Saturday,~19
~~January~2008[/td]
~~[td]Line1[br]
~~~~Line2[br]
~~~~Line3[/td]
~~[td]foonly
~~frabjous&nbsp;farnarklingliness[/td]

Here are the results of feeding it to ElementSoup:

>>> import ElementSoup as ES
>>> elem = ES.parse('ws_soup1.htm')
>>> from pprint import pprint as pp
>>> pp([(e.tag, e.text, e.tail) for e in elem.getiterator()])
[snip]
(u'td', u'Saturday, 19\n January 2008', u'\n'),
(u'td', u'Line1', u'\n'),
(u'br', None, u'\n Line2'),
(u'br', None, u'\n Line3'),
(u'td', u'foonly\n frabjous\xa0farnarklingliness', u'\n')]

I'm happy enough with reassembling the second item. The problem is in
reliably and correctly collapsing the whitespace in each of the above
five elements. The standard Python idiom of u' '.join(text.split())
won't work because the text is Unicode and u'\xa0' is whitespace
and would be converted to a space.
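To make the problem concrete, here is a minimal sketch of the idiom failing on the third item. It assumes Python 3, where every str is Unicode; the sample string is taken from the ElementSoup output above.

```python
# Third item as returned by the parser (text of the last td element).
text = u'foonly\n frabjous\xa0farnarklingliness'

# str.split() with no argument splits on every Unicode whitespace
# character, and U+00A0 (no-break space) counts as one of them.
collapsed = u' '.join(text.split())

# The no-break space has been lost: it is now an ordinary space.
print(repr(collapsed))
```

The newline-plus-space run is collapsed as wanted, but the no-break space between the last two words is replaced as well, which is exactly what must not happen.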

Should whitespace collapsing be done earlier? Note that BeautifulSoup
leaves it as &nbsp; -- ES does the conversion to \xa0 ...

Does anyone know of an html_collapse_whitespace() for Python? Am I
missing something obvious?

Thanks in advance,
John

5 Answers

Fredrik Lundh

1/19/2008 12:01:00 PM

John Machin wrote:

> I'm happy enough with reassembling the second item. The problem is in
> reliably and correctly collapsing the whitespace in each of the above
> five elements. The standard Python idiom of u' '.join(text.split())
> won't work because the text is Unicode and u'\xa0' is whitespace
> and would be converted to a space.

would this (or some variation of it) work?

>>> re.sub("[ \n\r\t]+", " ", u"foo\n frab\xa0farn")
u'foo frab\xa0farn'

</F>

John Machin

1/19/2008 12:20:00 PM

On Jan 19, 11:00 pm, Fredrik Lundh <fred...@pythonware.com> wrote:
> John Machin wrote:
> > I'm happy enough with reassembling the second item. The problem is in
> > reliably and correctly collapsing the whitespace in each of the above
> > five elements. The standard Python idiom of u' '.join(text.split())
> > won't work because the text is Unicode and u'\xa0' is whitespace
> > and would be converted to a space.
>
> would this (or some variation of it) work?
>
> >>> re.sub("[ \n\r\t]+", " ", u"foo\n frab\xa0farn")
> u'foo frab\xa0farn'
>
> </F>

Yes, partially. Leading and trailing whitespace has to be removed
entirely, not replaced by one space.

Cheers,
John
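A short sketch of the "partially" above, assuming Python 3 and using the tail of the first [br] element from the ElementSoup output as the sample:

```python
import re

# Tail of the first br element: a newline, indentation, then the data.
tail = u'\n Line2'

# The suggested substitution collapses each whitespace run to one space,
# including the run at the very start of the string.
collapsed = re.sub(u'[ \n\r\t]+', u' ', tail)

print(repr(collapsed))
```

The leading run is reduced to a single space rather than removed, so the result still differs from the original data item u'Line2'.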

Stefan Behnel

1/19/2008 1:35:00 PM

John Machin wrote:
> On Jan 19, 11:00 pm, Fredrik Lundh <fred...@pythonware.com> wrote:
>> John Machin wrote:
>>> I'm happy enough with reassembling the second item. The problem is in
>>> reliably and correctly collapsing the whitespace in each of the above
>>> five elements. The standard Python idiom of u' '.join(text.split())
>>> won't work because the text is Unicode and u'\xa0' is whitespace
>>> and would be converted to a space.
>>
>> would this (or some variation of it) work?
>>
>> >>> re.sub("[ \n\r\t]+", " ", u"foo\n frab\xa0farn")
>> u'foo frab\xa0farn'
>>
>> </F>
>
> Yes, partially. Leading and trailing whitespace has to be removed
> entirely, not replaced by one space.

Sounds like adding a .strip() to me ...

Stefan

John Machin

1/20/2008 10:38:00 AM

Stefan Behnel wrote:
> John Machin wrote:
>> On Jan 19, 11:00 pm, Fredrik Lundh <fred...@pythonware.com> wrote:
>>> John Machin wrote:
>>>> I'm happy enough with reassembling the second item. The problem is in
>>>> reliably and correctly collapsing the whitespace in each of the above
>>>> five elements. The standard Python idiom of u' '.join(text.split())
>>>> won't work because the text is Unicode and u'\xa0' is whitespace
>>>> and would be converted to a space.
>>>
>>> would this (or some variation of it) work?
>>>
>>> >>> re.sub("[ \n\r\t]+", " ", u"foo\n frab\xa0farn")
>>> u'foo frab\xa0farn'
>>>
>>> </F>
>>
>> Yes, partially. Leading and trailing whitespace has to be removed
>> entirely, not replaced by one space.
>
> Sounds like adding a .strip() to me ...

Sounds like adding a .strip(u' ') to me, otherwise any leading/trailing
u'\xa0' gets blown away and this must not happen.
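Putting the two suggestions together gives something like the sketch below. It assumes Python 3. The name html_collapse_whitespace is borrowed from the original question and is hypothetical, not an existing library function; the character class is the one from Fredrik's pattern, and treating None as an empty string is an added assumption to suit ElementTree-style text and tail values.

```python
import re

# Whitespace as in the pattern suggested earlier: space, LF, CR, tab.
# U+00A0 is deliberately absent, so it is never collapsed or stripped.
_HTML_WS = re.compile(u'[ \n\r\t]+')


def html_collapse_whitespace(text):
    """Collapse runs of HTML whitespace to one space and trim the ends.

    Hypothetical helper: None (a missing .text or .tail) becomes u''.
    """
    if text is None:
        return u''
    # After the substitution only single spaces can remain at the ends,
    # so strip(u' ') is enough and leaves any U+00A0 in place.
    return _HTML_WS.sub(u' ', text).strip(u' ')


print(repr(html_collapse_whitespace(u'Saturday, 19\n January 2008')))
print(repr(html_collapse_whitespace(u'\n Line2')))
print(repr(html_collapse_whitespace(u'\xa0edge\xa0 \n')))
```

A bare .strip() in the last line of the function would also remove the leading and trailing u'\xa0' in the third call, which is the case the reply above warns about.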

John Machin

1/20/2008 10:48:00 AM

Remco Gerlich wrote:
> Not sure if this is sufficient for what you need, but how about
>
> import re
> re.sub(u'[\s\xa0]+', ' ', s)
>
> That should replace all occurrences of 1 or more whitespace or \xa0
> characters by a single space.
>
It does indeed, and so does
re.sub(u'\s+', ' ', s)
because u'\xa0' *IS* whitespace in the Python unicode world, but it's
not whitespace in the HTML sense and it must be preserved.

Cheers,
John
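The difference between the two notions of whitespace can be checked directly. This is a sketch under the assumption of Python 3, where \s in a str pattern matches Unicode whitespace by default and the re.ASCII flag restricts it to [ \t\n\r\f\v]:

```python
import re

s = u'foo\n frab\xa0farn'

# Default for str patterns: \s covers Unicode whitespace, U+00A0 included,
# so the no-break space is replaced along with the newline and space.
with_s = re.sub(r'\s+', u' ', s)

# An explicit character class only touches the characters listed.
explicit = re.sub(u'[ \n\r\t]+', u' ', s)

# re.ASCII narrows \s to ASCII whitespace, which also spares U+00A0.
ascii_s = re.sub(r'\s+', u' ', s, flags=re.ASCII)

print(repr(with_s))
print(repr(explicit))
print(repr(ascii_s))
```

Only the first variant loses the no-break space; the explicit class and the re.ASCII variant both keep it.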

comp.lang.python
