comp.lang.python

I'm looking for an HTML cleaner. Example: convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>

Stéphane Klein

3/29/2010 8:12:00 AM

Hi,

I am working on an HTML cleaner.

I export OpenOffice.org documents to HTML.
Next, I would like to clean these exported HTML files:

* remove comments
* remove styles
* remove dispensable tags
* ...

Some difficulties:

* convert <p>my text <span>foo</span> bar</p> => <p>my text foo bar</p>
* convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>

To do this processing, I use lxml and pyquery.

Question:

* Are there any XML helper tools in Python for this kind of processing?
I looked on PyPI and found nothing.

If you confirm that such tools don't exist, I may publish a helper
package to do this "clean" processing.

Thanks for your help,
Stephane

1 Answer

Harishankar

3/29/2010 9:10:00 AM


On Mon, 29 Mar 2010 10:12:09 +0200, Stéphane Klein wrote:

> Hi,
>
> I am working on an HTML cleaner.
>
> I export OpenOffice.org documents to HTML. Next, I would like to clean
> these exported HTML files:
>
> * remove comments
> * remove styles
> * remove dispensable tags
> * ...
>
> Some difficulties:
>
> * convert <p>my text <span>foo</span> bar</p> => <p>my text foo bar</p>
> * convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>
>
> To do this processing, I use lxml and pyquery.
>
> Question:
>
> * Are there any XML helper tools in Python for this kind of processing?
> I looked on PyPI and found nothing.
>
> If you confirm that such tools don't exist, I may publish a helper
> package to do this "clean" processing.
>
> Thanks for your help,
> Stephane


Take a look at htmllib and HTMLParser (two different modules) in the
Python standard library.

In Python 3.x the corresponding module is called html.parser.

You can use these to pick specific tags out of HTML documents. If you
want something more advanced, consider using XML tools.
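
As a rough illustration of the html.parser route, here is a minimal sketch, assuming Python 3. The names HtmlCleaner, clean_html, UNWRAP_TAGS, DROP_TAGS and DROP_ATTRS are made up for this example, and which tags count as "dispensable" is an assumption, not something stated in the question. It re-emits the document while skipping comments, <style> blocks and wrapper tags such as <span> and <font>:

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical cleaning policy (an assumption, adjust to your documents).
UNWRAP_TAGS = {"span", "font"}   # drop the tag itself, keep its text
DROP_TAGS = {"style"}            # drop the tag and everything inside it
DROP_ATTRS = {"style", "class"}  # presentational attributes to remove


class HtmlCleaner(HTMLParser):
    """Re-serialize HTML, leaving out comments, styles and wrapper tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the cleaned document
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag not in UNWRAP_TAGS:
            kept = "".join(
                f" {name}" if value is None else f' {name}="{escape(value)}"'
                for name, value in attrs
                if name not in DROP_ATTRS
            )
            self.out.append(f"<{tag}{kept}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, with no end tag.
        if tag not in DROP_TAGS:
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag not in UNWRAP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references arrive decoded, so escape them again.
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))

    # handle_comment is not overridden: the default ignores comments,
    # which is exactly the "remove comment" step.


def clean_html(markup):
    parser = HtmlCleaner()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


print(clean_html("<p>my text <span>foo</span> bar</p>"))
print(clean_html("<h1><span><font>my title</font></span></h1>"))
```

This only filters the event stream; it does not repair mismatched tags, and doctype declarations are silently dropped, so treat it as a starting point rather than a full cleaner.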





--
Harishankar (http://haris... http://literary...)