Asp Forum - How to convert markup text to plain text in python?

geoffbache

2/1/2008 4:08:00 PM

I have some marked up text and would like to convert it to plain text,
by simply removing all the tags. Of course I can do it from first
principles but I felt that among all Python's markup tools there must
be something that would do this simply, without having to create an
XML parser etc.

I've looked around a bit but failed to find anything, any tips?

(e.g. convert "Today is Friday" to "Today is Friday")

Regards,
Geoff

8 Answers

Tim Chase

2/1/2008 4:27:00 PM

> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
>
> I've looked around a bit but failed to find anything, any tips?
>
> (e.g. convert "Today is Friday" to "Today is Friday")

Well, if all you want to do is remove everything from a "<" to a
">", you can use

>>> s = "Today is Friday"
>>> import re
>>> r = re.compile('<[^>]*>')
>>> print r.sub('', s)
Today is Friday

it should even work for semi-pathological cases such as

s = """You can find my <a
href='http://example.com'>th...
> online"""

where the tag contents are split across lines. There are more
pathological cases where tags aren't well-formed, e.g.

s ="This <tag>has a > sign in it and <odd<ly>-nested> tags"

in which case you get what you deserve for making such
pathological conditions ;-)

-tkc
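
For readers on Python 3, the same regex idea can be wrapped in a small function. This is only a sketch of the approach described in this post: the name strip_tags and the sample strings (including the <b> tag) are assumptions made for illustration, and it keeps the same limitation with malformed tags.

```python
import re

# Anything from "<" up to the next ">" counts as a tag. [^>] also matches
# newlines, so tags that are split across lines are removed too.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(text):
    """Best-effort tag removal (hypothetical helper, not a real parser)."""
    return TAG_RE.sub("", text)


simple = "Today is <b>Friday</b>"
multiline = "You can find my <a\nhref='http://example.com'>page</a\n> online"

print(strip_tags(simple))     # Today is Friday
print(strip_tags(multiline))  # You can find my page online
```

On the malformed example from the post, the stray ">" characters survive: the result is "This has a > sign in it and -nested> tags".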

ph

2/1/2008 4:34:00 PM

On 01-Feb-2008, geoffbache wrote:
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
>
> I've looked around a bit but failed to find anything, any tips?
>
> (e.g. convert "Today is Friday" to "Today is Friday")

Quick but very dirty way:

import urllib
data = urllib.urlopen('http://googl...').read()
# split on "<", then drop everything up to the first ">" in each piece
data = ''.join([x.split('>', 1)[-1] for x in data.split('<')])
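
To see what the split trick does without fetching a page, here is a rough Python 3 sketch applied to a plain string. The function name strip_tags_by_split and the sample input are assumptions made for illustration, not part of the original post.

```python
def strip_tags_by_split(text):
    """Hypothetical helper: drop everything between "<" and the next ">"."""
    pieces = text.split("<")
    # Each piece that followed a "<" looks like "tag>text", so keep only
    # what comes after its first ">". A piece without ">" is kept whole.
    return "".join(piece.split(">", 1)[-1] for piece in pieces)


print(strip_tags_by_split("Today is <b>Friday</b>"))  # Today is Friday
```

This is where the "very dirty" part shows: a bare ">" in ordinary text makes the function discard whatever precedes it in that piece, so "1 > 0 <i>ok</i>" comes out as " 0 ok".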

Steve Holden

2/1/2008 4:44:00 PM

Tim Chase wrote:
>> I have some marked up text and would like to convert it to plain text,
>> by simply removing all the tags. Of course I can do it from first
>> principles but I felt that among all Python's markup tools there must
>> be something that would do this simply, without having to create an
>> XML parser etc.
>>
>> I've looked around a bit but failed to find anything, any tips?
>>
>> (e.g. convert "Today is Friday" to "Today is Friday")
>
>
> Well, if all you want to do is remove everything from a "<" to a
> ">", you can use
>
> >>> s = "Today is Friday"
> >>> import re
> >>> r = re.compile('<[^>]*>')
> >>> print r.sub('', s)
> Today is Friday
>
> it should even work for semi-pathological cases such as
>
> s = """You can find my <a
> href='http://example.com'>th...
> > online"""
>
> where the tag contents are split across lines. There are more
> pathological cases where tags aren't well-formed, e.g.
>
> s ="This <tag>has a > sign in it and <odd<ly>-nested> tags"
>
> in which case you get what you deserve for making such
> pathological conditions ;-)
>
The real answer to this question is "learn how to use Beautiful Soup" --
see http://www.crummy.com/software/Beau...

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.hold...

Tim Chase

2/1/2008 4:54:00 PM

>> Well, if all you want to do is remove everything from a "<" to a
>> ">", you can use
>>
>> >>> s = "Today is Friday"
>> >>> import re
>> >>> r = re.compile('<[^>]*>')
>> >>> print r.sub('', s)
>> Today is Friday
>>
[Tim's ramblings about pathological cases snipped]
>
> The real answer to this question is "learn how to use Beautiful Soup" --
> see http://www.crummy.com/software/Beau...

Yes, for more pathological cases, BS does a great job of parsing
junk :)

However, as BS isn't batteries-included [Aside: BS and pyparsing
are two common solutions to problems that would make great
additions to the standard library], using a RE to make a
best-effort guess is a good first approximation of a solution
without needing to download extra packages--no matter how useful
those extra packages may be.

-tkc
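
Another batteries-included option, one step up from a regex, is the standard library's HTML parser (html.parser in Python 3, the HTMLParser module in Python 2). The following is a minimal Python 3 sketch; the class name TextExtractor and the function html_to_text are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(html_to_text("Today is <b>Friday</b> &amp; sunny"))  # Today is Friday & sunny
```

Note that this sketch also returns the contents of script and style elements as text, so it is still a best-effort approach rather than a full renderer.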

Paul McGuire

2/1/2008 5:21:00 PM

On Feb 1, 10:54 am, Tim Chase <python.l...@tim.thechases.com> wrote:
> >> Well, if all you want to do is remove everything from a "<" to a
> >> ">", you can use
>
> >> >>> s = "Today is Friday"
> >> >>> import re
> >> >>> r = re.compile('<[^>]*>')
> >> >>> print r.sub('', s)
> >> Today is Friday
>
> [Tim's ramblings about pathological cases snipped]

pyparsing includes an example script for stripping tags from HTML
source. See it on the wiki at http://pyparsing.wikispaces.com/space/showimage/htmlS....

-- Paul

Zentrader

2/2/2008 4:44:00 PM

On Feb 1, 8:07 am, geoffbache <geoff.ba...@pobox.com> wrote:
> I have some marked up text and would like to convert it to plain text,

If this is just a quick and dirty problem, you can also use one of the
lynx/elinks/links2 browsers and dump the contents to a file. On Linux
it would be

lynx -dump http... > text.txt

Lynx is also available for MS Windows, but I am not sure about the other
two.
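
If the dump is needed from inside a Python 3 program rather than a shell, the same command can be run through subprocess. This is a sketch under the assumption that a lynx executable is installed and on PATH; the helper names lynx_dump_command and dump_page are hypothetical.

```python
import subprocess


def lynx_dump_command(url):
    """Build the argument list for `lynx -dump URL` (hypothetical helper)."""
    return ["lynx", "-dump", url]


def dump_page(url):
    """Run lynx and return the page rendered as plain text.

    Assumes lynx is installed; raises FileNotFoundError if it is missing
    and CalledProcessError if lynx exits with an error.
    """
    result = subprocess.run(
        lynx_dump_command(url), capture_output=True, text=True, check=True
    )
    return result.stdout


print(lynx_dump_command("http://example.com"))
```

Passing the arguments as a list avoids shell quoting problems with unusual URLs.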

Stefan Behnel

2/3/2008 5:35:00 PM

geoffbache wrote:
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
>
> I've looked around a bit but failed to find anything, any tips?
>
> (e.g. convert "Today is Friday" to "Today is Friday")

>>> import lxml.etree as et
>>> doc = et.HTML("Today is Friday")
>>> et.tostring(doc, method='text', encoding=unicode)
u'Today is Friday'

http://codespea...

Stefan

Stefan Behnel

2/11/2008 10:03:00 AM

geoffbache wrote:
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
>
> I've looked around a bit but failed to find anything, any tips?
>
> (e.g. convert "Today is Friday" to "Today is Friday")

This might be of interest:

http://pypi.python.org/pypi/hau...

Stefan

comp.lang.python
