Asp Forum - HTML parsing confusion

Alnilam

1/22/2008 5:32:00 AM

Sorry for the noob question, but I've gone through the documentation
on python.org, tried some of the diveintopython and boddie's examples,
and looked through some of the numerous posts in this group on the
subject and I'm still rather confused. I know that there are some
great tools out there for doing this (BeautifulSoup, lxml, etc.) but I
am trying to accomplish a simple task with a minimal (as in nil)
amount of adding in modules that aren't "stock" 2.5, and writing a
huge class of my own (or copying one from diveintopython) seems
overkill for what I want to do.

Here's what I want to accomplish... I want to open a page, identify a
specific point in the page, and turn the information there into
plaintext. For example, on the www.diveintopython.org page, I want to
turn the paragraph that starts "Translations are freely
permitted" (and ends ..."let me know"), into a string variable.

Opening the file seems pretty straightforward.

>>> import urllib
>>> page = urllib.urlopen("http://diveintopython....)
>>> source = page.read()
>>> page.close()

gets me to a string variable consisting of the un-parsed contents of
the page.
Now things get confusing, though, since there appear to be several
approaches.
One that I read somewhere was:

>>> from xml.dom.ext.reader import HtmlLib
>>> reader = HtmlLib.Reader()
>>> doc = reader.fromString(source)

This gets me doc as <HTML Document at 9b4758>

>>> paragraphs = doc.getElementsByTagName('p')

gets me all of the paragraph children, and the one I specifically want
can then be referenced with: paragraphs[5] This method seems to be
pretty straightforward, but what do I do with it to get it into a
string cleanly?

>>> from xml.dom.ext import PrettyPrint
>>> PrettyPrint(paragraphs[5])

shows me the text, but still in html, and I can't seem to get it to
turn into a string variable, and I think the PrettyPrint function is
unnecessary for what I want to do. Formatter seems to do what I want,
but I can't figure out how to link the "Element Node" at
paragraphs[5] with the formatter functions to produce the string I
want as output. I tried some of the htmllib.HTMLParser(formatter
stuff) examples, but while I can supposedly get that to work with
formatter a little easier, I can't figure out how to get HTMLParser to
drill down specifically to the 6th paragraph's contents.

Thanks in advance.

- A.

15 Answers

John Machin

1/22/2008 9:34:00 AM

On Jan 22, 4:31 pm, Alnilam <alni...@gmail.com> wrote:
> Sorry for the noob question, but I've gone through the documentation
> on python.org, tried some of the diveintopython and boddie's examples,
> and looked through some of the numerous posts in this group on the
> subject and I'm still rather confused. I know that there are some
> great tools out there for doing this (BeautifulSoup, lxml, etc.) but I
> am trying to accomplish a simple task with a minimal (as in nil)
> amount of adding in modules that aren't "stock" 2.5, and writing a
> huge class of my own (or copying one from diveintopython) seems
> overkill for what I want to do.
>
> Here's what I want to accomplish... I want to open a page, identify a
> specific point in the page, and turn the information there into
> plaintext. For example, on thewww.diveintopython.orgpage, I want to
> turn the paragraph that starts "Translations are freely
> permitted" (and ends ..."let me know"), into a string variable.
>
> Opening the file seems pretty straightforward.
>
> >>> import urllib
> >>> page = urllib.urlopen("http://diveintopython....)
> >>> source = page.read()
> >>> page.close()
>
> gets me to a string variable consisting of the un-parsed contents of
> the page.
> Now things get confusing, though, since there appear to be several
> approaches.
> One that I read somewhere was:
>
> >>> from xml.dom.ext.reader import HtmlLib

Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
-1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
200-modules PyXML package installed. And you don't want the 75Kb
BeautifulSoup?

Paul Boddie

1/22/2008 12:37:00 PM

On 22 Jan, 06:31, Alnilam <alni...@gmail.com> wrote:
> Sorry for the noob question, but I've gone through the documentation
> on python.org, tried some of the diveintopython and boddie's examples,
> and looked through some of the numerous posts in this group on the
> subject and I'm still rather confused. I know that there are some
> great tools out there for doing this (BeautifulSoup, lxml, etc.) but I
> am trying to accomplish a simple task with a minimal (as in nil)
> amount of adding in modules that aren't "stock" 2.5, and writing a
> huge class of my own (or copying one from diveintopython) seems
> overkill for what I want to do.

It's unfortunate that you don't want to install extra modules, but I'd
probably use libxml2dom [1] for what you're about to describe...

> Here's what I want to accomplish... I want to open a page, identify a
> specific point in the page, and turn the information there into
> plaintext. For example, on thewww.diveintopython.orgpage, I want to
> turn the paragraph that starts "Translations are freely
> permitted" (and ends ..."let me know"), into a string variable.
>
> Opening the file seems pretty straightforward.
>
> >>> import urllib
> >>> page = urllib.urlopen("http://diveintopython....)
> >>> source = page.read()
> >>> page.close()
>
> gets me to a string variable consisting of the un-parsed contents of
> the page.

Yes, there may be shortcuts that let some parsers read directly from
the server, but it's always good to have the page text around, anyway.

> Now things get confusing, though, since there appear to be several
> approaches.
> One that I read somewhere was:
>
> >>> from xml.dom.ext.reader import HtmlLib
> >>> reader = HtmlLib.Reader()
> >>> doc = reader.fromString(source)
>
> This gets me doc as <HTML Document at 9b4758>
>
> >>> paragraphs = doc.getElementsByTagName('p')
>
> gets me all of the paragraph children, and the one I specifically want
> can then be referenced with: paragraphs[5] This method seems to be
> pretty straightforward, but what do I do with it to get it into a
> string cleanly?

In less sophisticated DOM implementations, what you'd do is to loop
over the "descendant" nodes of the paragraph which are text nodes and
concatenate them.

> >>> from xml.dom.ext import PrettyPrint
> >>> PrettyPrint(paragraphs[5])
>
> shows me the text, but still in html, and I can't seem to get it to
> turn into a string variable, and I think the PrettyPrint function is
> unnecessary for what I want to do.

Yes, PrettyPrint is for prettyprinting XML. You just want to visit and
collect the text nodes.

> Formatter seems to do what I want,
> but I can't figure out how to link the "Element Node" at
> paragraphs[5] with the formatter functions to produce the string I
> want as output. I tried some of the htmllib.HTMLParser(formatter
> stuff) examples, but while I can supposedly get that to work with
> formatter a little easier, I can't figure out how to get HTMLParser to
> drill down specifically to the 6th paragraph's contents.

Given that you've found the paragraph above, you just need to write a
recursive function which visits child nodes, and if it finds a text
node then it collects the value of the node in a list; otherwise, for
elements, it visits the child nodes of that element; and so on. The
recursive approach is presumably what the formatter uses, but I can't
say that I've really looked at it.

Meanwhile, with libxml2dom, you'd do something like this:

import libxml2dom
d = libxml2dom.parseURI("http://www.diveintopython...., html=1)
saved = None

# Find the paragraphs.
for p in d.xpath("//p"):

# Get the text without leading and trailing space.
text = p.textContent.strip()

# Save the appropriate paragraph text.
if text.startswith("Translations are freely permitted") and text.endswith("just let me know."):

saved = text
break

The magic part of this code which saves you from needing to write that
recursive function mentioned above is the textContent property on the
paragraph element.

Paul

[1] http://www.python.org/pypi/...

Alnilam

1/22/2008 1:45:00 PM

> Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
> -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
> 200-modules PyXML package installed. And you don't want the 75Kb
> BeautifulSoup?

I wasn't aware that I had PyXML installed, and can't find a reference
to having it installed in pydocs. And that highlights the problem I
have at the moment with using other modules. I move from computer to
computer regularly, and while all have a recent copy of Python, each
has different (or no) extra modules, and I don't always have the
luxury of downloading extras. That being said, if there's a simple way
of doing it with BeautifulSoup, please show me an example. Maybe I can
figure out a way to carry the extra modules I need around with me.

Paul McGuire

1/22/2008 2:31:00 PM

On Jan 22, 7:44 am, Alnilam <alni...@gmail.com> wrote:
> ...I move from computer to
> computer regularly, and while all have a recent copy of Python, each
> has different (or no) extra modules, and I don't always have the
> luxury of downloading extras. That being said, if there's a simple way
> of doing it with BeautifulSoup, please show me an example. Maybe I can
> figure out a way to carry the extra modules I need around with me.

Pyparsing's footprint is intentionally small - just one pyparsing.py
file that you can drop into a directory next to your own script. And
the code to extract paragraph 5 of the "Dive Into Python" home page?
See annotated code below.

-- Paul

from pyparsing import makeHTMLTags, SkipTo, anyOpenTag, anyCloseTag
import urllib
import textwrap

page = urllib.urlopen("http://diveintopython....)
source = page.read()
page.close()

# define a simple paragraph matcher
pStart,pEnd = makeHTMLTags("P")
paragraph = pStart.suppress() + SkipTo(pEnd) + pEnd.suppress()

# get all paragraphs from the input string (or use the
# scanString generator function to stop at the correct
# paragraph instead of reading them all)
paragraphs = paragraph.searchString(source)

# create a transformer that will strip HTML tags
tagStripper = anyOpenTag.suppress() | anyCloseTag.suppress()

# get paragraph[5] and strip the HTML tags
p5TextOnly = tagStripper.transformString(paragraphs[5][0])

# remove extra whitespace
p5TextOnly = " ".join(p5TextOnly.split())

# print out a nicely wrapped string - so few people know
# that textwrap is part of the standard Python distribution,
# but it is very handy
print textwrap.fill(p5TextOnly, 60)

Alnilam

1/22/2008 4:02:00 PM

On Jan 22, 8:44 am, Alnilam <alni...@gmail.com> wrote:
> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
> > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
> > 200-modules PyXML package installed. And you don't want the 75Kb
> > BeautifulSoup?
>
> I wasn't aware that I had PyXML installed, and can't find a reference
> to having it installed in pydocs. ...

Ugh. Found it. Sorry about that, but I still don't understand why
there isn't a simple way to do this without using PyXML, BeautifulSoup
or libxml2dom. What's the point in having sgmllib, htmllib,
HTMLParser, and formatter all built in if I have to use use someone
else's modules to write a couple of lines of code that achieve the
simple thing I want. I get the feeling that this would be easier if I
just broke down and wrote a couple of regular expressions, but it
hardly seems a 'pythonic' way of going about things.

# get the source (assuming you don't have it locally and have an
internet connection)
>>> import urllib
>>> page = urllib.urlopen("http://diveintopython....)
>>> source = page.read()
>>> page.close()

# set up some regex to find tags, strip them out, and correct some
formatting oddities
>>> import re
>>> p = re.compile(r'(<p.*?>.*?</p>)',re.DOTALL)
>>> tag_strip = re.compile(r'>(.*?)<',re.DOTALL)
>>> fix_format = re.compile(r'\n +',re.MULTILINE)

# achieve clean results.
>>> paragraphs = re.findall(p,source)
>>> text_list = re.findall(tag_strip,paragraphs[5])
>>> text = "".join(text_list)
>>> clean_text = re.sub(fix_format," ",text)

This works, and is small and easily reproduced, but seems like it
would break easily and seems a waste of other *ML specific parsers.

Diez B. Roggisch

1/22/2008 4:40:00 PM

Alnilam wrote:

> On Jan 22, 8:44 am, Alnilam <alni...@gmail.com> wrote:
>> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
>> > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
>> > 200-modules PyXML package installed. And you don't want the 75Kb
>> > BeautifulSoup?
>>
>> I wasn't aware that I had PyXML installed, and can't find a reference
>> to having it installed in pydocs. ...
>
> Ugh. Found it. Sorry about that, but I still don't understand why
> there isn't a simple way to do this without using PyXML, BeautifulSoup
> or libxml2dom. What's the point in having sgmllib, htmllib,
> HTMLParser, and formatter all built in if I have to use use someone
> else's modules to write a couple of lines of code that achieve the
> simple thing I want. I get the feeling that this would be easier if I
> just broke down and wrote a couple of regular expressions, but it
> hardly seems a 'pythonic' way of going about things.

This is simply a gross misunderstanding of what BeautifulSoup or lxml
accomplish. Dealing with mal-formatted HTML whilst trying to make _some_
sense is by no means trivial. And just because you can come up with a few
lines of code using rexes that work for your current use-case doesn't mean
that they serve as general html-fixing-routine. Or do you think the rather
long history and 75Kb of code for BS are because it's creator wasn't aware
of rexes?

And it also makes no sense stuffing everything remotely useful into the
standard lib. This would force to align development and release cycles,
resulting in much less features and stability as it can be wished.

And to be honest: I fail to see where your problem is. BeatifulSoup is a
single Python file. So whatever you carry with you from machine to machine,
if it's capable of holding a file of your own code, you can simply put
BeautifulSoup beside it - even if it was a floppy disk.

Diez

Alnilam

1/22/2008 9:21:00 PM

On Jan 22, 11:39 am, "Diez B. Roggisch" <de...@nospam.web.de> wrote:
> Alnilam wrote:
> > On Jan 22, 8:44 am, Alnilam <alni...@gmail.com> wrote:
> >> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
> >> > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
> >> > 200-modules PyXML package installed. And you don't want the 75Kb
> >> > BeautifulSoup?
>
> >> I wasn't aware that I had PyXML installed, and can't find a reference
> >> to having it installed in pydocs. ...
>
> > Ugh. Found it. Sorry about that, but I still don't understand why
> > there isn't a simple way to do this without using PyXML, BeautifulSoup
> > or libxml2dom. What's the point in having sgmllib, htmllib,
> > HTMLParser, and formatter all built in if I have to use use someone
> > else's modules to write a couple of lines of code that achieve the
> > simple thing I want. I get the feeling that this would be easier if I
> > just broke down and wrote a couple of regular expressions, but it
> > hardly seems a 'pythonic' way of going about things.
>
> This is simply a gross misunderstanding of what BeautifulSoup or lxml
> accomplish. Dealing with mal-formatted HTML whilst trying to make _some_
> sense is by no means trivial. And just because you can come up with a few
> lines of code using rexes that work for your current use-case doesn't mean
> that they serve as general html-fixing-routine. Or do you think the rather
> long history and 75Kb of code for BS are because it's creator wasn't aware
> of rexes?
>
> And it also makes no sense stuffing everything remotely useful into the
> standard lib. This would force to align development and release cycles,
> resulting in much less features and stability as it can be wished.
>
> And to be honest: I fail to see where your problem is. BeatifulSoup is a
> single Python file. So whatever you carry with you from machine to machine,
> if it's capable of holding a file of your own code, you can simply put
> BeautifulSoup beside it - even if it was a floppy disk.
>
> Diez

I am, by no means, trying to trivialize the work that goes into
creating the numerous modules out there. However as a relatively
novice programmer trying to figure out something, the fact that these
modules are pushed on people with such zealous devotion that you take
offense at my desire to not use them gives me a bit of pause. I use
non-included modules for tasks that require them, when the capability
to do something clearly can't be done easily another way (eg.
MySQLdb). I am sure that there will be plenty of times where I will
use BeautifulSoup. In this instance, however, I was trying to solve a
specific problem which I attempted to lay out clearly from the
outset.

I was asking this community if there was a simple way to use only the
tools included with Python to parse a bit of html.

If the answer is no, that's fine. Confusing, but fine. If the answer
is yes, great. I look forward to learning from someone's example. If
you don't have an answer, or a positive contribution, then please
don't interject your angst into this thread.

Gabriel Genellina

1/23/2008 12:30:00 AM

En Tue, 22 Jan 2008 19:20:32 -0200, Alnilam <alnilam@gmail.com> escribió:

> On Jan 22, 11:39 am, "Diez B. Roggisch" <de...@nospam.web.de> wrote:
>> Alnilam wrote:
>> > On Jan 22, 8:44 am, Alnilam <alni...@gmail.com> wrote:
>> >> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
>> >> > -1)) doesn't have an xml.dom.ext ... you must have the
>> mega-monstrous
>> >> > 200-modules PyXML package installed. And you don't want the 75Kb
>> >> > BeautifulSoup?
>> > Ugh. Found it. Sorry about that, but I still don't understand why
>> > there isn't a simple way to do this without using PyXML, BeautifulSoup
>> > or libxml2dom. What's the point in having sgmllib, htmllib,
>> > HTMLParser, and formatter all built in if I have to use use someone
>> > else's modules to write a couple of lines of code that achieve the
>> > simple thing I want. I get the feeling that this would be easier if I
>> > just broke down and wrote a couple of regular expressions, but it
>> > hardly seems a 'pythonic' way of going about things.
>>
>> This is simply a gross misunderstanding of what BeautifulSoup or lxml
>> accomplish. Dealing with mal-formatted HTML whilst trying to make _some_
>> sense is by no means trivial. And just because you can come up with a
>> few
>> lines of code using rexes that work for your current use-case doesn't
>> mean
>> that they serve as general html-fixing-routine. Or do you think the
>> rather
>> long history and 75Kb of code for BS are because it's creator wasn't
>> aware
>> of rexes?
>
> I am, by no means, trying to trivialize the work that goes into
> creating the numerous modules out there. However as a relatively
> novice programmer trying to figure out something, the fact that these
> modules are pushed on people with such zealous devotion that you take
> offense at my desire to not use them gives me a bit of pause. I use
> non-included modules for tasks that require them, when the capability
> to do something clearly can't be done easily another way (eg.
> MySQLdb). I am sure that there will be plenty of times where I will
> use BeautifulSoup. In this instance, however, I was trying to solve a
> specific problem which I attempted to lay out clearly from the
> outset.
>
> I was asking this community if there was a simple way to use only the
> tools included with Python to parse a bit of html.

If you *know* that your document is valid HTML, you can use the HTMLParser
module in the standard Python library. Or even the parser in the htmllib
module. But a lot of HTML pages out there are invalid, some are grossly
invalid, and those parsers are just unable to handle them. This is why
modules like BeautifulSoup exist: they contain a lot of heuristics and
trial-and-error and personal experience from the developers, in order to
guess more or less what the page author intended to write and make some
sense of that "tag soup".
A guesswork like that is not suitable for the std lib ("Errors should
never pass silently" and "In the face of ambiguity, refuse the temptation
to guess.") but makes a perfect 3rd party module.

If you want to use regular expressions, and that works OK for the
documents you are handling now, fine. But don't complain when your RE's
match too much or too little or don't match at all because of unclosed
tags, improperly nested tags, nonsense markup, or just a valid combination
that you didn't take into account.

--
Gabriel Genellina

elijahu@gmail.com

1/23/2008 3:19:00 AM

On Jan 22, 7:29 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
wrote:

>
> > I was asking this community if there was a simple way to use only the
> > tools included with Python to parse a bit of html.
>
> If you *know* that your document is valid HTML, you can use the HTMLParser
> module in the standard Python library. Or even the parser in the htmllib
> module. But a lot of HTML pages out there are invalid, some are grossly
> invalid, and those parsers are just unable to handle them. This is why
> modules like BeautifulSoup exist: they contain a lot of heuristics and
> trial-and-error and personal experience from the developers, in order to
> guess more or less what the page author intended to write and make some
> sense of that "tag soup".
> A guesswork like that is not suitable for the std lib ("Errors should
> never pass silently" and "In the face of ambiguity, refuse the temptation
> to guess.") but makes a perfect 3rd party module.
>
> If you want to use regular expressions, and that works OK for the
> documents you are handling now, fine. But don't complain when your RE's
> match too much or too little or don't match at all because of unclosed
> tags, improperly nested tags, nonsense markup, or just a valid combination
> that you didn't take into account.
>
> --
> Gabriel Genellina

Thank you. That does make perfect sense, and is a good clear position
on the up and down side of what I'm trying to do, as well as a good
explanation for why BeautifulSoup will probably remain outside the std
lib. I'm sure that I will get plenty of use out of it.

If, however, I am sure that the html code in target documents is
good, and the framework html doesn't change, just the data on page
after page of static html, would it be better to just go with regex or
with one of the std lib items you mentioned. I thought the latter, but
I'm stuck on how to make them generate results similar to the code I
put above as an example. I'm not trying to code this to go against
html in the wild, but to try to strip specific, consistently located
data from the markup and turn it into something more useful.

I may have confused folks by using the www.diveintopython.org page as
an example, but its html seemed to be valid strict tags.

Alnilam

1/23/2008 3:41:00 AM

On Jan 22, 7:29 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
wrote:
>
> > I was asking this community if there was a simple way to use only the
> > tools included with Python to parse a bit of html.
>
> If you *know* that your document is valid HTML, you can use the HTMLParser
> module in the standard Python library. Or even the parser in the htmllib
> module. But a lot of HTML pages out there are invalid, some are grossly
> invalid, and those parsers are just unable to handle them. This is why
> modules like BeautifulSoup exist: they contain a lot of heuristics and
> trial-and-error and personal experience from the developers, in order to
> guess more or less what the page author intended to write and make some
> sense of that "tag soup".
> A guesswork like that is not suitable for the std lib ("Errors should
> never pass silently" and "In the face of ambiguity, refuse the temptation
> to guess.") but makes a perfect 3rd party module.
>
> If you want to use regular expressions, and that works OK for the
> documents you are handling now, fine. But don't complain when your RE's
> match too much or too little or don't match at all because of unclosed
> tags, improperly nested tags, nonsense markup, or just a valid combination
> that you didn't take into account.
>
> --
> Gabriel Genellina

Thanks, Gabriel. That does make sense, both what the benefits of
BeautifulSoup are and why it probably won't become std lib anytime
soon.

The pages I'm trying to write this code to run against aren't in the
wild, though. They are static html files on my company's lan, are very
consistent in format, and are (I believe) valid html. They just have
specific paragraphs of useful information, located in the same place
in each file, that I want to 'harvest' and put to better use. I used
diveintopython.org as an example only (and in part because it had good
clean html formatting). I am pretty sure that I could craft some
regular expressions to do the work -- which of course would not be the
case if I was screen scraping web pages in the 'wild' -- but I was
trying to find a way to do that using one of those std libs you
mentioned.

I'm not sure if HTMLParser or htmllib would work better to achieve the
same effect as the regex example I gave above, or how to get them to
do that. I thought I'd come close, but as someone pointed out early
on, I'd accidently tapped into PyXML which is installed where I was
testing code, but not necessarily where I need it. It may turn out
that the regex way works faster, but falling back on methods I'm
comfortable with doesn't help expand my Python knowledge.

So if anyone can tell me how to get HTMLParser or htmllib to grab a
specific paragraph, and then provide the text in that paragraph in a
clean, markup-free format, I'd appreciate it.

comp.lang.python

HTML parsing confusion

Alnilam

John Machin

Paul Boddie

Alnilam

Paul McGuire

Alnilam

Diez B. Roggisch

Alnilam

Gabriel Genellina

elijahu@gmail.com

Alnilam

x Login to ForumsZone