Asp Forum - Help Parsing an HTML File

egonslokar

2/15/2008 9:29:00 PM

Hello Python Community,

It'd be great if someone could provide guidance or sample code for
accomplishing the following:

I have a single unicode file that has descriptions of hundreds of
objects. The file fairly resembles HTML-EXAMPLE pasted below.

I need to parse the file in such a way to extract data out of the html
and to come up with a tab separated file that would look like OUTPUT-
FILE below.

Any tips, advice and guidance is greatly appreciated.

Thanks,

Egon

=====OUTPUT-FILE=====
/please note that the first line of the file contains column headers/
------Tab Separated Output File Begin------
H1 H2 DIV Segment1 Segment2 Segment3
RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1 RoséSegmentDIV3-1
PinkH1-2 PinkH2-2 PinkDIV2-2 PinkSegmentDIV1-2 No-Value No-Value
BlackH1-3 BlackH2-3 BlackDIV2-3 BlackSegmentDIV1-3 No-Value No-Value
YellowH1-4 YellowH2-4 YellowDIV2-4 YellowSegmentDIV1-4 YellowSegmentDIV2-4 No-Value
------Tab Separated Output File End------

=====HTML-EXAMPLE=====
------HTML Example Begin------
<html>

<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
<div>RoséDIV-1</div>
<div "segment1">RoséSegmentDIV1-1</div><br>
<div "segment2">RoséSegmentDIV2-1</div><br>
<div "segment3">RoséSegmentDIV3-1</div><br>
<br>
<br>

<h1>PinkH1-2</h1>
<h2>PinkH2-2</h2>
<div>PinkDIV2-2</div>
<div "segment1">PinkSegmentDIV1-2</div><br>
<br>
<comment></comment>

<h1>BlackH1-3</h1>
<h2>BlackH2-3</h2>
<div>BlackDIV2-3</div>
<div "segment1">BlackSegmentDIV1-3</div><br>

<h1>YellowH1-4</h1>
<h2>YellowH2-4</h2>
<div>YellowDIV2-4</div>
<div "segment1">YellowSegmentDIV1-4</div><br>
<div "segment2">YellowSegmentDIV2-4</div><br>

</html>
------HTML Example End------

6 Answers

Tim Chase

2/15/2008 9:43:00 PM

> I need to parse the file in such a way to extract data out of the html
> and to come up with a tab separated file that would look like OUTPUT-
> FILE below.

BeautifulSoup[1]. Your one-stop-shop for all your HTML parsing
needs.

What you do with the parsed data is an exercise left to the
reader, but it's navigable.

-tkc

[1] http://www.crummy.com/software/Beau...

Mike Driscoll

2/15/2008 10:07:00 PM

7stud --

2/16/2008 2:57:00 AM

On Feb 15, 2:28 pm, egonslo...@gmail.com wrote:
> Hello Python Community,
>
> It'd be great if someone could provide guidance or sample code for
> accomplishing the following:
>
> I have a single unicode file that has descriptions of hundreds of
> objects. The file fairly resembles HTML-EXAMPLE pasted below.
>
> I need to parse the file in such a way to extract data out of the html
> and to come up with a tab separated file that would look like OUTPUT-
> FILE below.
>
> Any tips, advice and guidance is greatly appreciated.
>
> Thanks,
>
> Egon
>
> =====OUTPUT-FILE=====
> /please note that the first line of the file contains column headers/
> ------Tab Separated Output File Begin------
> H1 H2 DIV Segment1 Segment2 Segment3
> RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1
> RoséSegmentDIV3-1
> PinkH1-2 PinkH2-2 PinkDIV2-2 PinkSegmentDIV1-2 No-Value No-Value
> BlackH1-3 BlackH2-3 BlackDIV2-3 BlackSegmentDIV1-3 No-Value No-Value
> YellowH1-4 YellowH2-4 YellowDIV2-4 YellowSegmentDIV1-4
> YellowSegmentDIV2-4 No-Value
> ------Tab Separated Output File End------
>
> =====HTML-EXAMPLE=====
> ------HTML Example Begin------
> <html>
>
> <h1>RoséH1-1</h1>
> <h2>RoséH2-1</h2>
> <div>RoséDIV-1</div>
> <div "segment1">RoséSegmentDIV1-1</div><br>
> <div "segment2">RoséSegmentDIV2-1</div><br>
> <div "segment3">RoséSegmentDIV3-1</div><br>
> <br>
> <br>
>
> <h1>PinkH1-2</h1>
> <h2>PinkH2-2</h2>
> <div>PinkDIV2-2</div>
> <div "segment1">PinkSegmentDIV1-2</div><br>
> <br>
> <comment></comment>
>
> <h1>BlackH1-3</h1>
> <h2>BlackH2-3</h2>
> <div>BlackDIV2-3</div>
> <div "segment1">BlackSegmentDIV1-3</div><br>
>
> <h1>YellowH1-4</h1>
> <h2>YellowH2-4</h2>
> <div>YellowDIV2-4</div>
> <div "segment1">YellowSegmentDIV1-4</div><br>
> <div "segment2">YellowSegmentDIV2-4</div><br>
>
> </html>
> ------HTML Example End------

BeautifulSoup won't help much because the 'attributes' in the tags
are not really attributes, and therefore BeautifulSoup ignores them.
As a result, you'll end up just processing the file line by line.
That can be done just as easily without BeautifulSoup. Based on the
example file you posted, all that is required is a simple regex to
match the text between the opening and closing tags on each line, and
then outputting the data in the order you find it. Pad the end of each
block of data with some No-Values, and you have your desired results.

Post some code with your efforts.
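
The line-by-line regex approach described above might look roughly like
the sketch below. It is written for Python 3 and uses only the standard
library. The names LINE_RE, HEADER, rows_from_text and to_tsv are made up
for this illustration, and it assumes the real file keeps one element per
line and lists the columns in a fixed order, as the posted example does.

```python
import re

# One element per line, e.g. <div "segment1">text</div>.
# The quoted pseudo-attribute is skipped; only tag name and text are kept.
LINE_RE = re.compile(r'<(h1|h2|div)\b[^>]*>(.*?)</\1>', re.IGNORECASE)

HEADER = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]


def rows_from_text(text, filler="No-Value"):
    """Yield one padded list of column values per <h1> block."""
    row = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # <br>, <comment>, <html> and blank lines
        tag, value = match.group(1).lower(), match.group(2)
        if tag == "h1" and row:
            # a new <h1> starts a new record: flush the previous one
            yield row + [filler] * (len(HEADER) - len(row))
            row = []
        row.append(value)
    if row:
        yield row + [filler] * (len(HEADER) - len(row))


def to_tsv(text):
    """Return the header line plus one tab-separated line per record."""
    lines = ["\t".join(HEADER)]
    lines.extend("\t".join(row) for row in rows_from_text(text))
    return "\n".join(lines)


sample = """<html>

<h1>PinkH1-2</h1>
<h2>PinkH2-2</h2>
<div>PinkDIV2-2</div>
<div "segment1">PinkSegmentDIV1-2</div><br>
<br>
<comment></comment>

<h1>YellowH1-4</h1>
<h2>YellowH2-4</h2>
<div>YellowDIV2-4</div>
<div "segment1">YellowSegmentDIV1-4</div><br>
<div "segment2">YellowSegmentDIV2-4</div><br>

</html>"""

print(to_tsv(sample))
```

For a real file, the text would be read with an explicit encoding (for
example open(path, encoding="utf-8").read(), if the file is UTF-8) so that
characters such as é survive the round trip.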

Stefan Behnel

2/16/2008 7:42:00 AM

egonslokar@gmail.com wrote:
> I have a single unicode file that has descriptions of hundreds of
> objects. The file fairly resembles HTML-EXAMPLE pasted below.
>
> I need to parse the file in such a way to extract data out of the html
> and to come up with a tab separated file that would look like OUTPUT-
> FILE below.
>
> =====OUTPUT-FILE=====
> /please note that the first line of the file contains column headers/
> ------Tab Separated Output File Begin------
> H1 H2 DIV Segment1 Segment2 Segment3
> RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1
> ------Tab Separated Output File End------
>
> =====HTML-EXAMPLE=====
> ------HTML Example Begin------
> <html>
>
> <h1>RoséH1-1</h1>
> <h2>RoséH2-1</h2>
> <div>RoséDIV-1</div>
> <div "segment1">RoséSegmentDIV1-1</div><br>
> <div "segment2">RoséSegmentDIV2-1</div><br>
> <div "segment3">RoséSegmentDIV3-1</div><br>
> <br>
> <br>
>
> </html>
> ------HTML Example End------

Now, what ugly markup is that? You will never manage to get any HTML-compliant
parser to return the "segmentX" stuff in there. I think your best bet is really
going for pyparsing or regular expressions (and I actually recommend pyparsing
here).

Stefan

Peter Otten

2/16/2008 9:11:00 AM

Stefan Behnel wrote:

> egonslokar@gmail.com wrote:
>> I have a single unicode file that has descriptions of hundreds of
>> objects. The file fairly resembles HTML-EXAMPLE pasted below.
>>
>> I need to parse the file in such a way to extract data out of the html
>> and to come up with a tab separated file that would look like OUTPUT-
>> FILE below.
>>
>> =====OUTPUT-FILE=====
>> /please note that the first line of the file contains column headers/
>> ------Tab Separated Output File Begin------
>> H1 H2 DIV Segment1 Segment2 Segment3
>> RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1
>> ------Tab Separated Output File End------
>>
>> =====HTML-EXAMPLE=====
>> ------HTML Example Begin------
>> <html>
>>
>> <h1>RoséH1-1</h1>
>> <h2>RoséH2-1</h2>
>> <div>RoséDIV-1</div>
>> <div "segment1">RoséSegmentDIV1-1</div><br>
>> <div "segment2">RoséSegmentDIV2-1</div><br>
>> <div "segment3">RoséSegmentDIV3-1</div><br>
>> <br>
>> <br>
>>
>> </html>
>> ------HTML Example End------
>
> Now, what ugly markup is that? You will never manage to get any HTML
> compliant parser return the "segmentX" stuff in there. I think your best
> bet is really going for pyparsing or regular expressions (and I actually
> recommend pyparsing here).
>
> Stefan

In practice the following might be sufficient:

from BeautifulSoup import BeautifulSoup

def chunks(bs):
    chunk = []
    for tag in bs.findAll(["h1", "h2", "div"]):
        if tag.name == "h1":
            if chunk:
                yield chunk
            chunk = []
        chunk.append(tag)
    if chunk:
        yield chunk

def process(filename):
    bs = BeautifulSoup(open(filename))
    for chunk in chunks(bs):
        columns = [tag.string for tag in chunk]
        columns += ["No Value"] * (6 - len(columns))
        print "\t".join(columns)

if __name__ == "__main__":
    process("example.html")

The biggest caveat is that only columns at the end of a row may be left out.

Peter
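
One way around the caveat above is to key each value on its column name
instead of its position, so that a missing middle column is filled in
rather than shifting the later ones. The following is only a sketch under
assumptions: it is Python 3, it uses a regular expression instead of
BeautifulSoup, the names ELEMENT_RE, COLUMNS, records and to_rows are
invented here, and the "Green" record is a hypothetical input that does
not appear in the original post.

```python
import re

# Captures the tag name, the optional quoted pseudo-attribute, and the text.
ELEMENT_RE = re.compile(
    r'<(h1|h2|div)(?:\s+"([^"]*)")?\s*>(.*?)</\1>',
    re.IGNORECASE | re.DOTALL,
)

COLUMNS = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]


def records(text):
    """Yield one {column: value} dict per <h1> block."""
    record = None
    for tag, name, value in ELEMENT_RE.findall(text):
        # <div "segment2"> maps to column "Segment2";
        # tags without a quoted name map to the upper-cased tag name.
        column = name.title() if name else tag.upper()
        if column == "H1":
            if record is not None:
                yield record
            record = {}
        if record is not None:  # ignore anything before the first <h1>
            record[column] = value.strip()
    if record is not None:
        yield record


def to_rows(text, filler="No-Value"):
    """Return one list per record, with missing columns filled in."""
    return [[rec.get(col, filler) for col in COLUMNS] for rec in records(text)]


sample = """
<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
<div>RoséDIV-1</div>
<div "segment1">RoséSegmentDIV1-1</div><br>
<div "segment2">RoséSegmentDIV2-1</div><br>
<div "segment3">RoséSegmentDIV3-1</div><br>
<br>
<h1>GreenH1-5</h1>
<h2>GreenH2-5</h2>
<div>GreenDIV-5</div>
<div "segment2">GreenSegmentDIV2-5</div><br>
"""

for row in to_rows(sample):
    print("\t".join(row))
```

Note that if a record contained two plain <div> elements, the second would
overwrite the first in this sketch, so it only fits input shaped like the
posted example.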

Paul McGuire

2/17/2008 4:07:00 AM

On Feb 15, 3:28 pm, egonslo...@gmail.com wrote:
> Hello Python Community,
>
> It'd be great if someone could provide guidance or sample code for
> accomplishing the following:
>
> I have a single unicode file that has descriptions of hundreds of
> objects. The file fairly resembles HTML-EXAMPLE pasted below.
>

Pyparsing was mentioned earlier; here is a sample with some annotating
comments.

I'm a little worried when you say the file "fairly resembles HTML-
EXAMPLE." With parsers, the devil is in the details, and if you have
scrambled this format - the HTML attributes are especially suspicious
- then the parser will need to be cleaned up to match the real input.
If the file being parsed really has proper HTML attributes (of the
form <tag attrname="attrvalue">), then you could simplify the code to
use the pyparsing method makeHTMLTags. But the example I wrote
matches the example you posted.

-- Paul

# encoding=utf-8

from pyparsing import *

data = """
<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
... snip ...
"""
# define <XXX> and </XXX> tags
CL = CaselessLiteral
h1,h2,cmnt,br = map(Suppress,
    map(CL,["<%s>" % s for s in "h1 h2 comment br".split()]))
h1end,h2end,cmntEnd,divEnd = map(Suppress,
    map(CL,["</%s>" % s for s in "h1 h2 comment div".split()]))
# h1,h1end = makeHTMLTags("h1")

# define special format for <div>, incl. optional quoted string "attribute"
div = "<" + CL("div") + Optional(QuotedString('"'))("name") + ">"
div.setParseAction(
    lambda toks: "name" in toks and toks.name.title() or "DIV")

# define <xxx>body</xxx> entries
h1Entry = h1 + SkipTo(h1end) + h1end
h2Entry = h2 + SkipTo(h2end) + h2end
comment = cmnt + SkipTo(cmntEnd) + cmntEnd
divEntry = div + SkipTo(divEnd) + divEnd

# just return nested tokens
grammar = (OneOrMore(Group(h1Entry +
              (Group(h2Entry +
                  (OneOrMore(Group(divEntry))))))))
grammar.ignore(br)
grammar.ignore(comment)

results = grammar.parseString(data)
from pprint import pprint
pprint(results.asList())
print

# return nested tokens, with dict
grammar = Dict(OneOrMore(Group( h1Entry +
              Dict(Group(h2Entry +
                  Dict(OneOrMore(Group(divEntry))))))))
grammar.ignore(br)
grammar.ignore(comment)
results = grammar.parseString(data)
print results.dump()

Prints:

[['Ros\xe9H1-1',
  ['Ros\xe9H2-1',
   ['DIV', 'Ros\xe9DIV-1'],
   ['Segment1', 'Ros\xe9SegmentDIV1-1'],
   ['Segment2', 'Ros\xe9SegmentDIV2-1'],
   ['Segment3', 'Ros\xe9SegmentDIV3-1']]],
 ['PinkH1-2',
  ['PinkH2-2', ['DIV', 'PinkDIV2-2'], ['Segment1', 'PinkSegmentDIV1-2']]],
 ['BlackH1-3',
  ['BlackH2-3', ['DIV', 'BlackDIV2-3'], ['Segment1', 'BlackSegmentDIV1-3']]],
 ['YellowH1-4',
  ['YellowH2-4',
   ['DIV', 'YellowDIV2-4'],
   ['Segment1', 'YellowSegmentDIV1-4'],
   ['Segment2', 'YellowSegmentDIV2-4']]]]

[['Ros\xe9H1-1', ['Ros\xe9H2-1', ['DIV', 'Ros\xe9DIV-1'], ['Segment1',
'Ros\xe9SegmentDIV1-1'], ['Segment2', 'Ros\xe9SegmentDIV2-1'],
['Segment3', 'Ros\xe9SegmentDIV3-1']]], ['PinkH1-2', ['PinkH2-2',
['DIV', 'PinkDIV2-2'], ['Segment1', 'PinkSegmentDIV1-2']]],
['BlackH1-3', ['BlackH2-3', ['DIV', 'BlackDIV2-3'], ['Segment1',
'BlackSegmentDIV1-3']]], ['YellowH1-4', ['YellowH2-4', ['DIV',
'YellowDIV2-4'], ['Segment1', 'YellowSegmentDIV1-4'], ['Segment2',
'YellowSegmentDIV2-4']]]]
- BlackH1-3: [['BlackH2-3', ['DIV', 'BlackDIV2-3'], ['Segment1', 'BlackSegmentDIV1-3']]]
  - BlackH2-3: [['DIV', 'BlackDIV2-3'], ['Segment1', 'BlackSegmentDIV1-3']]
    - DIV: BlackDIV2-3
    - Segment1: BlackSegmentDIV1-3
- PinkH1-2: [['PinkH2-2', ['DIV', 'PinkDIV2-2'], ['Segment1', 'PinkSegmentDIV1-2']]]
  - PinkH2-2: [['DIV', 'PinkDIV2-2'], ['Segment1', 'PinkSegmentDIV1-2']]
    - DIV: PinkDIV2-2
    - Segment1: PinkSegmentDIV1-2
- RoséH1-1: [['Ros\xe9H2-1', ['DIV', 'Ros\xe9DIV-1'], ['Segment1', 'Ros\xe9SegmentDIV1-1'], ['Segment2', 'Ros\xe9SegmentDIV2-1'], ['Segment3', 'Ros\xe9SegmentDIV3-1']]]
  - RoséH2-1: [['DIV', 'Ros\xe9DIV-1'], ['Segment1', 'Ros\xe9SegmentDIV1-1'], ['Segment2', 'Ros\xe9SegmentDIV2-1'], ['Segment3', 'Ros\xe9SegmentDIV3-1']]
    - DIV: RoséDIV-1
    - Segment1: RoséSegmentDIV1-1
    - Segment2: RoséSegmentDIV2-1
    - Segment3: RoséSegmentDIV3-1
- YellowH1-4: [['YellowH2-4', ['DIV', 'YellowDIV2-4'], ['Segment1', 'YellowSegmentDIV1-4'], ['Segment2', 'YellowSegmentDIV2-4']]]
  - YellowH2-4: [['DIV', 'YellowDIV2-4'], ['Segment1', 'YellowSegmentDIV1-4'], ['Segment2', 'YellowSegmentDIV2-4']]
    - DIV: YellowDIV2-4
    - Segment1: YellowSegmentDIV1-4
    - Segment2: YellowSegmentDIV2-4
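
To get from the nested list printed above to the tab-separated rows asked
for in the original post, something along these lines may be enough. This
is a Python 3 sketch, not part of the pyparsing sample: the names
SEGMENT_COLUMNS and flatten are invented here, and the nested literal just
copies two of the records from the printed asList() output.

```python
SEGMENT_COLUMNS = ["DIV", "Segment1", "Segment2", "Segment3"]


def flatten(nested, filler="No-Value"):
    """Turn [[h1, [h2, [name, value], ...]], ...] into flat rows."""
    rows = []
    for h1, (h2, *pairs) in nested:
        values = dict(pairs)  # e.g. {"DIV": ..., "Segment1": ...}
        rows.append(
            [h1, h2] + [values.get(col, filler) for col in SEGMENT_COLUMNS]
        )
    return rows


# Two records taken from the printed list above.
nested = [
    ["RoséH1-1",
     ["RoséH2-1",
      ["DIV", "RoséDIV-1"],
      ["Segment1", "RoséSegmentDIV1-1"],
      ["Segment2", "RoséSegmentDIV2-1"],
      ["Segment3", "RoséSegmentDIV3-1"]]],
    ["PinkH1-2",
     ["PinkH2-2", ["DIV", "PinkDIV2-2"], ["Segment1", "PinkSegmentDIV1-2"]]],
]

print("\t".join(["H1", "H2"] + SEGMENT_COLUMNS))
for row in flatten(nested):
    print("\t".join(row))
```

Because the values are looked up by name, a record that lacks a segment in
the middle would still line up with the right column.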

comp.lang.python
