Dennis Lee Bieber
2/6/2008 6:34:00 PM
On Tue, 5 Feb 2008 18:54:49 -0800 (PST), Tess <testone@gmail.com>
declaimed the following in comp.lang.python:
>
> file 1:
> <item>TABLE</table>
As commented by others, the <item>....</table> pairing looks WRONG;
shouldn't it be <item>...</item>?
> <color>black</color>
> <color>blue</color>
> <color>red</color>
> <item>CHAIR</table>
> <color>yellow</color>
> <color>black</color>
> <color>red</color>
> <item>TABLE</table>
> <color>white</color>
> <color>gray</color>
> <color>pink</color>
>
Or even more completely...
<table><item>...</item><color>...</color><color>...</color></table>
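If the file really were laid out that way, with each record wrapped in its own element, a stock XML parser could read it directly. A minimal sketch, assuming Python 3, a hypothetical <inventory> root element wrapped around the records (XML needs a single root), and a made-up helper name rows_from_xml:

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed version of the data; the <inventory> root
# is an assumption added only so the text parses as one XML document.
WELL_FORMED = """<inventory>
<table><item>TABLE</item><color>black</color><color>blue</color><color>red</color></table>
<table><item>CHAIR</item><color>yellow</color><color>black</color><color>red</color></table>
</inventory>"""


def rows_from_xml(text):
    """Return one [item, color, color, ...] list per <table> record."""
    root = ET.fromstring(text)
    rows = []
    for record in root.findall("table"):
        item = record.findtext("item", default="").strip()
        colors = [c.text.strip() for c in record.findall("color") if c.text]
        rows.append([item] + colors)
    return rows


if __name__ == "__main__":
    for row in rows_from_xml(WELL_FORMED):
        print("\t".join(row))
```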
Ah well... Given all the various re-based solutions (I have yet to
find a use for regular expressions) let's toss in a different type of
complexity. You'll notice the work-arounds that were needed to handle
that mismatch of <item> and </table>.
-=-=-=-=-=-=-
import sgmllib
import csv
# Format still looks wrong... I'd expect </item>, not </table>
SAMPLE = """<item>TABLE</table>
<color>black</color>
<color>blue</color>
<color>red</color>
<item>CHAIR</table>
<color>yellow</color>
<color>black</color>
<color>red</color>
<item>TABLE</table>
<color>white</color>
<color>gray</color>
<color>pink</color>
"""

class HardWay(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.record = []
        self.inTag = False
        self.tsvfid = "PARSED%s.tsv" % id(self)
        self.tsv = open(self.tsvfid, "wb")
        self.writer = csv.writer(self.tsv, delimiter="\t")

    # if </item> is correct, this should be start_item(...)
    def do_item(self, attrs):
        if self.record:
            self.writer.writerow(self.record)
        self.record = []
        self.inTag = True

    # parser reports an unbalanced closing tag for </table>!
    # if, OTOH, </item> is proper, use end_item()
##    def end_table(self):
##        self.inTag = False

    def report_unbalanced(self, tag):
        self.inTag = False

    def start_color(self, attrs):
        self.inTag = True

    def end_color(self):
        self.inTag = False

    def handle_data(self, text):
        # don't write stray glitches from unmatched
        if self.inTag and text.strip():
            self.record.append(text.strip())

    def close(self):
        if self.record:
            self.writer.writerow(self.record)
        self.tsv.close()
        sgmllib.SGMLParser.close(self)

if __name__ == "__main__":
    parser = HardWay()
    parser.feed(SAMPLE)
    parser.close()
-=-=-=-=-=-=-=-
Output file:
-=-=-=-=-=-
TABLE black blue red
CHAIR yellow black red
TABLE white gray pink
-=-=-=-=-=-
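
The sgmllib module used above was removed in Python 3, so the listing only runs on Python 2. A rough sketch of the same event-driven idea on Python 3, using html.parser.HTMLParser from the standard library; the names ItemColorParser and to_tsv are made up for illustration, and the rows are written to an in-memory string rather than a PARSED*.tsv file. HTMLParser does not check tag balance, so the stray </table> simply arrives as an ordinary end-tag event:

```python
import csv
import io
from html.parser import HTMLParser

SAMPLE = """<item>TABLE</table>
<color>black</color>
<color>blue</color>
<color>red</color>
<item>CHAIR</table>
<color>yellow</color>
<color>black</color>
<color>red</color>
<item>TABLE</table>
<color>white</color>
<color>gray</color>
<color>pink</color>
"""


class ItemColorParser(HTMLParser):
    """Collect each <item> and the <color> values after it as one row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_tag = False

    def handle_starttag(self, tag, attrs):
        if tag == "item":
            self.rows.append([])  # every <item> starts a new record
            self.in_tag = True
        elif tag == "color":
            self.in_tag = True

    def handle_endtag(self, tag):
        # Also called for the mismatched </table>; any end tag closes the field.
        self.in_tag = False

    def handle_data(self, data):
        text = data.strip()
        # Skip whitespace between tags and anything before the first <item>.
        if self.in_tag and text and self.rows:
            self.rows[-1].append(text)


def to_tsv(markup):
    """Parse the markup and return the records as tab-separated text."""
    parser = ItemColorParser()
    parser.feed(markup)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(parser.rows)
    return out.getvalue()


if __name__ == "__main__":
    print(to_tsv(SAMPLE), end="")
```

Run as a script, this should print the same three tab-separated rows as the output file shown above.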
--
Wulfraed Dennis Lee Bieber KD6MOG
wlfraed@ix.netcom.com wulfraed@bestiaria.com
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: web-asst@bestiaria.com)
HTTP://www.bestiaria.com/