Odysseus
2/4/2008 12:25:00 PM
In article <13qd6ec9vv1qv9a@corp.supernews.com>,
Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
<snip>
> Rather complicated description... A sample of the real/actual input
> /file/ would be useful.
Sorry, I didn't want to go on too long about the background, but I guess
more context would have helped. The data actually come from a web page;
I use a class based on SGMLParser to do the initial collection. The
items in the "names" list were originally "title" attributes of anchor
tags and are obtained with a "start_a" method, while "cells" holds the
contents of the <td> tags, obtained by a "handle_data" method according
to the state of a flag that's set to True by a "start_td" method and to
False by an "end_td". I don't care about anything else on the page, so I
didn't define most of the tag-specific methods available.
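For concreteness, here is a minimal sketch of that kind of collector. It is not the actual class: sgmllib was removed in Python 3, so this version is built on html.parser.HTMLParser instead, where the tag-specific "start_a", "start_td" and "end_td" methods become branches inside handle_starttag and handle_endtag. The class name "CellCollector" and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect anchor titles and <td> text (hypothetical stand-in class)."""

    def __init__(self):
        super().__init__()
        self.names = []      # "title" attributes of <a> tags
        self.cells = []      # text found inside <td> tags
        self.in_td = False   # the flag described above

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs arrives as a list of (name, value) pairs
            title = dict(attrs).get("title")
            if title is not None:
                self.names.append(title)
        elif tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Only keep text seen while inside a table cell
        if self.in_td:
            self.cells.append(data)


# Made-up sample input, just to exercise the class
sample = '<table><tr><td><a title="Alpha">A</a></td><td>1,234</td></tr></table>'
collector = CellCollector()
collector.feed(sample)
collector.close()
```

After this runs, the collected values are available as "collector.names" and "collector.cells", much like "captured.names" and "captured.cells".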
<snip>
> cellRoot = 10 * i + na #where did na come from?
> #heck, where do names and cells
> #come from? Globals? Not recommended..
The variable "na" is the number of 'not applicable' items (headings and
whatnot) preceding the data I'm interested in.
I'm not clear on what makes an object global, other than appearing as an
operand of a "global" statement, which I don't use anywhere. But "na" is
assigned its value in the program body, not within any function: does
that make it global? Why is this not recommended? If I wrap the
assignment in a function, making "na" a local variable, how can
"extract_data" then access it?
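A small sketch of the distinction being asked about, with made-up function names: a name assigned in the program body is a module-level (global) name and can be read inside any function in that module without a "global" statement, whereas a local variable reaches another function only by being passed as an argument.

```python
na = 3  # assigned in the program body, so this is a module-level (global) name


def cell_root_global(i):
    # Reading a module-level name works without any "global" statement;
    # the statement is only needed to rebind the name from inside a function.
    return 10 * i + na


def cell_root(i, na):
    # Here na is a parameter, so the caller supplies the value explicitly.
    return 10 * i + na


def main():
    offset = 3                   # local to main()
    return cell_root(2, offset)  # a local gets to another function as an argument
```

The second form is the one usually recommended, since the function's result then depends only on what is passed in, which makes it easier to test and reuse.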
The lists of data are attributes (?) of my SGMLParser class; in my
misguided attempt to pare irrelevant details from "extract_data" I
obfuscated this aspect. I have a "parse_page(url)" function that returns
an instance of the class, as "captured", and the lists in question are
actually called "captured.names" and "captured.cells". The
"parse_page(url)" function is called in the program body; does that make
its output global as well?
> use
>
> def extract_data(names, na, cells):
>
> and
>
> return <something>
What should it return? A Boolean indicating success or failure? All the
data I want should have been stored in the "found" dictionary by the
time the function finishes traversing the list of names.
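One plausible reading of the suggestion is that the function builds the dictionary itself and returns it, rather than filling one that lives outside. A sketch under stated assumptions: the "10 * i + na" layout and the offset 5 for "time" come from the thread, while the offsets 6 and 7 for the two scores, and the idea that names are numbered from zero, are guesses made only for illustration.

```python
def extract_data(names, na, cells):
    """Return a dictionary of per-name fields instead of filling a global."""
    found = {}
    for i, name in enumerate(names):
        cell_root = 10 * i + na  # ten cells per name, after na leading cells
        found[name] = {
            "time": cells[cell_root + 5],    # offset 5 appears in the thread
            "score1": cells[cell_root + 6],  # offsets 6 and 7 are assumptions
            "score2": cells[cell_root + 7],
        }
    return found
```

The caller would then write something like "found = extract_data(captured.names, na, captured.cells)", so that everything the function needs comes in through its arguments and everything it produces goes out through the return value.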
> > for k in ('time', 'score1', 'score2'):
> >     v = found[name][k]
> >     if v != "---" and v != "n/a": # skip non-numeric data
> >         v = ''.join(v.split(",")) # remove commas between 000s
> >         found[name][k] = float(v)
>
> I'd suggest splitting this into a short function, and invoking it in
> the preceding... say it is called "parsed"
>
> "time" : parsed(cells[cellRoot + 5]),
Will do. I guess part of my problem is that, being unsure of myself, I'm
reluctant to attempt too much in a single complex statement, finding it
easier to take small and simple (but inefficient) steps. I'll have to
learn to consolidate things as I go.
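A sketch of what such a "parsed" helper might look like, keeping the behaviour of the quoted loop: the placeholders "---" and "n/a" are passed through unchanged, and anything else has its thousands separators stripped before conversion to a float.

```python
def parsed(text):
    """Convert a cell's text to a float, leaving placeholders untouched."""
    if text in ("---", "n/a"):
        return text  # non-numeric placeholder: keep it as it is
    return float(text.replace(",", ""))  # drop commas between the 000s
```

Returning None for the placeholders would be a reasonable alternative, but that would change what ends up in the dictionary compared with the original loop.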
> Did you check the library for time/date parsing/formatting
> operations?
>
> >>> import time
> >>> aTime = "03 Feb 2008 20:35:46 UTC" #DD Mth YYYY HH:MM:SS UTC
> >>> time.strptime(aTime, "%d %b %Y %H:%M:%S %Z")
> (2008, 2, 3, 20, 35, 46, 6, 34, 0)
I looked at the documentation for the "time" module, including
"strptime", but I didn't realize the "%b" directive would match the
month abbreviations I'm dealing with. It's described as "Locale's
abbreviated month name"; if someone were to run my program on, say, a
French system, wouldn't it try to find a match among "jan", "fév", ...,
"déc" (or whatever) and fail? Is there a way to declare a "locale" that
will override the user's settings? Are the locale-specific strings
documented anywhere? Can one assume them to be identical in all
English-speaking countries, at least? Now it's pretty unlikely in this
case that such an 'international situation' will arise, but I didn't
want to burn any bridges ...
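On overriding the locale, one possibility is the standard "locale" module. The sketch below pins the time-related category to the portable "C" locale, whose month abbreviations are the English "Jan", "Feb" and so on. As far as I know a Python program does not adopt the user's time-formatting locale unless it asks for it with locale.setlocale(locale.LC_ALL, ""), so the explicit call is mostly a safeguard; treat that as an assumption worth checking on the platform in question.

```python
import locale
import time

# Pin LC_TIME to the "C" locale so that %b matches "Jan", "Feb", ...
# whatever the user's environment requests. The setting is process-wide.
locale.setlocale(locale.LC_TIME, "C")

stamp = time.strptime("03 Feb 2008 20:35:46", "%d %b %Y %H:%M:%S")
```

Because setlocale changes the whole process, it is best called once near the start of the program rather than around individual parsing calls.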
I was also somewhat put off "strptime" on reading the caveat "Note: This
function relies entirely on the underlying platform's C library for the
date parsing, and some of these libraries are buggy. There's nothing to
be done about this short of a new, portable implementation of
strptime()." If it works, however, it'll be a lot tidier than what I was
doing. I'll make a point of testing it on its own, with a variety of
inputs.
> Note that the %Z is a problematic entry...
> ValueError: time data did not match format: data=03 Feb 2008
> 20:35:46 PST fmt=%d %b %Y %H:%M:%S %Z
All the times are UTC, so fortunately this is a non-issue for my
purposes at the moment. May I assume that leaving the zone out will
cause the time to be treated as UTC?
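On that last question, a sketch of how it appears to work: with no %Z in the format, the fields that "strptime" returns carry no usable zone information, so whether they count as UTC depends on the function they are handed to next. calendar.timegm reads them as UTC, whereas time.mktime reads them as local time.

```python
import calendar
import time

stamp = time.strptime("03 Feb 2008 20:35:46", "%d %b %Y %H:%M:%S")

# calendar.timegm interprets the fields as UTC and returns seconds since
# the epoch; time.mktime(stamp) would interpret the same fields as local time.
seconds = calendar.timegm(stamp)

# Converting back with gmtime recovers the original UTC fields.
round_trip = time.gmtime(seconds)
```

So the time is not treated as UTC automatically, but sticking to calendar.timegm and time.gmtime keeps everything in UTC throughout.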
Thanks for your help, and for bearing with my elementary questions and
my fumbling about.
--
Odysseus