comp.lang.python

joining strings question

patrick.waldo

2/29/2008 3:09:00 PM

Hi all,

I have some data with some categories, titles, subtitles, and a link
to their pdf and I need to join the title and the subtitle for every
file and divide them into their separate groups.

So the data comes in like this:

data = ['RULES', 'title', 'subtitle', 'pdf',
        'title1', 'subtitle1', 'pdf1',
        'NOTICES', 'title2', 'subtitle2', 'pdf',
        'title3', 'subtitle3', 'pdf']

What I'd like to see is this:

['RULES', 'title subtitle', 'pdf', 'title1 subtitle1', 'pdf1'],
['NOTICES', 'title2 subtitle2', 'pdf', 'title3 subtitle3', 'pdf'], etc...

I've racked my brain for a while about this and I can't seem to figure
it out. Any ideas would be much appreciated.

Thanks
12 Answers

Bruno Desthuilliers

2/29/2008 3:29:00 PM


patrick.waldo@gmail.com wrote:
> Hi all,
>
> I have some data with some categories, titles, subtitles, and a link
> to their pdf and I need to join the title and the subtitle for every
> file and divide them into their separate groups.
>
> So the data comes in like this:
>
> data = ['RULES', 'title','subtitle','pdf',
> 'title1','subtitle1','pdf1','NOTICES','title2','subtitle2','pdf','title3','subtitle3','pdf']
>
> What I'd like to see is this:
>
> ['RULES', 'title subtitle','pdf', 'title1 subtitle1','pdf1'],
> ['NOTICES','title2 subtitle2','pdf','title3 subtitle3','pdf'], etc...

I don't know where your data come from, but the data structure is
obviously wrong. It should at least be a list of tuples, ie:

data = [
    ('RULES', [
        ('title', 'subtitle', 'pdf'),
        ('title1', 'subtitle1', 'pdf1'),
    ]),
    ('NOTICES', [
        ('title2', 'subtitle2', 'pdf'),
        ('title3', 'subtitle3', 'pdf'),
    ]),
]


> I've racked my brain for a while about this and I can't seem to figure
> it out. Any ideas would be much appreciated.

If possible, fix the code generating the dataset. Any other solution
will be at best a dirty - and brittle - hack.
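If the flat list really can't be avoided, here is a minimal sketch of massaging it into that nested shape (assuming, as elsewhere in the thread, that all-uppercase items mark categories; Python 3 syntax):

```python
def nest(flat):
    """Group a flat [CAT, title, subtitle, pdf, ...] list into
    [(CAT, [(title, subtitle, pdf), ...]), ...]."""
    result = []
    buf = []
    for item in flat:
        if item.isupper():          # category marker
            result.append((item, []))
            buf = []
        else:
            buf.append(item)
            if len(buf) == 3:       # title, subtitle, pdf complete
                result[-1][1].append(tuple(buf))
                buf = []
    return result

data = ['RULES', 'title', 'subtitle', 'pdf',
        'title1', 'subtitle1', 'pdf1',
        'NOTICES', 'title2', 'subtitle2', 'pdf',
        'title3', 'subtitle3', 'pdf']
print(nest(data))
```

But the point stands: fixing the producer is better than patching the consumer.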


Tim Chase

2/29/2008 3:35:00 PM


> I have some data with some categories, titles, subtitles, and a link
> to their pdf and I need to join the title and the subtitle for every
> file and divide them into their separate groups.
>
> So the data comes in like this:
>
> data = ['RULES', 'title','subtitle','pdf',
> 'title1','subtitle1','pdf1','NOTICES','title2','subtitle2','pdf','title3','subtitle3','pdf']
>
> What I'd like to see is this:
>
> ['RULES', 'title subtitle','pdf', 'title1 subtitle1','pdf1'],
> ['NOTICES','title2 subtitle2','pdf','title3 subtitle3','pdf'], etc...

The following iterator yields things that look like each of those
items:

def category_iterator(source):
    source = iter(source)
    last_cat = None
    entries = []
    try:
        while True:
            item = source.next()
            if item == item.upper():  # categories are uppercase
                if last_cat:
                    yield [last_cat] + entries
                last_cat = item
                entries = []
            else:
                title = item
                subtitle = source.next()
                link = source.next()
                entries.append('%s %s' % (title, subtitle))
                entries.append(link)
    except StopIteration:
        if last_cat:
            yield [last_cat] + entries

if __name__ == '__main__':
    data = ['RULES',
            'title', 'subtitle', 'pdf',
            'title1', 'subtitle1', 'pdf1',
            'NOTICES',
            'title2', 'subtitle2', 'pdf',
            'title3', 'subtitle3', 'pdf']
    for compact_category_info in category_iterator(data):
        print repr(compact_category_info)

If your input data is malformed, you may get peculiar results
depending on how pathologically malformed that data is.
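For readers on Python 3, the same generator can be sketched with next(source) and the print function, slightly restructured so the loop ends cleanly instead of letting StopIteration escape (which Python 3.7+ generators no longer allow); same logic, same caveats about malformed input:

```python
def category_iterator(source):
    """Yield [CATEGORY, 'title subtitle', pdf, ...] groups from a flat list."""
    source = iter(source)
    last_cat = None
    entries = []
    for item in source:
        if item == item.upper():      # categories are uppercase
            if last_cat:
                yield [last_cat] + entries
            last_cat = item
            entries = []
        else:
            title = item
            subtitle = next(source)   # .next() became next() in Python 3
            link = next(source)
            entries.append('%s %s' % (title, subtitle))
            entries.append(link)
    if last_cat:
        yield [last_cat] + entries

data = ['RULES', 'title', 'subtitle', 'pdf',
        'title1', 'subtitle1', 'pdf1',
        'NOTICES', 'title2', 'subtitle2', 'pdf',
        'title3', 'subtitle3', 'pdf']
for group in category_iterator(data):
    print(group)
```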

Hope this helps,

-tkc



Steve Holden

2/29/2008 3:35:00 PM


patrick.waldo@gmail.com wrote:
> Hi all,
>
> I have some data with some categories, titles, subtitles, and a link
> to their pdf and I need to join the title and the subtitle for every
> file and divide them into their separate groups.
>
> So the data comes in like this:
>
> data = ['RULES', 'title','subtitle','pdf',
> 'title1','subtitle1','pdf1','NOTICES','title2','subtitle2','pdf','title3','subtitle3','pdf']
>
> What I'd like to see is this:
>
> ['RULES', 'title subtitle','pdf', 'title1 subtitle1','pdf1'],
> ['NOTICES','title2 subtitle2','pdf','title3 subtitle3','pdf'], etc...
>
> I've racked my brain for a while about this and I can't seem to figure
> it out. Any ideas would be much appreciated.
>
> Thanks

data = ['RULES', 'title', 'subtitle', 'pdf',
        'title1', 'subtitle1', 'pdf1',
        'NOTICES', 'title2', 'subtitle2', 'pdf',
        'title3', 'subtitle3', 'pdf']
olist = []
while data:
    if data[0] == data[0].upper():
        olist.append([data[0]])
        del data[0]
    else:
        olist[-1].append(data[0] + ' ' + data[1])
        olist[-1].append(data[2])
        del data[:3]
print olist


However, I suspect you should be asking yourself whether this is really
an appropriate data structure for your needs. If you say what you are
trying to achieve in the large rather than focusing on a limited
programming issue there may be much better solutions.

I suspect, for example, that a dict indexed by the categories and with
the entries each containing a list of tuples might suit your needs much
better, i.e.

{'RULES': [('title subtitle', 'pdf'),
           ('title1 subtitle1', 'pdf1')],
 'NOTICES': [('title2 subtitle2', 'pdf'),
             ('title3 subtitle3', 'pdf')]}

One final observation: if all the files are PDFs then you might just as
well throw the 'pdf' strings away and use a constant extension when you
try and open them or whatever :-). Then the lists of tuples in the dict
example could just become lists of strings.
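A rough sketch of building that dict from the original flat list (Python 3 syntax; it assumes the list starts with an all-uppercase category name, as in the sample data):

```python
def to_dict(flat):
    """Collect {CATEGORY: [('title subtitle', 'pdf'), ...]} from a flat list.

    Assumes the first item is a category; otherwise 'current' is unbound.
    """
    result = {}
    i = 0
    while i < len(flat):
        if flat[i].isupper():
            current = result.setdefault(flat[i], [])
            i += 1
        else:
            # title + subtitle joined, pdf kept alongside
            current.append((flat[i] + ' ' + flat[i + 1], flat[i + 2]))
            i += 3
    return result

data = ['RULES', 'title', 'subtitle', 'pdf',
        'title1', 'subtitle1', 'pdf1',
        'NOTICES', 'title2', 'subtitle2', 'pdf',
        'title3', 'subtitle3', 'pdf']
print(to_dict(data))
```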

regards
Steve

--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.hold...

Robert Bossy

2/29/2008 3:55:00 PM


patrick.waldo@gmail.com wrote:
> Hi all,
>
> I have some data with some categories, titles, subtitles, and a link
> to their pdf and I need to join the title and the subtitle for every
> file and divide them into their separate groups.
>
> So the data comes in like this:
>
> data = ['RULES', 'title','subtitle','pdf',
> 'title1','subtitle1','pdf1','NOTICES','title2','subtitle2','pdf','title3','subtitle3','pdf']
>
> What I'd like to see is this:
>
> ['RULES', 'title subtitle','pdf', 'title1 subtitle1','pdf1'],
> ['NOTICES','title2 subtitle2','pdf','title3 subtitle3','pdf'], etc...
>
> I've racked my brain for a while about this and I can't seem to figure
> it out. Any ideas would be much appreciated.
>
As others have already said, the data structure is quite unfit. Therefore I
give you one of the ugliest pieces of code I've produced in years:

r = []
for i in xrange(0, len(data), 7):
    r.append([data[i],
              ' '.join((data[i+1], data[i+2])),
              data[i+3],
              ' '.join((data[i+4], data[i+5])),
              data[i+6]])
print r

Cheers,
RB

Gerard Flanagan

2/29/2008 4:19:00 PM


On Feb 29, 4:09 pm, patrick.wa...@gmail.com wrote:
> Hi all,
>
> I have some data with some categories, titles, subtitles, and a link
> to their pdf and I need to join the title and the subtitle for every
> file and divide them into their separate groups.
>
> So the data comes in like this:
>
> data = ['RULES', 'title','subtitle','pdf',
> 'title1','subtitle1','pdf1','NOTICES','title2','subtitle2','pdf','title3','subtitle3','pdf']
>
> What I'd like to see is this:
>
> ['RULES', 'title subtitle','pdf', 'title1 subtitle1','pdf1'],
> ['NOTICES','title2 subtitle2','pdf','title3 subtitle3','pdf'], etc...
>

For any kind of data partitioning, you should always keep
`itertools.groupby` in mind as a possible solution:

[code]
import itertools as it

data = ['RULES', 'title','subtitle','pdf',
'title1','subtitle1','pdf1',
'NOTICES','title2','subtitle2','pdf',
'title3','subtitle3','pdf']

def partition(s):
    return s == s.upper()

#first method
newdata = []

for k, g in it.groupby(data, partition):
    if k:
        newdata.append(list(g))
    else:
        newdata[-1].extend(list(g))

for item in newdata:
    print item

print

#second method
keys = []
vals = []

for k, g in it.groupby(data, partition):
    if k:
        keys.append(list(g)[0])
    else:
        vals.append(list(g))

newdata = dict(zip(keys, vals))

print newdata

[/code]

[output]

['RULES', 'title', 'subtitle', 'pdf', 'title1', 'subtitle1', 'pdf1']
['NOTICES', 'title2', 'subtitle2', 'pdf', 'title3', 'subtitle3', 'pdf']

{'RULES': ['title', 'subtitle', 'pdf', 'title1', 'subtitle1', 'pdf1'],
 'NOTICES': ['title2', 'subtitle2', 'pdf', 'title3', 'subtitle3', 'pdf']}

[/output]
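The same groupby idea can be sketched in Python 3 to also produce the joined 'title subtitle' form the original post asked for, by pairing each non-category group's items in threes (a sketch; it assumes every group's length is a multiple of three):

```python
import itertools as it

data = ['RULES', 'title', 'subtitle', 'pdf',
        'title1', 'subtitle1', 'pdf1',
        'NOTICES', 'title2', 'subtitle2', 'pdf',
        'title3', 'subtitle3', 'pdf']

newdata = []
for is_cat, group in it.groupby(data, str.isupper):
    if is_cat:
        newdata.append(list(group))
    else:
        g = list(group)
        # walk the group three at a time: title, subtitle, pdf
        for title, subtitle, pdf in zip(g[0::3], g[1::3], g[2::3]):
            newdata[-1].extend([title + ' ' + subtitle, pdf])

print(newdata)
```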

HTH

Gerard



patrick.waldo

2/29/2008 4:22:00 PM


I tried to make a simple abstraction of my problem, but it's probably
better to get down to it. As for the funkiness of the data: I'm
relatively new to Python, so either I'm not processing it well or it's
an artifact of BeautifulSoup.

Basically, I'm using BeautifulSoup to strip the tables from the
Federal Register (http://www.access.gpo.gov/su_doc...cont.html). So
far my code strips the html and gets only the departments I'd like to
see. Now I need to put it into an Excel file (with pyExcelerator) with
the name of the record and the pdf. A snippet of the data from
BeautifulSoup looks like this:

['Environmental Protection Agency', 'RULES',
 'Approval and Promulgation of Air Quality Implementation Plans:',
 'Illinois; Revisions to Emission Reduction Market System, ',
 '11042 [E8-3800]', 'E8-3800.pdf',
 'Ohio; Oxides of Nitrogen Budget Trading Program; Correction, ',
 '11192 [Z8-2506]', 'Z8-2506.pdf',
 'NOTICES',
 'Agency Information Collection Activities; Proposals, Submissions, and Approvals, ',
 '11108-11110 [E8-3934]', 'E8-3934.pdf',
 'Data Availability for Lead National Ambient Air Quality Standard Review, ',
 '11110-11111 [E8-3935]', 'E8-3935.pdf',
 'Environmental Impacts Statements; Notice of Availability, ',
 '11112 [E8-3917]', 'E8-3917.pdf']

What I'd like to see in Excel is this:
'Approval and Promulgation of Air Quality Implementation Plans: Illinois; Revisions to Emission Reduction Market System, 11042 [E8-3800]' | 'E8-3800.pdf' | RULES
'Ohio; Oxides of Nitrogen Budget Trading Program; Correction, 11192 [Z8-2506]' | 'Z8-2506.pdf' | RULES
'Agency Information Collection Activities; Proposals, Submissions, and Approvals, 11108-11110 [E8-3934]' | 'E8-3934.pdf' | NOTICES
'Data Availability for Lead National Ambient Air Quality Standard Review, 11110-11111 [E8-3935]' | 'E8-3935.pdf' | NOTICES
'Environmental Impacts Statements; Notice of Availability, 11112 [E8-3917]' | 'E8-3917.pdf' | NOTICES
etc...for every department I want.

Now that I look at it I've got another problem because 'Approval and
Promulgation of Air Quality Implementation Plans:' should be joined to
both Illinois and Ohio...I love finding these little inconsistencies!
Once I get the data organized with all the titles joined together
appropriately, outputting it to Excel should be relatively easy.

So my problem is how to join these titles together. There are a
couple of patterns: every law is followed by a number, which is always
followed by the pdf.

Any ideas would be much appreciated.

My code so far (excuse the ugliness):

import urllib
import re, codecs, os
import pyExcelerator
from pyExcelerator import *
from BeautifulSoup import BeautifulSoup as BS

#Get the url, make the soup, and get the table to be processed
url = "http://www.access.gpo.gov/su_doc...cont.html"
site = urllib.urlopen(url)
soup = BS(site)
body = soup('table')[1]
tds = body.findAll('td')
mess = []
for td in tds:
    mess.append(str(td))
spacer = re.compile(r'<td colspan="4" height="10">.*')
data = []
x = 0
for n, t in enumerate(mess):
    if spacer.match(t):
        data.append(mess[x:n])
        x = n

dept = re.compile(r'<td colspan="4">.*')
title = re.compile(r'<td colspan="3">.*')
title2 = re.compile(r'<td colspan="2".*')
link = re.compile(r'<td align="right">.*')
none = re.compile(r'None')

#Strip the html and organize by department
group = []
db_list = []
for d in data:
    pre_list = []
    for item in d:
        if dept.match(item):
            dept_soup = BS(item)
            try:
                dept_contents = dept_soup('a')[0]['name']
                pre_list.append(str(dept_contents))
            except IndexError:
                break
        elif title.match(item) or title2.match(item):
            title_soup = BS(item)
            title_contents = title_soup.td.string
            if none.match(str(title_contents)):
                pre_list.append(str(title_soup('a')[0]['href']))
            else:
                pre_list.append(str(title_contents))
        elif link.match(item):
            link_soup = BS(item)
            link_contents = link_soup('a')[1]['href']
            pre_list.append(str(link_contents))
    db_list.append(pre_list)
for db in db_list:
    for n, dash_space in enumerate(db):
        dash_space = dash_space.replace('&#8211;', '-')
        dash_space = dash_space.replace('&nbsp;', ' ')
        db[n] = dash_space
download = re.compile(r'http://.*')
for db in db_list:
    for n, pdf in enumerate(db):
        if download.match(pdf):
            filename = re.split('http://.*/', pdf)
            db[n] = filename[1]
#Strip out these departments
AgrDep = re.compile(r'Agriculture Department')
EPA = re.compile(r'Environmental Protection Agency')
FDA = re.compile(r'Food and Drug Administration')
key_data = []
for lst in db_list:
    for db in lst:
        if AgrDep.match(db) or EPA.match(db) or FDA.match(db):
            key_data.append(lst)
#Get appropriate links from covered departments as well
LINK = re.compile(r'^#.*')
links = []
for kd in key_data:
    for item in kd:
        if LINK.match(item):
            links.append(item[1:])
for lst in db_list:
    for db in lst:
        if db in links:
            key_data.append(lst)

print key_data[1]
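Since every record ends with its pdf filename, here is one possible sketch of the joining step (Python 3; the function name, the trimmed sample, and the trailing-colon rule for shared titles are guesses at the pattern described above, not tested against the full Federal Register data):

```python
def group_records(items):
    """Split a flat category list into (text, pdf, category) rows.

    Assumptions: categories are all-uppercase words, every record ends
    with a filename ending in '.pdf', and a title ending in ':' is a
    shared prefix for the records that follow it.
    """
    rows = []
    category = None
    shared = ''      # a trailing-colon title shared by following entries
    buf = []
    for raw in items:
        item = raw.strip()
        if item.endswith('.pdf'):                # a pdf name closes a record
            rows.append((shared + ' '.join(buf), item, category))
            buf = []
        elif item.isalpha() and item.isupper():  # 'RULES', 'NOTICES', ...
            category, shared, buf = item, '', []
        elif item.endswith(':'):                 # shared prefix title
            shared = item + ' '
            buf = []
        else:
            buf.append(item)
    return rows

sample = ['RULES',
          'Approval and Promulgation of Air Quality Implementation Plans:',
          'Illinois; Revisions to Emission Reduction Market System, ',
          '11042 [E8-3800]', 'E8-3800.pdf',
          'Ohio; Oxides of Nitrogen Budget Trading Program; Correction, ',
          '11192 [Z8-2506]', 'Z8-2506.pdf',
          'NOTICES',
          'Agency Information Collection Activities; Proposals, Submissions, and Approvals, ',
          '11108-11110 [E8-3934]', 'E8-3934.pdf']

rows = group_records(sample)
for row in rows:
    print(row)
```

Note that the shared 'Approval and Promulgation...' title carries over to both the Illinois and Ohio rows, which is the inconsistency described above.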

I V

2/29/2008 6:56:00 PM


On Fri, 29 Feb 2008 08:18:54 -0800, baku wrote:
> return s == s.upper()

A couple of people in this thread have used this to test for an
upper-case string. Is there a reason to prefer it to s.isupper()?

Tim Chase

2/29/2008 7:09:00 PM


I V wrote:
> On Fri, 29 Feb 2008 08:18:54 -0800, baku wrote:
>> return s == s.upper()
>
> A couple of people in this thread have used this to test for an upper
> case string. Is there a reason to prefer it to s.isupper() ?

For my part? Forgetfulness brought on by underuse of .isupper().

-tkc

Peter Otten

2/29/2008 7:18:00 PM


I V wrote:

> On Fri, 29 Feb 2008 08:18:54 -0800, baku wrote:
>> return s == s.upper()
>
> A couple of people in this thread have used this to test for an upper
> case string. Is there a reason to prefer it to s.isupper() ?

Note that these tests are not equivalent:

>>> s = "123"
>>> s.isupper(), s.upper() == s
(False, True)

Peter

Steve Holden

2/29/2008 7:52:00 PM


I V wrote:
> On Fri, 29 Feb 2008 08:18:54 -0800, baku wrote:
>> return s == s.upper()
>
> A couple of people in this thread have used this to test for an upper
> case string. Is there a reason to prefer it to s.isupper() ?

In my case you can put it down to ignorance or forgetfulness, depending
on how forgiving you feel.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.hold...