Asp Forum - Re: joining rows - comp.lang.python

Tim Chase

12/29/2007 3:23:00 PM

> A 1
> A 2
> A 3
> B 1
> C 2
> D 3
> D 4
> The result should be
>
> A 1|2|3
> B 1
> C 2
> D 3|4
>
> What should I do to get my results

Well, it depends on whether the resulting order matters. If not,
you can use KM's suggestion of a dictionary:

results = {}
for line in file('in.txt'):
    k,v = line.rstrip('\n').split('\t')
    results.setdefault(k, []).append(v)
for k,v in results.iteritems():
    print k, '|'.join(v)

If, however, order matters, you have to do it in a slightly
buffered manner. It makes a difference when your input looks like

A 1
B 2
A 3

which should yield its own input, rather than "A 1|3". In this case

last_key = ''
values = []
for line in file('in.txt'):
    k,v = line.rstrip('\n').split('\t')
    if last_key != k:
        if last_key:
            print last_key, '|'.join(values)
        last_key = k
        values = [v]
    else:
        values.append(v)
if last_key:
    print last_key, '|'.join(values)

should do the job. Which, if you like, can be reduced to a sed
one-liner

sed ':a;N;s/^\([^\t]\+\)\(.*\)\n\1\t\+\(.*\)/\1\2|\3/;ta;P;D'

...if it doesn't make your modem hang up on you, or if line-noise
assembly-language is your thing ;)

-tkc
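
For anyone reading this on Python 3: the "consecutive runs only" loop
can probably also be written with itertools.groupby, which groups
adjacent items sharing a key and so leaves disjoint runs separate. This
is only a sketch, not part of the post above. The function name
join_consecutive and its parameters are made up for illustration, and
it assumes every non-blank line holds exactly a key and a value
separated by a tab.

```python
from itertools import groupby


def join_consecutive(lines, sep="\t", joiner="|"):
    """Yield one 'key v1|v2|...' string per run of adjacent lines sharing a key.

    Hypothetical helper: assumes each non-blank line is 'key<sep>value'.
    """
    # Lazily split each non-blank line into [key, value] at the first separator.
    pairs = (line.rstrip("\n").split(sep, 1) for line in lines if line.strip())
    # groupby only merges *adjacent* items with equal keys, so A, B, A stays
    # as three separate runs.
    for key, run in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key} {joiner.join(value for _, value in run)}"


if __name__ == "__main__":
    sample = ["A\t1\n", "A\t2\n", "A\t3\n", "B\t1\n", "C\t2\n", "D\t3\n", "D\t4\n"]
    for row in join_consecutive(sample):
        print(row)
```

Because it is a generator over an iterable of lines, passing an open
file object should keep only the current run in memory, much like the
hand-written loop.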

3 Answers

Istvan Albert

12/29/2007 6:04:00 PM

On Dec 29, 10:22 am, Tim Chase <python.l...@tim.thechases.com> wrote:

> If, however, order matters, you have to do it in a slightly
> buffered manner.

> Can be reduced to a sed one-liner

I think the original version works just as well for both cases. Your
sed version however does need the order you mention. Makes it no less
mind-bending though ... once I saw it I knew I had to try it :-)

i.

Istvan Albert

12/29/2007 6:14:00 PM

on a second read ... I see that you mean the case that should only
join consecutive lines with the same key

Tim Chase

12/29/2007 6:57:00 PM

> on a second read ... I see that you mean the case that should only
> join consecutive lines with the same key

Yes...there are actually three cases that occur to me:

1) don't care about order, but want one row for each key (1st value)

2) do care about order, and don't want disjoint runs of duplicate
keys to be smashed together

3) do care about order, and do want disjoint runs to be smashed
together (presumably outputting in the key-order as they were
encountered in the file...if not, you'd have to clarify)

My original post addresses #1 and #2, but not #3. Some tweaks to
my solution for #1 should address #3:

results = {}
order = []
for line in file('in.txt'):
    k,v = line.rstrip('\n').split('\t')
    if k not in results:
        order.append(k)
    results.setdefault(k, []).append(v)
for k in order:
    print k, '|'.join(results[k])

#2 does have the advantage that it can process large (multi-gig)
streams of data without bogging down: like the sed version, it
processes only a window at a time and retains only the data for
the current run of consecutively matching lines.

-tkc
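
A side note for Python 3.7 and later, where plain dicts are guaranteed
to preserve insertion order: case #3 should no longer need the separate
"order" list. A minimal sketch under that assumption follows. The name
join_by_first_seen and its parameters are hypothetical, and the input
is again assumed to be tab-separated key/value lines.

```python
def join_by_first_seen(lines, sep="\t", joiner="|"):
    """Merge all values per key, reporting keys in first-seen order.

    Hypothetical helper: relies on dict insertion order (Python 3.7+).
    """
    results = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key, value = line.rstrip("\n").split(sep, 1)
        # setdefault inserts the key the first time it is seen, which also
        # fixes its position in the dict's iteration order.
        results.setdefault(key, []).append(value)
    return [f"{key} {joiner.join(values)}" for key, values in results.items()]


if __name__ == "__main__":
    for row in join_by_first_seen(["A\t1\n", "B\t2\n", "A\t3\n"]):
        print(row)
```

Unlike the consecutive-run approach, this holds every key and value in
memory until the input is exhausted, so it trades the streaming
property for the merged output.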
