comp.lang.python

Big time WTF with generators - bug?

James Stroud

2/13/2008 8:35:00 AM

Hello,

I'm boggled.

I have this function which takes a keyer that keys a table (iterable). I
filter based on these keys, then groupby based on the filtered keys and
a keyfunc. Then, to make the resulting generator behave a little nicer
(no requirement for user to unpack the keys), I strip the keys in a
generator expression that unpacks them and generates the k,g pairs I
want ("regrouped"). I then append each series generator to the growing
"serieses" list ("serieses" is plural for series if your vocabulary
isn't that big).

Here's the function:

def serialize(table, keyer=_keyer,
                     selector=_selector,
                     keyfunc=_keyfunc,
                     series_keyfunc=_series_keyfunc):
  keyed = izip(imap(keyer, table), table)
  filtered = ifilter(selector, keyed)
  serialized = groupby(filtered, series_keyfunc)
  serieses = []
  for s_name, series in serialized:
    grouped = groupby(series, keyfunc)
    regrouped = ((k, (v[1] for v in g)) for (k,g) in grouped)
    serieses.append((s_name, regrouped))
  for s in serieses:
    yield s


I defined a little debugging function called iterprint:

def iterprint(thing):
  if isinstance(thing, str):
    print thing
  elif hasattr(thing, 'items'):
    print thing.items()
  else:
    try:
      for x in thing:
        iterprint(x)
    except TypeError:
      print thing

The gist is that iterprint will print any generator down to its
non-iterable components--it works fine for my purpose here, but I
included the code for the curious.

When I apply iterprint in the following manner (only change is the
iterprint line) everything looks fine and my "regrouped" generators in
"serieses" generate what they are supposed to when iterprinting. The
iterprint at this point shows that everything is working just the way I
want (I can see the last item in "serieses" iterprints just fine).

def serialize(table, keyer=_keyer,
                     selector=_selector,
                     keyfunc=_keyfunc,
                     series_keyfunc=_series_keyfunc):
  keyed = izip(imap(keyer, table), table)
  filtered = ifilter(selector, keyed)
  serialized = groupby(filtered, series_keyfunc)
  serieses = []
  for s_name, series in serialized:
    grouped = groupby(series, keyfunc)
    regrouped = ((k, (v[1] for v in g)) for (k,g) in grouped)
    serieses.append((s_name, regrouped))
    iterprint(serieses)
  for s in serieses:
    yield s

Now, here's the rub. When I apply iterprint in the following manner, it
looks like my generator ("regrouped") gets consumed (note the only
change is a two space de-dent of the iterprint call--the printing is
outside the loop):

def serialize(table, keyer=_keyer,
                     selector=_selector,
                     keyfunc=_keyfunc,
                     series_keyfunc=_series_keyfunc):
  keyed = izip(imap(keyer, table), table)
  filtered = ifilter(selector, keyed)
  serialized = groupby(filtered, series_keyfunc)
  serieses = []
  for s_name, series in serialized:
    grouped = groupby(series, keyfunc)
    regrouped = ((k, (v[1] for v in g)) for (k,g) in grouped)
    serieses.append((s_name, regrouped))
  iterprint(serieses)
  for s in serieses:
    yield s

Now, what is consuming my "regrouped" generator when going from inside
the loop to outside?

Thanks in advance for any clue.

py> print version
2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)]

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.james...
16 Answers

Paul Rubin

2/13/2008 8:42:00 AM

James Stroud <jstroud@mbi.ucla.edu> writes:
> I defined a little debugging function called iterprint:
>
> def iterprint(thing): ...
>   for x in thing:
>     iterprint(x)

of course this mutates the thing that is being printed. Try using
itertools.tee to fork a copy of the iterator and print from that.
I didn't look at the rest of your code enough to spot any errors
but take note of the warnings in the groupby documentation about
pitfalls with using the results some number of times other than
exactly once.
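
For anyone following along, here is a minimal sketch of the tee idea. It is written for Python 3 rather than the 2.5 used in this thread, and the names `squares`, `peek`, and `keep` are made up for illustration. It only shows the flat case: tee buffers the items of one iterator, but it does not protect the sub-iterators that groupby hands out, since those still share a single underlying source.

```python
from itertools import tee

def squares(n):
    """A one-shot generator: once consumed, it is empty."""
    for i in range(n):
        yield i * i

# Fork the iterator: print from one copy, keep the other for real use.
peek, keep = tee(squares(4))

peeked = list(peek)   # the debugging pass consumes only `peek`
kept = list(keep)     # `keep` still yields every item

print(peeked)
print(kept)
```

After calling tee, the original iterator should not be used directly any more; only the two forks are safe to advance.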

James Stroud

2/13/2008 8:55:00 AM

Paul Rubin wrote:
> James Stroud <jstroud@mbi.ucla.edu> writes:
>> I defined a little debugging function called iterprint:
>>
>> def iterprint(thing): ...
>>   for x in thing:
>>     iterprint(x)
>
> of course this mutates the thing that is being printed. Try using
> itertools.tee to fork a copy of the iterator and print from that.
> I didn't look at the rest of your code enough to spot any errors
> but take note of the warnings in the groupby documentation about
> pitfalls with using the results some number of times other than
> exactly once.

Thank you for your answer, but I am aware of this caveat. Something is
consuming my generator *before* I iterprint it. Please give it another
look if you would be so kind.

James Stroud

2/13/2008 8:57:00 AM

Paul Rubin wrote:
> James Stroud <jstroud@mbi.ucla.edu> writes:
>> I defined a little debugging function called iterprint:
>>
>> def iterprint(thing): ...
>>   for x in thing:
>>     iterprint(x)
>
> of course this mutates the thing that is being printed. Try using
> itertools.tee to fork a copy of the iterator and print from that.
> I didn't look at the rest of your code enough to spot any errors
> but take note of the warnings in the groupby documentation about
> pitfalls with using the results some number of times other than
> exactly once.

I can see I didn't explain so well. This one must be a bug if my code
looks good to you. Here is a summary:

- If I iterprint inside the loop, iterprint looks correct.
- If I iterprint outside the loop, my generator gets consumed and I am
only left with the last item, so my iterprint prints only one item
outside the loop.

Conclusion: something consumes my generator going from inside the loop
to outside.

Please note that I am not talking about the yielded values, or the
for-loop that creates them. I left them there to show my intent with the
function. The iterprint function is there to show that the generator
gets consumed just moving from inside the loop to outside.

I know this one is easy to dismiss as my consuming the generator with
the iterprint, as this would be a common mistake.

James


Paul Rubin

2/13/2008 9:08:00 AM

James Stroud <jstroud@mbi.ucla.edu> writes:
> Thank you for your answer, but I am aware of this caveat. Something is
> consuming my generator *before* I iterprint it. Please give it another
> look if you would be so kind.

I'll see if I can look at it some more later, I'm in the middle of
something else right now. All I can say at the moment is that I've
encountered problems like this in my own code many times, and it's
always been a matter of having to carefully keep track of how the
nested iterators coming out of groupby are being consumed. I doubt
there is a library bug. Using groupby for things like this is
powerful, but unfortunately bug-prone because of how these mutable
iterators work. I suggest making some sample sequences and stepping
through with a debugger seeing just how the iterators advance.

Paul Rubin

2/13/2008 9:10:00 AM

James Stroud <jstroud@mbi.ucla.edu> writes:
> I can see I didn't explain so well. This one must be a bug if my code
> looks good to you.

I didn't spot any obvious errors, but I didn't look closely enough
to say that the code looked good or bad.

> Conclusion: something consumes my generator going from inside the loop
> to outside.

I'm not so sure of this, the thing is you're building these internal
grouper objects that don't expect to be consumed in the wrong order, etc.

Really, I'd try making a test iterator that prints something every
time you advance it, then step through your function with a debugger.
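
One possible shape for such a test iterator, sketched for Python 3 (the thread itself is on 2.5). The helper `traced` and the `log` list are hypothetical names; instead of printing, the wrapper records each item as it is pulled from the source, so the effect of advancing the outer groupby can be inspected afterwards.

```python
from itertools import groupby

def traced(iterable, log):
    """Yield items unchanged, recording each one as it is pulled."""
    for item in iterable:
        log.append(item)
        yield item

log = []
groups = groupby(traced(range(6), log), lambda x: x // 3)

key, group = next(groups)   # pulls item 0 to learn the first key
first = next(group)         # item 0 was already fetched, nothing new is pulled
after_first = list(log)

next(groups)                # advancing the outer iterator skips the rest of group 0
after_advance = list(log)

print(after_first)          # [0]
print(after_advance)        # [0, 1, 2, 3]
```

The second log shows that moving on to the next group silently reads items 1 and 2 from the source, which is why a group that was set aside has nothing left to give.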

James Stroud

2/13/2008 9:15:00 AM

Paul Rubin wrote:
> James Stroud <jstroud@mbi.ucla.edu> writes:
>> I can see I didn't explain so well. This one must be a bug if my code
>> looks good to you.
>
> I didn't spot any obvious errors, but I didn't look closely enough
> to say that the code looked good or bad.
>
>> Conclusion: something consumes my generator going from inside the loop
>> to outside.
>
> I'm not so sure of this, the thing is you're building these internal
> grouper objects that don't expect to be consumed in the wrong order, etc.
>
> Really, I'd try making a test iterator that prints something every
> time you advance it, then step through your function with a debugger.

Thank you for your suggestion. I replied twice to your first post before
you made your suggestion to step through with a debugger, so it looks
like I ignored it.

Thanks again.

James

Peter Otten

2/13/2008 9:32:00 AM

James Stroud wrote:

groupby() is "all you can eat", but "no doggy bag".

> def serialize(table, keyer=_keyer,
>                       selector=_selector,
>                       keyfunc=_keyfunc,
>                       series_keyfunc=_series_keyfunc):
>    keyed = izip(imap(keyer, table), table)
>    filtered = ifilter(selector, keyed)
>    serialized = groupby(filtered, series_keyfunc)
>    serieses = []
>    for s_name, series in serialized:
>      grouped = groupby(series, keyfunc)
>      regrouped = ((k, (v[1] for v in g)) for (k,g) in grouped)
>      serieses.append((s_name, regrouped))

You are trying to store a group for later consumption here.

>    for s in serieses:
>      yield s

That doesn't work:

>>> groups = [g for k, g in groupby(range(10), lambda x: x//3)]
>>> for g in groups:
...     print list(g)
...
[]
[]
[]
[9]

You cannot work around that because what invalidates a group is the call of
groups.next():

>>> groups = groupby(range(10), lambda x: x//3)
>>> g = groups.next()[1]
>>> g.next()
0
>>> groups.next()
(1, <itertools._grouper object at 0x2b3bd1f300f0>)
>>> g.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Perhaps Python should throw an out-of-band exception for an invalid group
instead of yielding bogus data.

Peter
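
If a group really has to be kept for later, one common way out is to copy it into a list before the outer groupby moves on. A minimal sketch, assuming Python 3 and a made-up `key` function; what the last stale group contains may differ between Python versions, so the sketch only looks at the first one.

```python
from itertools import groupby

def key(x):
    return x // 3

# Storing the group iterators themselves: they are stale by the time
# the outer groupby has been advanced past them.
stale = [g for k, g in groupby(range(10), key)]
stale_first = list(stale[0])

# Copying each group into a list before advancing keeps the data.
saved = [(k, list(g)) for k, g in groupby(range(10), key)]

print(stale_first)   # []
print(saved)
```

The price is laziness: every group is held in memory at once, which may or may not matter for the table at hand.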

Paul Rubin

2/13/2008 9:42:00 AM

Peter Otten <__peter__@web.de> writes:
> >    for s_name, series in serialized:
> >      grouped = groupby(series, keyfunc)
> >      regrouped = ((k, (v[1] for v in g)) for (k,g) in grouped)
> >      serieses.append((s_name, regrouped))
>
> You are trying to store a group for later consumption here.

Good catch, the solution is to turn that loop into a generator,
but then it has to be consumed very carefully. This stuff
maybe presses the limits of what one can do with Python iterators
while staying sane.

James Stroud

2/13/2008 9:48:00 AM

Peter Otten wrote:
> groupby() is "all you can eat", but "no doggy bag".

Thank you for your clear explanation--a satisfying conclusion to nine
hours of head scratching.

James

James Stroud

2/13/2008 10:15:00 AM

Paul Rubin wrote:
> Peter Otten <__peter__@web.de> writes:
>> You are trying to store a group for later consumption here.
>
> Good catch, the solution is to turn that loop into a generator,
> but then it has to be consumed very carefully.

Brilliant suggestion. Worked like a charm. Here is the final product:


def dekeyed(serialized, keyfunc):
  for name, series in serialized:
    grouped = groupby(series, keyfunc)
    regrouped = ((k, (v[1] for v in g)) for (k,g) in grouped)
    yield (name, regrouped)

def serialize(table, keyer=_keyer,
                     selector=_selector,
                     keyfunc=_keyfunc,
                     series_keyfunc=_series_keyfunc):
  keyed = izip(imap(keyer, table), table)
  filtered = ifilter(selector, keyed)
  serialized = groupby(filtered, series_keyfunc)
  return dekeyed(serialized, keyfunc)
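
A rough usage sketch of how the lazy version has to be consumed, ported to Python 3 (zip, map, and filter stand in for izip, imap, and ifilter). The key functions and the sample table are hypothetical stand-ins, since the real `_keyer`, `_selector`, `_keyfunc`, and `_series_keyfunc` were never posted.

```python
from itertools import groupby

# Hypothetical stand-ins for the thread's _keyer, _selector, _keyfunc
# and _series_keyfunc; the real definitions were never posted.
def keyer(row):
    return (row[0], row[1])            # (series name, group name)

def selector(keyed_row):
    return keyed_row[0][0] != "skip"   # drop rows of the "skip" series

def series_keyfunc(keyed_row):
    return keyed_row[0][0]

def keyfunc(keyed_row):
    return keyed_row[0][1]

def dekeyed(serialized, keyfunc):
    for name, series in serialized:
        grouped = groupby(series, keyfunc)
        regrouped = ((k, (v[1] for v in g)) for (k, g) in grouped)
        yield (name, regrouped)

def serialize(table):
    keyed = zip(map(keyer, table), table)
    filtered = filter(selector, keyed)
    serialized = groupby(filtered, series_keyfunc)
    return dekeyed(serialized, keyfunc)

table = [("a", "x", 1), ("a", "x", 2), ("a", "y", 3),
         ("skip", "x", 4), ("b", "x", 5)]

# Consume strictly in order: finish each group before asking for the next.
result = []
for name, groups in serialize(table):
    for k, rows in groups:
        result.append((name, k, list(rows)))

print(result)
```

Calling list() on the result of serialize before looping over it would bring the original problem back, because every series would be advanced past before any of its groups were read.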


Thank you!

James


