comp.lang.python

GC is very expensive: am I doing something wrong?

Weeble

3/19/2010 12:14:00 AM

I am loading a dictionary from a text file and constructing a trie
data structure in memory. However, it takes longer than I'm happy with
- about 12 seconds on my computer. I profiled it, came up with some
clever ideas to cut down on the work (such as by exploiting the fact
that the dictionary is sorted) and was only able to shave a small
fraction of the time off. However, then I tried calling gc.disable()
before loading the trie and it halved the running time! I was
surprised. Is that normal? I thought that the cost of garbage
collection would be in some way proportional to the amount of garbage
created, but I didn't think I was creating any: as far as I can tell
the only objects deallocated during the load are strings, which could
not be participating in cycles.

I have spent a lot of time in C#, where the standard advice is not to
mess about with the garbage collector because you'll probably just
make things worse. Does that advice apply in Python? Is it a bad idea
to call gc.disable() before loading the trie and then enable it again
afterwards? Does the fact that the garbage collector is taking so much
time indicate I'm doing something particularly bad here? Is there some
way to give the garbage collector a hint that the whole trie is going
to be long-lived and get it promoted straight to generation 2 rather
than scanning it over and over?
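
A minimal sketch of the disable/re-enable pattern in question, assuming
the trie module listed below as trie.py (the load_without_gc helper is
just for illustration, not my actual driver code). Note that the
collector only looks for reference cycles; ordinary reference counting
keeps freeing non-cyclic garbage while it is disabled. A final
gc.collect() is one way to push the surviving objects into the oldest
generation, where they are only examined during the infrequent full
collections:

import gc
import trie  # the module listed below as trie.py

def load_without_gc(path):
    t = trie.Trie()
    gc.disable()       # pause cyclic collection during the bulk load
    try:
        t.load(open(path))
    finally:
        gc.enable()    # re-enable even if loading raises
    gc.collect()       # full pass; survivors move to the oldest generation
    return t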

$ python
Python 2.6.4 (r264:75706, Dec 7 2009, 18:43:55)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
$ time python -c "import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"

real 0m12.523s
user 0m12.380s
sys 0m0.140s
$ time python -c "import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"

real 0m12.592s
user 0m12.480s
sys 0m0.110s
$ time python -c "import gc;gc.disable();import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"

real 0m6.176s
user 0m5.980s
sys 0m0.190s
$ time python -c "import gc;gc.disable();import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"

real 0m6.331s
user 0m5.530s
sys 0m0.170s


=== trie.py ===

class Trie(object):
    __slots__ = ("root", "active")
    def __init__(self):
        self.root = []
        self.active = False
    def insert(self, word):
        if len(word) == 0:
            self.active = True
        else:
            head = word[0]
            for ch, child in reversed(self.root):
                if ch == head:
                    child.insert(word[1:])
                    return
            child = Trie()
            self.root.append((head, child))
            child.insert(word[1:])
    def seek(self, word):
        if len(word) == 0:
            return self
        head = word[0]
        for ch, child in self.root:
            if ch == head:
                return child.seek(word[1:])
        return EMPTY_TRIE
    def load(self, file):
        for line in file:
            self.insert(line.strip().lower())
    def empty(self):
        return (not self.root) and not self.active
    def endings(self, prefix=""):
        if self.active:
            yield prefix
        for ch, child in self.root:
            for ending in child.endings(prefix + ch):
                yield ending

EMPTY_TRIE = Trie()

Terry Reedy

3/19/2010 12:34:00 AM

On 3/18/2010 8:13 PM, Weeble wrote:
> I thought that the cost of garbage
> collection would be in some way proportional to the amount of garbage
> created, but I didn't think I was creating any: as far as I can tell
> the only objects deallocated during the load are strings, which could
> not be participating in cycles.
>
> I have spent a lot of time in C#, where the standard advice is not to
> mess about with the garbage collector because you'll probably just
> make things worse. Does that advice apply in Python? Is it a bad idea
> to call gc.disable() before loading the trie and then enable it again
> afterwards?

I believe not. It is known that certain patterns of object creation and
destruction can lead to bad gc behavior. No one has discovered a setting
of the internal tuning parameters for which there are no bad patterns
and I suspect there are not any such. This does not negate Xavier's
suggestion that a code change might also solve your problem.
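
For reference, the tuning parameters mentioned above are the three
generation thresholds exposed by the gc module. A rough sketch of
inspecting and raising them (the new value is purely illustrative,
not a recommendation):

import gc

print gc.get_threshold()   # defaults to (700, 10, 10) in CPython

# threshold0: a generation-0 collection runs once allocations exceed
# deallocations by that many container objects; threshold1 and
# threshold2 control how often the older generations are collected.
gc.set_threshold(10000, 10, 10)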

tjr

Patrick Maupin

3/19/2010 4:44:00 AM

On Mar 18, 7:13 pm, Weeble <clockworksa...@gmail.com> wrote:
> I am loading a dictionary from a text file and constructing a trie
> data structure in memory. However, it takes longer than I'm happy with
> - about 12 seconds on my computer. I profiled it, came up with some
> clever ideas to cut down on the work (such as by exploiting the fact
> that the dictionary is sorted) and was only able to shave a small
> fraction of the time off. However, then I tried calling gc.disable()
> before loading the trie and it halved the running time! I was
> surprised. Is that normal? I thought that the cost of garbage
> collection would be in some way proportional to the amount of garbage
> created, but I didn't think I was creating any: as far as I can tell
> the only objects deallocated during the load are strings, which could
> not be participating in cycles.

Well, you are creating and destroying a lot of objects in the process,
so that will provoke the garbage collector. But you are also doing
reversing and searching, and that's slow. Does your application
really need to be able to keep things in order like this, or do you
just want to know if a word is in the dictionary? If you just want to
load up a dictionary and be able to see if words are in it, I would
use a dict instead of a list.
Even if you want to be able to print out the data in order, you might
consider using a dict instead of a list. The print operation could
use one sort at the end, instead of searching all the nodes on
insertion.
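
For the simple membership case that could look something like the
sketch below; it uses a set (same hash-table machinery as a dict) and
assumes the sowpods.txt word list from your post:

words = set(line.strip().lower() for line in open("sowpods.txt"))

print "zymurgy" in words         # membership test, no tree walk
for word in sorted(words)[:10]:  # one sort at the end if order matters
    print word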

You could also use a character that doesn't appear in your data as a
sentinel, and then you don't need a separate active indicator -- every
leaf node will be empty and be referred to by the sentinel character.

You are also doing a lot of recursive operations that could be done
without recursing.
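
For example, the recursive seek could be written as a loop; this
sketch assumes the root list of (char, child) pairs and the
module-level EMPTY_TRIE from your trie.py:

def seek_iterative(trie, word):
    node = trie
    for head in word:
        for ch, child in node.root:  # linear scan of this node's children
            if ch == head:
                node = child
                break
        else:                        # no child matched this character
            return EMPTY_TRIE
    return node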

Finally, do you really need to keep an additional object around for
each node in the tree?

I have modified your trie code to use a dict and a sentinel, while
keeping basically the same API. This may or may not be the best way
to do it, depending on the other code that uses this data structure.
It could probably be made faster still, for example by removing the
setdefault and by not re-building objects when you need them, and
even this implementation will load faster if you disable the gc. In
any case, this might give you some ideas about how to make your code
go faster.

Regards,
Pat


from collections import defaultdict

class TrieTop(object):
    sentinel = ' '  # Something not in the data

    def __init__(self, data=None):
        def defaultrecurse():
            return defaultdict(defaultrecurse)
        if data is None:
            data = defaultrecurse()
        self.data = data
    def insert(self, word):
        data = self.data
        for ch in word:
            data = data[ch]
        data[self.sentinel]
    def seek(self, word):
        data = self.data
        for ch in word:
            data = data.get(ch)
            if data is None:
                return EMPTY_TRIE
        return TrieTop(data)
    def load(self, file):
        for line in file:
            self.insert(line.strip().lower())
    def empty(self):
        return (not self.data)
    def endings(self, prefix=""):
        def recurse(data, prefix):
            if not data:
                yield prefix[:-1]
                return
            for ch, data in data.iteritems():
                for result in recurse(data, prefix + ch):
                    yield result
        return sorted(recurse(self.data, prefix))

EMPTY_TRIE = TrieTop()
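
A small usage sketch for the class above, again assuming the
sowpods.txt word list (the "qui" prefix is arbitrary):

if __name__ == "__main__":
    t = TrieTop()
    t.load(open("sowpods.txt"))
    node = t.seek("qui")
    print not node.empty()         # any word starting with "qui"?
    print node.endings("qui")[:5]  # a few completions, already sorted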

Lawrence D'Oliveiro

3/21/2010 11:37:00 PM

In message <mailman.963.1268958842.23598.python-list@python.org>, Terry
Reedy wrote:

> No one has discovered a setting
> of the internal tuning parameters for which there are no bad patterns
> and I suspect there are not any such. This does not negate Xavier's
> suggestion that a code change might also solve your problem.

Could it be that for implementing a structure like a trie as the OP is,
where a lot of CPU cycles can be spent manipulating the structure, a
high-level language like Python, Perl or Ruby just gets in the way?

My feeling would be, try to get the language to do as much of the work for
you as possible. If you can't do that, then you might be better off with a
lower-level language.

Stefan Behnel

3/22/2010 7:42:00 AM

Lawrence D'Oliveiro, 22.03.2010 00:36:
> Terry Reedy wrote:
>> No one has discovered a setting
>> of the internal tuning parameters for which there are no bad patterns
>> and I suspect there are not any such. This does not negate Xavier's
>> suggestion that a code change might also solve your problem.
>
> Could it be that for implementing a structure like a trie as the OP is,
> where a lot of CPU cycles can be spent manipulating the structure, a
> high-level language like Python, Perl or Ruby just gets in the way?

I would rather say that the specific problem of the trie data structure is
that it has extremely little benefit over other available data structures.
There may still be a couple of niches where it makes sense to consider it
as an alternative, but given that dicts are so heavily optimised in Python,
it'll be hard for tries to compete even when written in a low-level language.

Remember that the original use case was to load a dictionary from a text
file. For this use case, a trie can be very wasteful in terms of memory and
rather CPU cache unfriendly on traversal, whereas hash values are a) rather
fast to calculate for a string, and b) often just calculated once and then
kept alive in the string object for later reuse.


> My feeling would be, try to get the language to do as much of the work for
> you as possible. If you can't do that, then you might be better off with a
> lower-level language.

I agree with the first sentence, but I'd like to underline the word 'might'
in the second. As this newsgroup shows, very often it's enough to look for
a better algorithmic approach first.

Stefan

tan

3/22/2010 11:40:00 PM

In article <mailman.1060.1269243742.23598.python-list@python.org>, Stefan Behnel <stefan_ml@behnel.de> wrote:
>Lawrence D'Oliveiro, 22.03.2010 00:36:
>> Terry Reedy wrote:
>>> No one has discovered a setting
>>> of the internal tuning parameters for which there are no bad patterns
>>> and I suspect there are not any such. This does not negate Xavier's
>>> suggestion that a code change might also solve your problem.
>>
>> Could it be that for implementing a structure like a trie as the OP is,
>> where a lot of CPU cycles can be spent manipulating the structure, a
>> high-level language like Python, Perl or Ruby just gets in the way?
>
>I would rather say that the specific problem of the trie data structure is
>that it has extremely little benefit over other available data structures.

Not true.

>There may still be a couple of niches where it makes sense to consider it
>as an alternative, but given that dicts are so heavily optimised in Python,
>it'll be hard for tries to compete even when written in a low-level language.

It depends. If your data is not in nearly sorted order,
trees are some of the best mechanisms available.

>Remember that the original use case was to load a dictionary from a text
>file. For this use case, a trie can be very wasteful in terms of memory and
>rather CPU cache unfriendly on traversal, whereas hash values are a) rather
>fast to calculate for a string, and b) often just calculated once and then
>kept alive in the string object for later reuse.

You still have to walk the bucket in a hash map/table.
Performance may be orders of magnitude worse than for trees.

>> My feeling would be, try to get the language to do as much of the work for
>> you as possible. If you can't do that, then you might be better off with a
>> lower-level language.
>
>I agree with the first sentence, but I'd like to underline the word 'might'
>in the second. As this newsgroup shows, very often it's enough to look for
>a better algorithmic approach first.
>
>Stefan
>


Antoine Pitrou

3/23/2010 2:12:00 AM

On Mon, 22 Mar 2010 23:40:16 +0000, tan wrote:
>
>>Remember that the original use case was to load a dictionary from a text
>>file. For this use case, a trie can be very wasteful in terms of memory
>>and rather CPU cache unfriendly on traversal, whereas hash values are a)
>>rather fast to calculate for a string, and b) often just calculated once
>>and then kept alive in the string object for later reuse.
>
> You still have to walk the bucket in a hash map/table. Performance may
> be orders of magnitude worse than for trees.

"walk the bucket" shouldn't be a significant cost factor here, especially
if you are doing meaningful work with the traversed items. In the CPython
implementation the total hash table size is less than a constant times
the number of actual items. Moreover, it's a linear scan over an array
rather than having to dereference pointers as in a tree.

"Orders of magnitude worse", in any case, sounds very exaggerated.

(and, of course, as the OP said, there's the major difference that the
dict type is implemented in C, which makes constant factors an order of
magnitude smaller than for a Python trie implementation)

Paul Rubin

3/23/2010 5:06:00 AM

Antoine Pitrou <solipsis@pitrou.net> writes:
> "Orders of magnitude worse", in any case, sounds very exaggerated.

The worst case can lose orders of magnitude if a lot of values hash
to the same bucket.

Steven D'Aprano

3/23/2010 6:32:00 AM

On Mon, 22 Mar 2010 22:05:40 -0700, Paul Rubin wrote:

> Antoine Pitrou <solipsis@pitrou.net> writes:
>> "Orders of magnitude worse", in any case, sounds very exaggerated.
>
> The worst case can lose orders of magnitude if a lot of values hash to
> the same bucket.


Well, perhaps one order of magnitude.


>>> for i in xrange(100):
...     n = 32*i+1
...     assert hash(2**n) == hash(2)
...
>>> d1 = dict.fromkeys(xrange(100))
>>> d2 = dict.fromkeys([2**(32*i+1) for i in xrange(100)])
>>>
>>> from timeit import Timer
>>> setup = "from __main__ import d1, d2"
>>> t1 = Timer("for k in d1.keys(): x = d1[k]", setup)
>>> t2 = Timer("for k in d2.keys(): x = d2[k]", setup)
>>>
>>> min(t1.repeat(number=1000, repeat=5))
0.026707887649536133
>>> min(t2.repeat(number=1000, repeat=5))
0.33103203773498535




--
Steven

Paul Rubin

3/23/2010 8:18:00 AM

Steven D'Aprano <steven@REMOVE.THIS.cybersource.com.au> writes:
> Well, perhaps one order of magnitude.
>>>> for i in xrange(100):
> ... n = 32*i+1
> ... assert hash(2**n) == hash(2)

Try with much more than 100 items (you might want to construct the
entries a little more intricately to avoid such big numbers). The point
is that access becomes O(N) instead of O(1). See:

http://www.cs.rice.edu/~scr...

for the consequences. http://cr.yp.to/cr... discusses the
issue a little more.

Peter Otten

3/23/2010 8:44:00 AM

Steven D'Aprano wrote:

> On Mon, 22 Mar 2010 22:05:40 -0700, Paul Rubin wrote:
>
>> Antoine Pitrou <solipsis@pitrou.net> writes:
>>> "Orders of magnitude worse", in any case, sounds very exaggerated.
>>
>> The worst case can lose orders of magnitude if a lot of values hash to
>> the same bucket.
>
>
> Well, perhaps one order of magnitude.
>
>
>>>> for i in xrange(100):
> ... n = 32*i+1
> ... assert hash(2**n) == hash(2)
> ...
>>>> d1 = dict.fromkeys(xrange(100))
>>>> d2 = dict.fromkeys([2**(32*i+1) for i in xrange(100)])
>>>>
>>>> from timeit import Timer
>>>> setup = "from __main__ import d1, d2"
>>>> t1 = Timer("for k in d1.keys(): x = d1[k]", setup)
>>>> t2 = Timer("for k in d2.keys(): x = d2[k]", setup)
>>>>
>>>> min(t1.repeat(number=1000, repeat=5))
> 0.026707887649536133
>>>> min(t2.repeat(number=1000, repeat=5))
> 0.33103203773498535

But the ratio grows with the number of collisions:

$ python extrapolate.py
10
0.00120401382446
0.00753307342529
ratio: 6.25663366337

100
0.00542402267456
0.316139936447
ratio: 58.2851428571

1000
0.00553417205811
3.36690688133
ratio: 608.384930209

$ cat extrapolate.py
from timeit import Timer

class Item(object):
    def __init__(self, value, hash=None):
        self.value = value
        self.hash = value if hash is None else hash
    def __eq__(self, other):
        return self.value == other.value
    def __hash__(self):
        return self.hash

setup = "from __main__ import d"
bench = "for k in d: x = d[k]"

for n, number in (10, 100), (100, 100), (1000, 10):
    print n
    d1 = dict.fromkeys(Item(i) for i in xrange(n))
    d2 = dict.fromkeys(Item(i, 0) for i in xrange(n))
    ab = []
    for d in d1, d2:
        t = Timer(bench, setup)
        ab.append(min(t.repeat(number=number, repeat=3)))
        print ab[-1]
    print "ratio:", ab[1]/ab[0]
    print


See also http://xkc...

Peter