Asp Forum - dict.get and str.xsplit

Bearophile

2/26/2008 2:02:00 PM

When I use dictionaries I find the dict.get() method very handy, and
I'd like to use it often, but I think it's quite slow. This is a small
benchmark:

from random import randrange
from timeit import default_timer as clock

def main(N, M):
keys = list(set(randrange(2*N) for n in xrange(N)))
n2 = len(keys) / 2
keys1, keys2 = keys[:n2], keys[n2:]
adict = dict.fromkeys(keys1)

t = clock()
for _ in xrange(M):
for k in keys1:
r = adict.get(k, None)
for k in keys2:
r = adict.get(k, None)
print round(clock() - t, 2), "s"

t = clock()
for _ in xrange(M):
for k in keys1:
if k in adict:
r = adict[k]
else:
r = None
for k in keys2:
if k in adict:
r = adict[k]
else:
r = None
print round(clock() - t, 2), "s"

main(40000, 30)

On my old PC the output is:
2.36 s
1.59 s
(Your timings may be 4-5 times smaller, so you can tweak N and M to
have significant results).
And note that the if/then version is faster even if it calls the hash
function two times, while get() is supposed to do it once (in this
case the hashing function is very simple and fast. If the keys are
long strings the results of a similar benchmark may be different,
because the computing of the hashing function may become the slowest
part of that code).

This is a real difference, that has real impact on the programs I
write, so I often use the if/else approach, despite the dict.get()
method being semantically fitter and shorter.
So can the dict.get() method be speed up? And if not, why?

------------------

Another method I'd like to see, is the str.xsplit() I was talking
about time ago too. It's like str.split() but it's lazy. It avoids to
build a potentially huge list. In real D programs I have seen such
string function may speed up heavy-string-processing code (the kind of
code you may want to use Python/Perl/Ruby for) up to 2-3 times. Seeing
how such kind of processing is common in Python I think this isn't an
optimization to ignore.
(I am now getting better in writing C code, so in few months I may
even become able to submit a xsplit patch myself. I think it's not too
much complex to write, it can borrow lot of code from the str.split
method).

Bye, and thank you for your kind attention,
bearophile

10 Answers

Marc 'BlackJack' Rintsch

2/26/2008 2:15:00 PM

On Tue, 26 Feb 2008 06:02:12 -0800, bearophileHUGS wrote:

> This is a real difference, that has real impact on the programs I
> write, so I often use the if/else approach, despite the dict.get()
> method being semantically fitter and shorter.
> So can the dict.get() method be speed up? And if not, why?

I guess it's the method lookup that's the slow part. Factor it out of the
loop and measure again::

adict_get = adict.get
for _ in xrange(M):
for k in keys1:
r = adict_get(k, None)
for k in keys2:
r = adict_get(k, None)

Ciao,
Marc 'BlackJack' Rintsch

Aaron Brady

2/26/2008 2:33:00 PM

On Feb 26, 8:14 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> On Tue, 26 Feb 2008 06:02:12 -0800, bearophileHUGS wrote:
> > This is a real difference, that has real impact on the programs I
> > write, so I often use the if/else approach, despite the dict.get()
> > method being semantically fitter and shorter.
> > So can the dict.get() method be speed up? And if not, why?
>
> I guess it's the method lookup that's the slow part. Factor it out of the
> loop and measure again::
>
> adict_get = adict.get
> for _ in xrange(M):
> for k in keys1:
> r = adict_get(k, None)
> for k in keys2:
> r = adict_get(k, None)
>
> Ciao,
> Marc 'BlackJack' Rintsch

Can't be. The string 'get' is only hashed once, since it's hard-coded
into the script, and looking it up can't be any slower than looking up
__getitem__.

Steve Holden

2/26/2008 2:34:00 PM

Marc 'BlackJack' Rintsch wrote:
> On Tue, 26 Feb 2008 06:02:12 -0800, bearophileHUGS wrote:
>
>> This is a real difference, that has real impact on the programs I
>> write, so I often use the if/else approach, despite the dict.get()
>> method being semantically fitter and shorter.
>> So can the dict.get() method be speed up? And if not, why?
>
> I guess it's the method lookup that's the slow part. Factor it out of the
> loop and measure again::
>
> adict_get = adict.get
> for _ in xrange(M):
> for k in keys1:
> r = adict_get(k, None)
> for k in keys2:
> r = adict_get(k, None)

And think about using the timeit module, since it's been included among
the batteries for your convenience, and it removes some common pitfalls
with cross-platform timings.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.hold...

Marc 'BlackJack' Rintsch

2/26/2008 2:39:00 PM

On Tue, 26 Feb 2008 06:33:01 -0800, castironpi wrote:

> On Feb 26, 8:14Â am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
>> On Tue, 26 Feb 2008 06:02:12 -0800, bearophileHUGS wrote:
>> > This is a real difference, that has real impact on the programs I
>> > write, so I often use the if/else approach, despite the dict.get()
>> > method being semantically fitter and shorter.
>> > So can the dict.get() method be speed up? And if not, why?
>>
>> I guess it's the method lookup that's the slow part. Â Factor it out of the
>> loop and measure again::
>>
>> Â Â adict_get = adict.get
>> Â Â for _ in xrange(M):
>> Â Â Â Â for k in keys1:
>> Â Â Â Â Â Â r = adict_get(k, None)
>> Â Â Â Â for k in keys2:
>> Â Â Â Â Â Â r = adict_get(k, None)
>>
>> Ciao,
>> Â Â Â Â Marc 'BlackJack' Rintsch
>
> Can't be. The string 'get' is only hashed once, since it's hard-coded
> into the script, and looking it up can't be any slower than looking up
> __getitem__.

Within functions it is faster. In the original code the `get` attribute is
looked up on the `adict` object twice in each loop iteration via hashing.
In my code it is looked up once before the loop and within the loop the
local name `adict_get` isn't looked up but hardcoded as index into the
internal locals array of the function.

Ciao,
Marc 'BlackJack' Rintsch

Bearophile

2/26/2008 2:40:00 PM

Marc 'BlackJack' Rintsch:

> I guess it's the method lookup that's the slow part. Factor it out of the
> loop and measure again::

I did know of that optimization, but sometimes I "forget" about it...
The new timings:

Output timings, best of 3, unloaded CPU:
2.32 s with adict.get
1.56 s with "in" + if/then/else
1.78 s with adict_get

It's slower still, despite doing the lookup two times half of the
times (half keys are present in the test set). The "in" is an
operator, it's faster than a method call, but I don't understand the
other details. Now the difference between 1.78 and 1.56 is small
enough, so probably now it's not much noticeable in practice in real
programs, so my original argument is mostly dead now :-) In the code
points where speed matters instead of a single line with a get() I can
use two lines to create a local adict_get.

Bye and thank you,
bearophile

Aaron Brady

2/26/2008 2:48:00 PM

On Feb 26, 8:40 am, bearophileH...@lycos.com wrote:
> Marc 'BlackJack' Rintsch:
>
> > I guess it's the method lookup that's the slow part. Factor it out of the
> > loop and measure again::
>
> I did know of that optimization, but sometimes I "forget" about it...
> The new timings:
>
> Output timings, best of 3, unloaded CPU:
> 2.32 s with adict.get
> 1.56 s with "in" + if/then/else
> 1.78 s with adict_get
>
> It's slower still, despite doing the lookup two times half of the
> times (half keys are present in the test set). The "in" is an
> operator, it's faster than a method call, but I don't understand the
> other details. Now the difference between 1.78 and 1.56 is small
> enough, so probably now it's not much noticeable in practice in real
> programs, so my original argument is mostly dead now :-) In the code
> points where speed matters instead of a single line with a get() I can
> use two lines to create a local adict_get.
>
> Bye and thank you,
> bearophile

Where does adict_getitem fit in? And try: except: KeyError?

marek.rocki

2/26/2008 3:04:00 PM

bearophileH...@lycos.com napisal(a):
> It's slower still, despite doing the lookup two times half of the
> times (half keys are present in the test set). The "in" is an
> operator, it's faster than a method call, but I don't understand the
> other details. Now the difference between 1.78 and 1.56 is small
> enough, so probably now it's not much noticeable in practice in real
> programs, so my original argument is mostly dead now :-) In the code
> points where speed matters instead of a single line with a get() I can
> use two lines to create a local adict_get.

AFAIK, method/function call is generally slow in Python (similarly to
the dot attribute lookup). As for the original prooblem, why not use
defaultdict? I think it's the most idiomatic way to achieve what we
want. And also the fastest one, according to my quick-and-dirty tests:

from collections import defaultdict

def main(N, M):

# <snip>

addict = defaultdict(lambda: None, adict)
t = clock()
for _ in xrange(M):
for k in keys1:
r = addict[k]
for k in keys2:
r = addict[k]
print round(clock() - t, 2), "s"

adict.get: 3.12 s
adict_get: 2.24 s
in: 1.62 s
defaultdict: 1.05 s

Bearophile

2/26/2008 3:54:00 PM

marek.ro...@wp.pl:
> As for the original prooblem, why not use
> defaultdict? I think it's the most idiomatic way to achieve what we
> want. And also the fastest one, according to my quick-and-dirty tests:

It adds the new keys, I can't accept that:

>>> from collections import defaultdict as dd
>>> adict = {1:2, 3:4}
>>> addict = dd(lambda: None, adict)
>>> r = addict[10]
>>> addict
defaultdict(<function <lambda> at 0x008E6FB0>, {1: 2, 10: None, 3: 4})

Bye,
bearophile

marek.rocki

2/26/2008 4:46:00 PM

bearophileH...@lycos.com napisal(a):
> marek.ro...@wp.pl:
> > As for the original prooblem, why not use
> > defaultdict? I think it's the most idiomatic way to achieve what we
> > want. And also the fastest one, according to my quick-and-dirty tests:
>
> It adds the new keys, I can't accept that:

Right, my fault not noticing that. I experimented further with
__missing__ method and with "ask forgiveness" idiom (try... except
KeyError) and they were both noticeably slower. So checking with "in"
may be the most efficient way we have.

Regards,
Marek

Gabriel Genellina

2/27/2008 3:10:00 AM

En Tue, 26 Feb 2008 12:33:01 -0200, <castironpi@gmail.com> escribió:
> On Feb 26, 8:14 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:

>> I guess it's the method lookup that's the slow part. Factor it out of
>> the
>> loop and measure again::
>>
>> adict_get = adict.get
>> for _ in xrange(M):
>> for k in keys1:
>> r = adict_get(k, None)
>> for k in keys2:
>> r = adict_get(k, None)
>
> Can't be. The string 'get' is only hashed once, since it's hard-coded
> into the script, and looking it up can't be any slower than looking up
> __getitem__.

In the original code, the 'get' method is searched a lot of times (every
time it's called), not just once; Python can't assume that it remains the
same at every invocation.

And __getitem__ isn't searched by name; there is a predefined slot in the
type structure for it. In case the method were overriden in Python code,
the slot is replaced with a wrapper function that calls the Python code.

--
Gabriel Genellina

comp.lang.python

dict.get and str.xsplit

Bearophile

Marc 'BlackJack' Rintsch

Aaron Brady

Steve Holden

Marc 'BlackJack' Rintsch

Bearophile

Aaron Brady

marek.rocki

Bearophile

marek.rocki

Gabriel Genellina

x Login to ForumsZone