Asp Forum - any chance regular expressions are cached?

Mark Harrison

3/10/2008 12:43:00 AM

I've got a bit of code in a function like this:

s=re.sub(r'\n','\n'+spaces,s)
s=re.sub(r'^',spaces,s)
s=re.sub(r' *\n','\n',s)
s=re.sub(r' *$','',s)
s=re.sub(r'\n*$','',s)

Is there any chance that these will be cached somewhere, and save
me the trouble of having to declare some global re's if I don't
want to have them recompiled on each function invocation?

Many TIA!
Mark

--
Mark Harrison
Pixar Animation Studios

8 Answers

Tim Chase

3/10/2008 12:53:00 AM

> s=re.sub(r'\n','\n'+spaces,s)
> s=re.sub(r'^',spaces,s)
> s=re.sub(r' *\n','\n',s)
> s=re.sub(r' *$','',s)
> s=re.sub(r'\n*$','',s)
>
> Is there any chance that these will be cached somewhere, and save
> me the trouble of having to declare some global re's if I don't
> want to have them recompiled on each function invocation?

>>> import this
...
Explicit is better than implicit.
...

Sounds like what you want is to use the compile() call to compile
once, and then use the resulting objects:

re1 = re.compile(r'\n')
re2 = re.compile(r'^')
...
s = re1.sub('\n' + spaces, s)
s = re2.sub(spaces, s)
...

The compile() should be done once (outside loops, possibly at a
module level, as, in a way, they're constants) and then you can
use the resulting object without the overhead of compiling.

-tkc
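
A minimal sketch of the compile-once approach described above, applied to all five substitutions from the question. The constant names and the `indent_block` function are illustrative assumptions, not names from the thread.

```python
import re

# Compiled once, at import time. The names below are hypothetical.
NEWLINE_RE = re.compile(r'\n')
START_RE = re.compile(r'^')
SPACES_BEFORE_NEWLINE_RE = re.compile(r' *\n')
TRAILING_SPACES_RE = re.compile(r' *$')
TRAILING_NEWLINES_RE = re.compile(r'\n*$')


def indent_block(s, spaces):
    """Apply the five substitutions from the question, in the same order."""
    s = NEWLINE_RE.sub('\n' + spaces, s)        # indent every line after the first
    s = START_RE.sub(spaces, s)                 # indent the first line
    s = SPACES_BEFORE_NEWLINE_RE.sub('\n', s)   # drop spaces before each newline
    s = TRAILING_SPACES_RE.sub('', s)           # drop spaces at the end
    s = TRAILING_NEWLINES_RE.sub('', s)         # drop newlines at the end
    return s


print(repr(indent_block('foo\nbar \nzot\n', ' ')))
```

Keeping the patterns at module level means the function body only calls methods on existing pattern objects, so nothing is looked up or compiled per call.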

Ryan Ginstrom

3/10/2008 1:12:00 AM

> On Behalf Of Tim Chase
> Sounds like what you want is to use the compile() call to
> compile once, and then use the resulting objects:
>
> re1 = re.compile(r'\n')
> re2 = re.compile(r'^')
> ...
> s = re1.sub('\n' + spaces, s)
> s = re2.sub(spaces, s)

Yes. And I would go a step further and suggest that regular expressions are
best avoided in favor of simpler things when possible. That will make the
code easier to debug, and probably faster.

A couple of examples:
>>> text = """spam spam spam
spam spam

spam

spam"""
>>> # normalize newlines
>>> print "\n".join([line for line in text.splitlines() if line])
spam spam spam
spam spam
spam
spam
>>> # normalize whitespace
>>> print " ".join(text.split())
spam spam spam spam spam spam spam
>>> # strip leading/trailing space
>>> text = " spam "
>>> print text.lstrip()
spam
>>> print text.rstrip()
spam
>>> print text.strip()
spam

Regards,
Ryan Ginstrom

Terry Reedy

3/10/2008 2:14:00 AM

<mh@pixar.com> wrote in message
news:bu%Aj.5528$fX7.893@nlpi061.nbdc.sbc.com...
| I've got a bit of code in a function like this:
|
| s=re.sub(r'\n','\n'+spaces,s)
| s=re.sub(r'^',spaces,s)
| s=re.sub(r' *\n','\n',s)
| s=re.sub(r' *$','',s)
| s=re.sub(r'\n*$','',s)
|
| Is there any chance that these will be cached somewhere, and save
| me the trouble of having to declare some global re's if I don't
| want to have them recompiled on each function invocation?

The last time I looked, several versions ago, re did cache.
Don't know if still true. Not part of spec, I don't think.

tjr
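
One way to check whether the cache is still there is to compare object identity, as in this small sketch. It relies on CPython behaviour rather than a documented guarantee: the only documented hook is `re.purge()`, which clears the cache. The variable names are illustrative.

```python
import re

# Compiling the same pattern twice should hit the module's internal cache,
# so both names are expected to refer to the same pattern object (CPython).
first = re.compile(r' *\n')
second = re.compile(r' *\n')
same_object = first is second

# After purging the cache, the pattern has to be compiled again,
# which produces a new object distinct from the one still held in `first`.
re.purge()
third = re.compile(r' *\n')
recompiled = third is not first

print(same_object, recompiled)
```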

Steven D'Aprano

3/10/2008 3:39:00 AM

On Mon, 10 Mar 2008 00:42:47 +0000, mh wrote:

> I've got a bit of code in a function like this:
>
> s=re.sub(r'\n','\n'+spaces,s)
> s=re.sub(r'^',spaces,s)
> s=re.sub(r' *\n','\n',s)
> s=re.sub(r' *$','',s)
> s=re.sub(r'\n*$','',s)
>
> Is there any chance that these will be cached somewhere, and save me the
> trouble of having to declare some global re's if I don't want to have
> them recompiled on each function invocation?

At the interactive interpreter, type "help(re)" [enter]. A page or two
down, you will see:

purge()
    Clear the regular expression cache

and looking at the source code I see many calls to _compile() which
starts off with:

def _compile(*key):
    # internal: compile pattern
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)
    if p is not None:
        return p

So yes, the re module caches its regular expressions.

Having said that, at least four out of the five examples you give are
good examples of when you SHOULDN'T use regexes.

re.sub(r'\n','\n'+spaces,s)

is better written as s.replace('\n', '\n'+spaces). Don't believe me?
Check this out:

>>> s = 'hello\nworld'
>>> spaces = " "
>>> from timeit import Timer
>>> Timer("re.sub('\\n', '\\n'+spaces, s)",
... "import re;from __main__ import s, spaces").timeit()
7.4031901359558105
>>> Timer("s.replace('\\n', '\\n'+spaces)",
... "import re;from __main__ import s, spaces").timeit()
1.6208670139312744

The regex is nearly five times slower than the simple string replacement.

Similarly:

re.sub(r'^',spaces,s)

is better written as spaces+s, which is nearly eleven times faster.

Also:

re.sub(r' *$','',s)
re.sub(r'\n*$','',s)

are just slow ways of writing s.rstrip(' ') and s.rstrip('\n').

--
Steven
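
The string-method equivalents suggested above can be checked side by side with a short sketch. The sample strings are assumptions chosen for illustration. The last part shows one case where the match is not exact: without `re.MULTILINE`, `$` also matches just before a single trailing newline, so the regex and `rstrip(' ')` can disagree on input that ends in a newline.

```python
import re

spaces = '  '
s = 'hello\nworld   '   # sample input, ends in spaces and has no trailing newline

# On this input the regex calls and the string methods agree.
assert re.sub(r'\n', '\n' + spaces, s) == s.replace('\n', '\n' + spaces)
assert re.sub(r'^', spaces, s) == spaces + s
assert re.sub(r' *$', '', s) == s.rstrip(' ')
assert re.sub(r'\n*$', '', 'hello\n\n') == 'hello\n\n'.rstrip('\n')

# Caveat: `$` matches before a final newline as well as at the very end,
# so the regex strips the spaces here while rstrip(' ') finds none to strip.
with_newline = 'abc  \n'
regex_result = re.sub(r' *$', '', with_newline)
rstrip_result = with_newline.rstrip(' ')
print(repr(regex_result), repr(rstrip_result))
```

In the original function this difference does not matter much, because the earlier substitutions leave the string ending in the indent spaces rather than a newline by the time `r' *$'` runs.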

John Machin

3/10/2008 4:43:00 AM

On Mar 10, 11:42 am, m...@pixar.com wrote:
> I've got a bit of code in a function like this:
>
> s=re.sub(r'\n','\n'+spaces,s)
> s=re.sub(r'^',spaces,s)
> s=re.sub(r' *\n','\n',s)
> s=re.sub(r' *$','',s)
> s=re.sub(r'\n*$','',s)
>
> Is there any chance that these will be cached somewhere, and save
> me the trouble of having to declare some global re's if I don't
> want to have them recompiled on each function invocation?
>

Yes they will be cached. But do yourself a favour and check out the
string methods.

E.g.
>>> import re
>>> def opfunc(s, spaces):
...     s=re.sub(r'\n','\n'+spaces,s)
...     s=re.sub(r'^',spaces,s)
...     s=re.sub(r' *\n','\n',s)
...     s=re.sub(r' *$','',s)
...     s=re.sub(r'\n*$','',s)
...     return s
...
>>> def myfunc(s, spaces):
...     return '\n'.join(spaces + x.rstrip() if x.rstrip() else '' for x in s.splitlines())
...
>>> t1 = 'foo\nbar\nzot\n'
>>> t2 = 'foo\nbar \nzot\n'
>>> t3 = 'foo\n\nzot\n'
>>> [opfunc(s, ' ') for s in (t1, t2, t3)]
[' foo\n bar\n zot', ' foo\n bar\n zot', ' foo\n\n zot']
>>> [myfunc(s, ' ') for s in (t1, t2, t3)]
[' foo\n bar\n zot', ' foo\n bar\n zot', ' foo\n\n zot']
>>>

Mark Harrison

3/10/2008 6:01:00 PM

John Machin <sjmachin@lexicon.net> wrote:
> Yes they will be cached.

great.

> But do yourself a favour and check out the
> string methods.

Nifty... thanks all!

--
Mark Harrison
Pixar Animation Studios

John Machin

3/10/2008 8:57:00 PM

On Mar 10, 3:42 pm, John Machin <sjmac...@lexicon.net> wrote rather
baroquely:
> >>> def myfunc(s, spaces):
> ...     return '\n'.join(spaces + x.rstrip() if x.rstrip() else '' for x in s.splitlines())

Better:
...     return '\n'.join((spaces + x).rstrip() for x in s.splitlines())
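
Written out as a complete function, the simplified version can be run against the three test strings used earlier in the thread. The name `indent_lines` is an illustrative assumption.

```python
def indent_lines(s, spaces):
    """Indent each line by `spaces`, then strip trailing whitespace per line.

    Prepending first and stripping second means a blank line becomes
    an empty string instead of a line holding only the indent.
    """
    return '\n'.join((spaces + line).rstrip() for line in s.splitlines())


samples = ['foo\nbar\nzot\n', 'foo\nbar \nzot\n', 'foo\n\nzot\n']
results = [indent_lines(t, ' ') for t in samples]
print(results)
```

Two small differences from the regex version are worth keeping in mind: `rstrip()` with no argument removes tabs and other whitespace as well as spaces, and `splitlines()` splits on more line boundaries than just `'\n'`.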

Arnaud Delobelle

3/10/2008 9:53:00 PM

On Mar 10, 3:39 am, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
[...]
> Having said that, at least four out of the five examples you give are
> good examples of when you SHOULDN'T use regexes.
>
> re.sub(r'\n','\n'+spaces,s)
>
> is better written as s.replace('\n', '\n'+spaces). Don't believe me?
> Check this out:
>
> >>> s = 'hello\nworld'
> >>> spaces = " "
> >>> from timeit import Timer
> >>> Timer("re.sub('\\n', '\\n'+spaces, s)",
> ... "import re;from __main__ import s, spaces").timeit()
> 7.4031901359558105
> >>> Timer("s.replace('\\n', '\\n'+spaces)",
> ... "import re;from __main__ import s, spaces").timeit()
> 1.6208670139312744
>
> The regex is nearly five times slower than the simple string replacement.

I agree that the second version is better, but most of the time in the
first one is spent compiling the regexp, so the comparison is not
really fair:

>>> s = 'hello\nworld'
>>> spaces = " "
>>> import re
>>> r = re.compile('\\n')
>>> from timeit import Timer
>>> Timer("r.sub('\\n'+spaces, s)", "from __main__ import r,spaces,s").timeit()
1.7726190090179443
>>> Timer("s.replace('\\n', '\\n'+spaces)", "from __main__ import s, spaces").timeit()
0.76739501953125
>>> Timer("re.sub('\\n', '\\n'+spaces, s)", "from __main__ import re, s, spaces").timeit()
4.3669700622558594
>>>

Regexps are still more than twice as slow.

--
Arnaud
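
The three-way comparison above can be reproduced on a current interpreter with a sketch along these lines. It assumes Python 3.5 or later for the `globals` argument of `timeit.timeit`; the labels and the iteration count are arbitrary choices, and the numbers will differ from those posted in 2008.

```python
import re
import timeit

s = 'hello\nworld'
spaces = ' '
pattern = re.compile(r'\n')

# Names made visible to the timed statements.
namespace = {'re': re, 's': s, 'spaces': spaces, 'pattern': pattern}

statements = {
    'str.replace': "s.replace('\\n', '\\n' + spaces)",
    'precompiled': "pattern.sub('\\n' + spaces, s)",
    're.sub': "re.sub('\\n', '\\n' + spaces, s)",
}

# Time each statement over the same number of iterations.
timings = {
    label: timeit.timeit(stmt, globals=namespace, number=100_000)
    for label, stmt in statements.items()
}

for label, seconds in timings.items():
    print(f'{label:12s} {seconds:.4f}s')
```

Since the module-level `re.sub` goes through the pattern cache on every call, any gap between the `re.sub` and `precompiled` rows is probably function-call and cache-lookup overhead rather than repeated compilation.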

comp.lang.python
