[lnkForumImage]
TotalShareware - Download Free Software

Confronta i prezzi di migliaia di prodotti.
Asp Forum
 Home | Login | Register | Search 


 

Forums >

comp.lang.ruby

Compiling Regexp only once

singsang

9/4/2006 12:11:00 PM

Dear all,

Writing some httpd logfile pre-processing (splitting it up, getting
already some basic numbers), I think that I should compile the Regexp
for the logfile entry only once.

So my guess is that I should have perhaps a class LogFormat that holds
this as a class variable or a class constant. Below I use a non-tested
regular expression that is not complete yet.

So the idea is to have:

class LogFormat
@@RegEx = Regexp.new( '(\S+) (\S+) (\S+) \[(\d+)/(\w+)/(\d+)
[+\-]\d+?\]' )
def LogFormat.regex
@@RegEx
end
end

If now from a class LogLine (instanciated for each line in the logfile)
I use something like

class LogLine
# ...
ip, rfc931, user, day, month, year, offset =
line.match(LogFormat.regex)
# ...
end

My question: How often is the Regexp compiled? When?
When the definition of LogFormat is read first?

Btw: If anybody has a ready-to-use regex for the common log format this
would be great, but I will get that done as far as I need by myself.
;-)
Other question: Does anybody know a "Webalizer" sort of thing written
in Ruby?

Thank you all so much
Herm (using a strange google account, I know)

15 Answers

Jano Svitok

9/4/2006 12:52:00 PM

0

On 9/4/06, singsang <tomsingsang@yahoo.com> wrote:
> Dear all,
>
> Writing some httpd logfile pre-processing (splitting it up, getting
> already some basic numbers), I think that I should compile the Regexp
> for the logfile entry only once.
>
> So my guess is that I should have perhaps a class LogFormat that holds
> this as a class variable or a class constant. Below I use a non-tested
> regular expression that is not complete yet.
>
> So the idea is to have:
>
> class LogFormat
> @@RegEx = Regexp.new( '(\S+) (\S+) (\S+) \[(\d+)/(\w+)/(\d+)
> [+\-]\d+?\]' )
> def LogFormat.regex
> @@RegEx
> end
> end
>
> If now from a class LogLine (instanciated for each line in the logfile)
> I use something like
>
> class LogLine
> # ...
> ip, rfc931, user, day, month, year, offset =
> line.match(LogFormat.regex)
> # ...
> end
>
> My question: How often is the Regexp compiled? When?
> When the definition of LogFormat is read first?
>
> Btw: If anybody has a ready-to-use regex for the common log format this
> would be great, but I will get that done as far as I need by myself.
> ;-)
> Other question: Does anybody know a "Webalizer" sort of thing written
> in Ruby?

Hi,

<non-authoritative answer follows ;-) >

- =~ seems to be (by an order) faster than String#match
- maybe you can make a constant REGEX from @@RegEx, eliminating the
need for #regex
- it seems that when the definition is constant (i.e. no #{xxx}), only
one object is created.

Robert Klemme

9/4/2006 1:26:00 PM

0

On 04.09.2006 14:11, singsang wrote:
> Dear all,
>
> Writing some httpd logfile pre-processing (splitting it up, getting
> already some basic numbers), I think that I should compile the Regexp
> for the logfile entry only once.
>
> So my guess is that I should have perhaps a class LogFormat that holds
> this as a class variable or a class constant. Below I use a non-tested
> regular expression that is not complete yet.
>
> So the idea is to have:
>
> class LogFormat
> @@RegEx = Regexp.new( '(\S+) (\S+) (\S+) \[(\d+)/(\w+)/(\d+)
> [+\-]\d+?\]' )
> def LogFormat.regex
> @@RegEx
> end
> end

You don't need a class variable for that. A simple class instance
variable or a constance is sufficient.

> If now from a class LogLine (instanciated for each line in the logfile)
> I use something like
>
> class LogLine
> # ...
> ip, rfc931, user, day, month, year, offset =
> line.match(LogFormat.regex)
> # ...
> end
>
> My question: How often is the Regexp compiled? When?
> When the definition of LogFormat is read first?

It's compiled every time the line "@@RegEx = ..." is evaluated - so most
likely only once.

Note though that it's usually faster to have a regexp in line. So in
your case you might have a method that does the line parsing (or
multiple line parsing) and that's where you can put the inline regexp
for max efficiency.

Kind regards

robert

Eric Hodel

9/4/2006 11:16:00 PM

0

On Sep 4, 2006, at 5:15 AM, singsang wrote:

> Writing some httpd logfile pre-processing (splitting it up, getting
> already some basic numbers), I think that I should compile the Regexp
> for the logfile entry only once.

You should benchmark, I expect you will find you made matching slower.

> So my guess is that I should have perhaps a class LogFormat that holds
> this as a class variable or a class constant. Below I use a non-tested
> regular expression that is not complete yet.
>
> class LogLine
> # ...
> ip, rfc931, user, day, month, year, offset =
> line.match(LogFormat.regex)
> # ...
> end

#match is the slowest way to use regular expressions.

[ruby-talk:204747]

--
Eric Hodel - drbrain@segment7.net - http://blog.se...
This implementation is HODEL-HASH-9600 compliant

http://trackmap.rob...



Rick DeNatale

9/4/2006 11:30:00 PM

0

On 9/4/06, Robert Klemme <shortcutter@googlemail.com> wrote:

> Note though that it's usually faster to have a regexp in line. So in
> your case you might have a method that does the line parsing (or
> multiple line parsing) and that's where you can put the inline regexp
> for max efficiency.

Robert,

I'm not sure what you mean by inline, do you mean a regexp literal like
/[ABC]/ vs. RegEx.new('/[ABC]/') if so that's not clear.

See my benchmarks on the thread about "Ordered contrast for String or
Array," where changing a literal /./ to a constant set to
RegEx.new('/./') caused a 200 fold performance increase.

Not that you always want to do that, benchmarking always beats rules of thumb.

--
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denh...

Eric Hodel

9/5/2006 12:11:00 AM

0

On Sep 4, 2006, at 4:29 PM, Rick DeNatale wrote:

> On 9/4/06, Robert Klemme <shortcutter@googlemail.com> wrote:
>
>> Note though that it's usually faster to have a regexp in line. So in
>> your case you might have a method that does the line parsing (or
>> multiple line parsing) and that's where you can put the inline regexp
>> for max efficiency.
>
> Robert,
>
> I'm not sure what you mean by inline, do you mean a regexp literal
> like
> /[ABC]/ vs. RegEx.new('/[ABC]/') if so that's not clear.
>
> See my benchmarks on the thread about "Ordered contrast for String or
> Array," where changing a literal /./ to a constant set to
> RegEx.new('/./') caused a 200 fold performance increase.

Those aren't the same regular expression. I expect that your
benchmark is flawed.

Looking at it:

$ ruby
str = "abcdefghijklmnopqrstuvwxyz" * 5
RE = Regexp.new('/./')
p str.scan RE
-:3: warning: parenthesize argument(s) for future version
p str.scan /./
-:4: warning: parenthesize argument(s) for future version
[]
["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "a",
"b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o",
"p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "a", "b", "c",
"d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q",
"r", "s", "t", "u", "v", "w", "x", "y", "z", "a", "b", "c", "d", "e",
"f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s",
"t", "u", "v", "w", "x", "y", "z", "a", "b", "c", "d", "e", "f", "g",
"h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u",
"v", "w", "x", "y", "z"]

Also, you forgot about using String#split (but it is slower than
unpack and scan).

$ ruby -e 'p Regexp.new("/./").source, /./.source'
"/./"
"."

The correct way to construct a Regexp from Regexp.new is to omit //
in the pattern when it is a String.

> Not that you always want to do that, benchmarking always beats
> rules of thumb.

Benchmarking and testing beat rules of thumb.

--
Eric Hodel - drbrain@segment7.net - http://blog.se...
This implementation is HODEL-HASH-9600 compliant

http://trackmap.rob...



Rick DeNatale

9/5/2006 1:06:00 AM

0

On 9/4/06, Eric Hodel <drbrain@segment7.net> wrote:
> On Sep 4, 2006, at 4:29 PM, Rick DeNatale wrote:
>
> > On 9/4/06, Robert Klemme <shortcutter@googlemail.com> wrote:
> >
> >> Note though that it's usually faster to have a regexp in line. So in
> >> your case you might have a method that does the line parsing (or
> >> multiple line parsing) and that's where you can put the inline regexp
> >> for max efficiency.
> >
> > Robert,
> >
> > I'm not sure what you mean by inline, do you mean a regexp literal
> > like
> > /[ABC]/ vs. RegEx.new('/[ABC]/') if so that's not clear.
> >
> > See my benchmarks on the thread about "Ordered contrast for String or
> > Array," where changing a literal /./ to a constant set to
> > RegEx.new('/./') caused a 200 fold performance increase.
>
> Those aren't the same regular expression. I expect that your
> benchmark is flawed.

Oops, right you are.

--
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denh...

Rick DeNatale

9/5/2006 3:07:00 AM

0

So, I've corrected the benchmark. I've also added another method
using String#split as Eric suggested.

Here's the code being benchmarked:
rick@frodo:~/rubyscripts$ cat stringsplit.rb
class String
To_chars_regex = Regexp.new('.')
To_chars_regex2 = Regexp.new(/./)
To_chars_regex3 = /./

def to_chars_array_with_unpack
unpack('a'*length)
end

def to_chars_array_with_split
split(//)
end

def to_chars_array_with_scan
scan /./
end

def to_chars_array_with_scan_precomp
scan To_chars_regex
end

def to_chars_array_with_scan_precomp2
scan To_chars_regex2
end

def to_chars_array_with_scan_precomp3
scan To_chars_regex3
end
end

Note that I made tested three different "pre-compiled" regex's one
with a Regex.new('.'), one with Regex.new(/./), and one a constant
with the literal regex /./.

Now here's the benchmark:
rick@frodo:~/rubyscripts$ cat benchstringsplit.rb
require 'benchmark'
include Benchmark
load 'stringsplit.rb'

iters = 100
bmbm do | x |
str = "abcdefghijklmnopqrstuvwxyz" * 5
5.times do
x.report("unpack #{str.length} character string") do
iters.times do
str.to_chars_array_with_unpack
end
end
str += str
end

str = "abcdefghijklmnopqrstuvwxyz" * 5
5.times do
x.report("scan-precomp #{str.length} character string") do
iters.times do
str.to_chars_array_with_scan_precomp
end
end
str += str
end

str = "abcdefghijklmnopqrstuvwxyz" * 5
5.times do
x.report("scan-precomp2 #{str.length} character string") do
iters.times do
str.to_chars_array_with_scan_precomp2
end
end
str += str
end

str = "abcdefghijklmnopqrstuvwxyz" * 5
5.times do
x.report("scan-precomp3 #{str.length} character string") do
iters.times do
str.to_chars_array_with_scan_precomp3
end
end
str += str
end

str = "abcdefghijklmnopqrstuvwxyz" * 5
5.times do
x.report("scan #{str.length} character string") do
iters.times do
str.to_chars_array_with_scan
end
end
str += str
end

str = "abcdefghijklmnopqrstuvwxyz" * 5
5.times do
x.report("split #{str.length} character string") do
iters.times do
str.to_chars_array_with_split
end
end
str += str
end
end

And here are the results:
rick@frodo:~/rubyscripts$ ruby benchstringsplit.rb
Rehearsal -----------------------------------------------------------------------
unpack 130 character string 0.950000 0.000000 0.950000 (
1.225834)
unpack 260 character string 0.920000 0.000000 0.920000 (
1.736296)
unpack 520 character string 0.950000 0.010000 0.960000 (
2.474714)
unpack 1040 character string 1.000000 0.000000 1.000000 (
2.142817)
unpack 2080 character string 0.940000 0.000000 0.940000 (
2.772912)
scan-precomp 130 character string 2.210000 0.000000 2.210000 (
5.032766)
scan-precomp 260 character string 2.190000 0.000000 2.190000 (
7.615967)
scan-precomp 520 character string 2.160000 0.010000 2.170000 (
5.815543)
scan-precomp 1040 character string 2.110000 0.000000 2.110000 (
5.582919)
scan-precomp 2080 character string 2.220000 0.000000 2.220000 (
4.290547)
scan-precomp2 130 character string 2.190000 0.020000 2.210000 (
4.990034)
scan-precomp2 260 character string 2.160000 0.020000 2.180000 (
4.715208)
scan-precomp2 520 character string 2.150000 0.040000 2.190000 (
3.891652)
scan-precomp2 1040 character string 2.180000 0.010000 2.190000 (
3.757304)
scan-precomp2 2080 character string 2.130000 0.010000 2.140000 (
3.308557)
scan-precomp3 130 character string 2.100000 0.000000 2.100000 (
2.134875)
scan-precomp3 260 character string 2.210000 0.040000 2.250000 (
2.793601)
scan-precomp3 520 character string 2.140000 0.000000 2.140000 (
2.504348)
scan-precomp3 1040 character string 2.090000 0.040000 2.130000 (
2.872274)
scan-precomp3 2080 character string 2.040000 0.000000 2.040000 (
2.113700)
scan 130 character string 2.240000 0.020000 2.260000 (
3.108414)
scan 260 character string 2.230000 0.010000 2.240000 (
3.767771)
scan 520 character string 2.170000 0.010000 2.180000 (
3.892586)
scan 1040 character string 2.280000 0.000000 2.280000 (
2.540717)
scan 2080 character string 2.190000 0.000000 2.190000 (
2.454891)
split 130 character string 2.350000 0.010000 2.360000 (
2.749489)
split 260 character string 2.330000 0.000000 2.330000 (
2.620174)
split 520 character string 2.370000 0.020000 2.390000 (
3.186766)
split 1040 character string 2.310000 0.040000 2.350000 (
2.848300)
split 2080 character string 2.460000 0.000000 2.460000 (
2.930465)
------------------------------------------------------------- total:
60.280000sec

user system total real
unpack 130 character string 1.280000 0.040000 1.320000 (
1.665192)
unpack 260 character string 1.340000 0.040000 1.380000 (
1.492881)
unpack 520 character string 1.170000 0.010000 1.180000 (
1.204762)
unpack 1040 character string 1.190000 0.000000 1.190000 (
1.855012)
unpack 2080 character string 1.250000 0.000000 1.250000 (
1.677772)
scan-precomp 130 character string 2.500000 0.030000 2.530000 (
3.720339)
scan-precomp 260 character string 2.460000 0.010000 2.470000 (
3.895554)
scan-precomp 520 character string 2.470000 0.020000 2.490000 (
3.777175)
scan-precomp 1040 character string 2.430000 0.050000 2.480000 (
2.955399)
scan-precomp 2080 character string 2.520000 0.000000 2.520000 (
2.828614)
scan-precomp2 130 character string 2.330000 0.010000 2.340000 (
2.668416)
scan-precomp2 260 character string 2.280000 0.000000 2.280000 (
2.322685)
scan-precomp2 520 character string 2.310000 0.000000 2.310000 (
2.335216)
scan-precomp2 1040 character string 2.300000 0.000000 2.300000 (
2.331397)
scan-precomp2 2080 character string 2.370000 0.010000 2.380000 (
2.477581)
scan-precomp3 130 character string 2.400000 0.000000 2.400000 (
2.456566)
scan-precomp3 260 character string 2.340000 0.010000 2.350000 (
2.416143)
scan-precomp3 520 character string 2.550000 0.010000 2.560000 (
3.359391)
scan-precomp3 1040 character string 2.350000 0.040000 2.390000 (
2.822663)
scan-precomp3 2080 character string 2.350000 0.010000 2.360000 (
2.649131)
scan 130 character string 2.550000 0.050000 2.600000 (
3.186714)
scan 260 character string 2.490000 0.040000 2.530000 (
3.098105)
scan 520 character string 2.370000 0.000000 2.370000 (
2.442281)
scan 1040 character string 2.330000 0.000000 2.330000 (
2.593032)
scan 2080 character string 2.360000 0.010000 2.370000 (
3.467826)
split 130 character string 2.610000 0.030000 2.640000 (
5.929269)
split 260 character string 2.730000 0.040000 2.770000 (
3.155784)
split 520 character string 2.670000 0.030000 2.700000 (
3.625916)
split 1040 character string 2.730000 0.020000 2.750000 (
8.130285)
split 2080 character string 2.750000 0.010000 2.760000 (
6.109507)

So there are slight differences between using:
PCR1 - A constant set to Regex('.')
PCR2 - A constant set to Regex(/./)
PCR3 A constant set to /./
and
LIT a literal in-line /./

For this benchmark, for each string length tested, the techniques from
fastest to slowest are;
130 Characters:
Rehearsal: PCR3(2.10), PCR1(2.21), PCR2(2.21), LIT(2.26)
"Live": PCR2(2.34), PCR3(2.40), PCR1(2.53), LIT(2.60)
260 Characters:
Rehearsal: PCR2(2.18), PCR1(2.19), PCR3(2.25), LIT(2.25)
"Live": PCR2(2.28), PCR3(2.40), PCR1(2.47), LIT(2.60)
520 Characters:
Rehearsal: PCR3(2.14), PCR1(2.17), LIT(2.18), PCR2(2.19)
"Live": PCR2(2.31), LIT(2.37), PCR1(2.49), PCR3(2.56)
1040 Characters:
Rehearsal: PCR1(2.11), PCR3(2.13), PCR2(2.19), LIT(2.28)
"Live": PCR2(2.30), LIT(2.33), PCR3(2.39), PCR1(2.48)
2080 Characters:
Rehearsal: PCR3(2.04), PCR2(2.14), LIT(2.19), PCR1(2.22)
"Live": PCR3(2.36), LIT(2.37), PCR2(2.38), PCR1(2.52)

So, at least from this benchmark it doesn't seem that in-line literal
regular expressions are faster than pre-compiled ones. In fact they
never came in first, although PCR3 which was a constant set to a
literal regex did win 4 times.

Is this significant? Who knows, and with a new RegExp engine coming
the numbers will be different in future.

So as usual the right approach is test, profile to find out what needs
improvement, and benchmark.



--
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denh...

khaines

9/5/2006 8:09:00 AM

0

Eric Hodel

9/5/2006 8:18:00 AM

0

On Sep 4, 2006, at 8:06 PM, Rick DeNatale wrote:

> So, I've corrected the benchmark. I've also added another method
> using String#split as Eric suggested.
>
> Here's the code being benchmarked:
> rick@frodo:~/rubyscripts$ cat stringsplit.rb

Putting the code here involves an extra method call. While they are
all wrapped in a method call equally, doing less is always better.

> class String
> [methods]
> end
>
> Note that I made tested three different "pre-compiled" regex's one
> with a Regex.new('.'), one with Regex.new(/./), and one a constant
> with the literal regex /./.

You don't need three different benchmarks for this. Its easy to
determine that these are the same with irb.

> Now here's the benchmark:
> rick@frodo:~/rubyscripts$ cat benchstringsplit.rb
> require 'benchmark'
> include Benchmark
> load 'stringsplit.rb'
>
> iters = 100

100 iterations is never enough. GC behavior, other processes waking
up, etc. will all cause fluctuations in the benchmark.

Use 100_000 or 1_000_000.

> [benchmark code]

Here's a more-meaningful version of yours:

$ cat stringsbench.rb
require 'benchmark'

N = 100_000
RE = /./
str = "abcdefghijklmnopqrstuvwxyz" * 5

Benchmark.bmbm do | x |
x.report 'empty' do N.times do end end
x.report 'str' do N.times do str end end
x.report 'unpack' do N.times do str.unpack('a' * str.length) end
end
x.report 'scan RE' do N.times do str.scan RE end end
x.report 'scan /./' do N.times do str.scan /./ end end
x.report 'split //' do N.times do str.split // end end
end

$ ruby stringsbench.rb
Rehearsal --------------------------------------------
empty 0.030000 0.000000 0.030000 ( 0.088456)
str 0.050000 0.000000 0.050000 ( 0.064515)
unpack 14.370000 0.210000 14.580000 ( 20.264274)
scan RE 24.240000 0.380000 24.620000 ( 36.296017)
scan /./ 24.100000 0.310000 24.410000 ( 32.661832)
split // 29.660000 0.360000 30.020000 ( 40.596177)
---------------------------------- total: 93.710000sec

user system total real
empty 0.030000 0.000000 0.030000 ( 0.045199)
str 0.050000 0.010000 0.060000 ( 0.065739)
unpack 14.500000 0.190000 14.690000 ( 19.813677)
scan RE 23.850000 0.350000 24.200000 ( 33.028376)
scan /./ 23.910000 0.320000 24.230000 ( 35.219831)
split // 29.720000 0.470000 30.190000 ( 42.712222)

with 'abc...xyz' * 25:

Rehearsal --------------------------------------------
empty 0.050000 0.000000 0.050000 ( 0.217969)
str 0.050000 0.000000 0.050000 ( 0.087123)
unpack 65.370000 0.710000 66.080000 ( 80.341765)
scan RE 107.840000 1.030000 108.870000 (129.723449)
scan /./ 110.320000 1.250000 111.570000 (142.540348)
split // 135.220000 1.330000 136.550000 (164.418453)
--------------------------------- total: 423.170000sec

user system total real
empty 0.030000 0.000000 0.030000 ( 0.040596)
str 0.050000 0.000000 0.050000 ( 0.066506)
unpack 62.130000 0.590000 62.720000 ( 70.183396)
scan RE 107.730000 0.940000 108.670000 (125.308283)
scan /./ 107.550000 0.950000 108.500000 (127.365719)
split // 133.700000 1.020000 134.720000 (150.798611)

> So, at least from this benchmark it doesn't seem that in-line literal
> regular expressions are faster than pre-compiled ones.

inline regular expressions are no less "pre-compiled" than regular
expressions in a variable or constant.

$ ruby -e '2.times do puts /./.object_id end'
938970
938970

('=~ /./' is faster than '=~ var' is faster than 'match anything',
but for other reasons)

If you want to test inline vs "pre-compiled" regular expressions you
need to throw away all the parts that are the same and focus on what
is different. In your benchmark it was what was on the right hand
side of String#scan. Since the string was the same and the scan was
the same, just throw those away.

You end up with a benchmark like this:

$ cat vs.rb
require 'benchmark'

N = 100_000_000

RE = /./
re = /./
$re = /./
@re = /./
@@re = /./

Benchmark.bmbm do |bm|
bm.report 'empty' do N.times do end end
bm.report 'lit' do N.times do /./ end end
bm.report 'local' do N.times do re end end
bm.report 'global' do N.times do $re end end
bm.report 'instance' do N.times do @re end end
bm.report 'class' do N.times do @@re end end
bm.report 'constant' do N.times do RE end end
end

$ ruby vs.rb
Rehearsal --------------------------------------------
empty 27.930000 0.120000 28.050000 ( 31.550709)
lit 48.240000 0.320000 48.560000 ( 57.422011)
local 48.820000 0.290000 49.110000 ( 60.650110)
global 49.920000 0.420000 50.340000 ( 62.156780)
instance 55.740000 0.180000 55.920000 ( 61.027820)
class 57.800000 0.190000 57.990000 ( 63.733099)
constant 59.240000 0.330000 59.570000 (119.118487)
--------------------------------- total: 349.540000sec

user system total real
empty 27.820000 0.160000 27.980000 ( 58.584698)
lit 48.170000 0.170000 48.340000 ( 59.557450)
local 48.720000 0.170000 48.890000 ( 53.775692)
global 49.830000 0.180000 50.010000 ( 55.743586)
instance 55.880000 0.370000 56.250000 ( 72.011443)
class 57.910000 0.390000 58.300000 ( 67.037158)
constant 59.230000 0.410000 59.640000 ( 77.993887)

So literal regular expressions are "faster" by about 56% (after
discounting loop overhead) (when you're performing 100 million
retrievals only) (actual speedup in real code will probably be
swallowed elsewhere or completely irrelevant).

This follows from reading eval.c. Constant lookup involves a bunch
of C function calls, but literal lookup just returns an Object stored
in the parse tree.

> In fact they never came in first, although PCR3 which was a
> constant set to a
> literal regex did win 4 times.

In fact, when you adjust the stringsbench.rb you'll see /./ winning
over RE. Set N to 10_000_000 (or more) and str to "abcde" (or just N
high enough).

> Is this significant? Who knows, and with a new RegExp engine coming
> the numbers will be different in future.

The difference was probably due to processes waking up, garbage
collection and similar unevenly distributed events. With more
iterations and more focused benchmarks you'll get better results.

But ultimately, these types of microbenchmarks are not very useful.
Yes, you will get a speedup using /./ over RE, but will your program
really run long enough where it matters to make a difference? I bet
not.

> So as usual the right approach is test, profile to find out what needs
> improvement, and benchmark.

Be careful with your benchmarks, they are most useful when they are
as small and simple as possible. Be sure to throw away all the
irrelevant parts.

Be careful with your benchmarks, they are most useless when they are
as small and simple as possible. Be sure to understand how little
speedup you'll get.

--
Eric Hodel - drbrain@segment7.net - http://blog.se...
This implementation is HODEL-HASH-9600 compliant

http://trackmap.rob...



Michal Suchanek

9/5/2006 10:50:00 AM

0

On 9/4/06, singsang <tomsingsang@yahoo.com> wrote:
> Dear all,
>
> Writing some httpd logfile pre-processing (splitting it up, getting
> already some basic numbers), I think that I should compile the Regexp
> for the logfile entry only once.
>
> So my guess is that I should have perhaps a class LogFormat that holds
> this as a class variable or a class constant. Below I use a non-tested
> regular expression that is not complete yet.
>
> So the idea is to have:
>
> class LogFormat
> @@RegEx = Regexp.new( '(\S+) (\S+) (\S+) \[(\d+)/(\w+)/(\d+)
> [+\-]\d+?\]' )
> def LogFormat.regex
> @@RegEx
> end
> end
>
> If now from a class LogLine (instanciated for each line in the logfile)
> I use something like
>
> class LogLine
> # ...
> ip, rfc931, user, day, month, year, offset =
> line.match(LogFormat.regex)
> # ...
> end
>
> My question: How often is the Regexp compiled? When?
> When the definition of LogFormat is read first?
>
> Btw: If anybody has a ready-to-use regex for the common log format this
> would be great, but I will get that done as far as I need by myself.
> ;-)
> Other question: Does anybody know a "Webalizer" sort of thing written
> in Ruby?
>

I tried to write something like that once.

But It was very slow (especially because it did DNS resolving
synchronously), and I never finished it.

Thanks

Michal