Asp Forum - regular expression too big

Peter Schrammel

11/12/2006 2:26:00 PM

Hi,

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

But unfortunately ruby gives me a 'regular expression too big' if I'm
trying to build such a thing.
I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
regexes. Is there a way around this (without going down to 2000 words) ?

Thanks for any hint

Peter

30 Answers

Jeff Schwab

11/12/2006 2:36:00 PM

Peter Schrammel wrote:

> got problem with big regexes:
> I have a regex of about 70000+ words concated with '|' that I'd like to
> match as a regex. /bla|blub|foo|bar|.....(70000)/
>
> But unfortunately ruby gives me a 'regular expression too big' if I'm
> trying to build such a thing.
> I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
> regexes. Is there a way around this (without going down to 2000 words) ?
>
> Thanks for any hint

You could optimize the regex a little for size, e.g. by factoring out
common prefixes:

(b(l(a|ub)|ar)|foo)...

Of course, that will only help if the | alternatives have a reasonable
amount of redundancy. Alternatively, you could just break the whole
thing into multiple expressions. Instead of

if /first_part|second_part/ =~ text

You could try:

if /first_part/ =~ text or /second_part/ =~ text

Peter Schrammel

11/12/2006 3:02:00 PM

Jeffrey Schwab wrote:
> Peter Schrammel wrote:
>
>> got problem with big regexes:
>> I have a regex of about 70000+ words concated with '|' that I'd like to
>> match as a regex. /bla|blub|foo|bar|.....(70000)/
>>
>> But unfortunately ruby gives me a 'regular expression too big' if I'm
>> trying to build such a thing.
>> I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
>> regexes. Is there a way around this (without going down to 2000 words) ?
>>
>> Thanks for any hint
>
> You could optimize the regex a little for size, e.g. by factoring out
> common prefixes:
>
> (b(l(a|ub)|ar)|foo)...

Thought of that.

> Of course, that will only help if the | alternatives have a reasonable
> amount of redundancy. Alternatively, you could just break the whole
> thing into multiple expressions. Instead of
>
> if /first_part|second_part/ =~ text
>
> You could try:
>
> if /first_part/ =~ text or /second_part/ =~ text

Yes, that was my next thought but where to split? Just count the bytes
and splitt near 1 <<16?

Why is there a limitation at all? I implemented the same thing in perl
and it no complains ...
Is the regexp engine of perl that much better?

Thanks for the reply

Louis J Scoras

11/12/2006 3:28:00 PM

This doesn't really help with the actual question of getting past
regex size limits, but are you sure that regexen are correct solution
to this problem? Unless I'm mistaken, the above match is going to be
horribly, painfully slow; the issue you're running into is probably an
indication that you might want to look elsewhere.

I don't want to make any silly claims without doing any benchmarking
on your data, but I would imagine that even doing an efficient search
on a sorted array of your words would give you better performance than
the regex search. You can move up from there into hashes or other
data structures for this sort of thing.

--
Lou

Ross Bamford

11/12/2006 3:35:00 PM

On Sun, 12 Nov 2006 15:01:56 -0000, Peter Schrammel =

<peter.schrammel@gmx.de> wrote:

> Why is there a limitation at all? I implemented the same thing in perl=

> and it no complains ...
> Is the regexp engine of perl that much better?
>

Irrespective of whether regex the best solution for your needs, it seems=
=

Oniguruma will improve the situation somewhat with respect to large =

regular expressions.

$ wc -w acd-holmes-return.txt
112110 acd-holmes-return.txt

$ cat bigregexp.rb
txt =3D File.read('acd-holmes-return.txt').split(' ')
re =3D /^(#{txt.map { |e| "#{Regexp.escape(e)}" }.join('|')})$/

p re =3D~ "beautiful"
p $&
__END__

$ ruby -v bigregexp.rb
ruby 1.8.5 (2006-08-25) [i686-linux]
bigregexp.rb:2: regular expression too big: /...[snipped].../ (RegexpErr=
or)

$ ruby9 -v bigregexp.rb
ruby 1.9.0 (2006-09-07) [i686-linux]
0
"beautiful"

-- =

Ross Bamford - rosco@roscopeco.remove.co.uk

Jeff Schwab

11/12/2006 4:48:00 PM

Peter Schrammel wrote:
> Jeffrey Schwab wrote:
>> Peter Schrammel wrote:
>>
>>> got problem with big regexes:
>>> I have a regex of about 70000+ words concated with '|' that I'd like to
>>> match as a regex. /bla|blub|foo|bar|.....(70000)/
>>>
>>> But unfortunately ruby gives me a 'regular expression too big' if I'm
>>> trying to build such a thing.
>>> I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
>>> regexes. Is there a way around this (without going down to 2000 words) ?
>>>
>>> Thanks for any hint
>> You could optimize the regex a little for size, e.g. by factoring out
>> common prefixes:
>>
>> (b(l(a|ub)|ar)|foo)...
>
> Thought of that.

Good for you.

>> Of course, that will only help if the | alternatives have a reasonable
>> amount of redundancy. Alternatively, you could just break the whole
>> thing into multiple expressions. Instead of
>>
>> if /first_part|second_part/ =~ text
>>
>> You could try:
>>
>> if /first_part/ =~ text or /second_part/ =~ text
>
> Yes, that was my next thought but where to split? Just count the bytes
> and splitt near 1 <<16?

Probably better not to construct the mega-regex in the first place. For
the record, finding yourself on the edges of the language's capacity
like this might be a sign that refactoring is in order.

Even if you're going to stick with your current technique, but work
around the size limitation, it's probably better not to build a megex
(TM) you'll have to split up. As you put the pattern together, only add
alternations as long as the cumulative size will be < 0x10000 (or a
well-commented static constant with that value).

> Why is there a limitation at all? I implemented the same thing in perl
> and it no complains ...
> Is the regexp engine of perl that much better?

As Friedl notes, Perl is darned close to being the ideal regex language. :)

Ruby regexes aren't necessarily meant to be the one hammer that can
drive every nail. If you want to be able to view every problem through
a regex lens, you'll probably have to dig a little deeper than
categorizing one language's engine as simply "better" than another.
Big, static sizes like 2**16 are often used to avoid dynamic allocation,
or otherwise improve runtime efficiency.

Also for the record: I'm a big fan of regexes. Though a lot of people
complain about complexity or efficiency issues, I've never had a problem
with either. I would be interested to see a comparison of the relative
merits and limitations of various engines, e.g. regex lengths,
benchmarks, big-O complexity, and ability to handle null bytes.

Ken Bloom

11/12/2006 7:15:00 PM

On Sun, 12 Nov 2006 15:25:49 +0100, Peter Schrammel wrote:

> Hi,
>
> got problem with big regexes:
> I have a regex of about 70000+ words concated with '|' that I'd like to
> match as a regex. /bla|blub|foo|bar|.....(70000)/
>
> But unfortunately ruby gives me a 'regular expression too big' if I'm
> trying to build such a thing.
> I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
> regexes. Is there a way around this (without going down to 2000 words) ?
>
> Thanks for any hint
>
> Peter

Maybe a trie would be useful?
http://rubyforge.org/proj...
(or there's another trie at http://kzk9.net/software/miscprog...)

--Ken

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu...

Ken Bloom

11/12/2006 7:26:00 PM

On Sun, 12 Nov 2006 19:15:12 +0000, Ken Bloom wrote:

> On Sun, 12 Nov 2006 15:25:49 +0100, Peter Schrammel wrote:
>
>> Hi,
>>
>> got problem with big regexes:
>> I have a regex of about 70000+ words concated with '|' that I'd like to
>> match as a regex. /bla|blub|foo|bar|.....(70000)/
>>
>> But unfortunately ruby gives me a 'regular expression too big' if I'm
>> trying to build such a thing.
>> I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
>> regexes. Is there a way around this (without going down to 2000 words) ?
>>
>> Thanks for any hint
>>
>> Peter
>
> Maybe a trie would be useful?
> http://rubyforge.org/proj...
> (or there's another trie at http://kzk9.net/software/miscprog...)

On one more thing: to implement substring search using a trie, when
adding words to the trie, you should generate appropriate back links so
that you can implement =~ use the Knuth-Morris-Pratt algorithm for matching.

--Ken

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu...

Paul Lutus

11/12/2006 7:51:00 PM

Peter Schrammel wrote:

> Hi,
>
> got problem with big regexes:
> I have a regex of about 70000+ words concated with '|' that I'd like to
> match as a regex. /bla|blub|foo|bar|.....(70000)/

It appears you are trying to match a word against a list of words. Yes? From
an efficiency standpoint, if the expression is really as shown (no real
regex syntax), why not create a hash out of the data and see if the desired
word is present that way? That would likely be much faster than running the
(non-regex) regex against each word in the input data.

When you post an question like this, it is always a good idea to reveal the
purpose as well as the method.

------------------------------------------

#!/usr/bin/ruby -w

data = File.read("/usr/share/dict/words")

words = data.split(%r{\s+}m)

hash = {}

words.each do |word|
hash[word] = true
end

while true
print "Enter a word:"
word = STDIN.readline.chomp
if hash[word]
puts "Valid word."
else
puts "Not present."
end
end

------------------------------------------

Now you have a hash that can be used for testing a list of words.

--
Paul Lutus
http://www.ara...

Simon Strandgaard

11/12/2006 8:47:00 PM

On 11/12/06, Peter Schrammel <peter.schrammel@gmx.de> wrote:
> got problem with big regexes:
> I have a regex of about 70000+ words concated with '|' that I'd like to
> match as a regex. /bla|blub|foo|bar|.....(70000)/

if you have many words to check for then consider using a
http://en.wikipedia.org/wiki/Bl...

--
Simon Strandgaard

Peter Schrammel

11/12/2006 9:08:00 PM

> When you post an question like this, it is always a good idea to
reveal the
> purpose as well as the method.
>
>

First of all: Thanks for the replies, I think I have enough input to
chew on.

And to reveal the purpose:
I'd like to match a LOT of words/strings (words with spaces) and
sometimes regexes against a lot of small and medium size( < 10000 byte )
strings.

The comparison with Perl was just a comparison of the regexp engines and
not the language itself. It was just that until now I had never to think
much about the underlying hardware .... because ruby did that for me. So
I was surprised by the "too big exception".

Peter

comp.lang.ruby

regular expression too big

Peter Schrammel

Jeff Schwab

Peter Schrammel

Louis J Scoras

Ross Bamford

Jeff Schwab

Ken Bloom

Ken Bloom

Paul Lutus

Simon Strandgaard

Peter Schrammel

x Login to ForumsZone