Jeff Schwab
11/12/2006 4:48:00 PM
Peter Schrammel wrote:
> Jeffrey Schwab wrote:
>> Peter Schrammel wrote:
>>
>>> got problem with big regexes:
>>> I have a regex of about 70000+ words concated with '|' that I'd like to
>>> match as a regex. /bla|blub|foo|bar|.....(70000)/
>>>
>>> But unfortunately ruby gives me a 'regular expression too big' if I'm
>>> trying to build such a thing.
>>> I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
>>> regexes. Is there a way around this (without going down to 2000 words) ?
>>>
>>> Thanks for any hint
>> You could optimize the regex a little for size, e.g. by factoring out
>> common prefixes:
>>
>> (b(l(a|ub)|ar)|foo)...
>
> Thought of that.
Good for you.
>> Of course, that will only help if the | alternatives have a reasonable
>> amount of redundancy. Alternatively, you could just break the whole
>> thing into multiple expressions. Instead of
>>
>> if /first_part|second_part/ =~ text
>>
>> You could try:
>>
>> if /first_part/ =~ text or /second_part/ =~ text
>
> Yes, that was my next thought but where to split? Just count the bytes
> and splitt near 1 <<16?
Probably better not to construct the mega-regex in the first place. For
the record, finding yourself on the edges of the language's capacity
like this might be a sign that refactoring is in order.
Even if you're going to stick with your current technique, but work
around the size limitation, it's probably better not to build a megex
(TM) you'll have to split up. As you put the pattern together, only add
alternations as long as the cumulative size will be < 0x10000 (or a
well-commented static constant with that value).
> Why is there a limitation at all? I implemented the same thing in perl
> and it no complains ...
> Is the regexp engine of perl that much better?
As Friedl notes, Perl is darned close to being the ideal regex language. :)
Ruby regexes aren't necessarily meant to be the one hammer that can
drive every nail. If you want to be able to view every problem through
a regex lens, you'll probably have to dig a little deeper than
categorizing one language's engine as simply "better" than another.
Big, static sizes like 2**16 are often used to avoid dynamic allocation,
or otherwise improve runtime efficiency.
Also for the record: I'm a big fan of regexes. Though a lot of people
complain about complexity or efficiency issues, I've never had a problem
with either. I would be interested to see a comparison of the relative
merits and limitations of various engines, e.g. regex lengths,
benchmarks, big-O complexity, and ability to handle null bytes.