Mark J. Reed
12/22/2005 5:59:00 AM
"basi" <basi_lio@hotmail.com> writes:
>Hello,
>Just having an outstandingly hard time to even come close to being able
>to translate the following string sequences in one or a series of
>regular expressions. These are allowable prefix combinations in a
>language I'm doing some text analysis on.
>i pa(ki) pag
>The first one allows the following choices:
>i, pa, pag, ipa, ipag, ipapag, papag, ipaki, ipakipag, paki, pakipag
Making every part optional, as in /^i?(pa(ki)?)?(pag)?$/, would also match the
empty string, so you need to use alternatives to handle the nonempty constraint.
For instance, the basic structure for the first sequence is this:
/^(strings with i|strings with pa(ki)|strings with pag)$/
Spelled out in full, this works:
/^(i(pa(ki)?)?(pag)?|i?pa(ki)?(pag)?|i?(pa(ki)?)?pag)$/
But it's redundant, since several strings will match more than one of the
alternatives. For instance, the first alt takes care of all the strings with
i, so there's no need for i? in the other two parts. Similarly, the first and
second parts together handle all the strings with pa (with or without i- and
with or without -ki), so there's no need to include them in the third alt.
With those redundancies removed, the pattern below matches all the valid
strings, and I haven't found an invalid string that it matches - but it's
almost 1 AM, so I could be missing something. :)
/^(i(pa(ki)?)?(pag)?|pa(ki)?(pag)?|pag)$/
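One way to double-check both patterns is to enumerate the valid prefixes and
test them mechanically. The sketch below assumes Ruby (the post itself names no
language). The variable names and the choice of test candidates (every
concatenation of up to five of the pieces i, pa, ki, pag) are illustrative, so
a clean run is evidence rather than proof.

```ruby
# Assumption: Ruby 2.4+ (for Regexp#match?); the post itself names no language.

# Every nonempty prefix allowed by "i pa(ki) pag":
# optional "i", then optional "pa" or "paki", then optional "pag".
valid = ["", "i"].product(["", "pa", "paki"], ["", "pag"])
                 .map(&:join)
                 .reject(&:empty?)

redundant  = /^(i(pa(ki)?)?(pag)?|i?pa(ki)?(pag)?|i?(pa(ki)?)?pag)$/
simplified = /^(i(pa(ki)?)?(pag)?|pa(ki)?(pag)?|pag)$/

# Valid prefixes that either pattern fails to match.
missed = valid.reject { |s| redundant.match?(s) && simplified.match?(s) }

# Candidate strings: every concatenation of 1 to 5 pieces, in any order.
pieces = %w[i pa ki pag]
candidates = (1..5).flat_map { |n| pieces.repeated_permutation(n).map(&:join) }.uniq

# Candidates the simplified pattern accepts even though they are not valid.
false_hits = candidates.select { |s| simplified.match?(s) && !valid.include?(s) }

# Candidates on which the two patterns disagree.
disagree = candidates.reject { |s| redundant.match?(s) == simplified.match?(s) }

puts "valid prefixes: #{valid.size}"
puts "missed: #{missed.inspect}"
puts "false hits: #{false_hits.inspect}"
puts "disagreements: #{disagree.inspect}"
```

If all three lists print as empty, the simplified pattern accepts the same
candidates as the redundant one and nothing outside the valid set.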
If you need to worry about the difference between lines and strings, you should
use \A and \z instead of ^ and $ (\Z also works, but it tolerates one trailing
newline). It may be more efficient to use non-capturing parens (?:...) instead
of the plain ones, but I think it makes it harder to read and type.
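
To make the anchor and paren remarks concrete, here is a small sketch, again
assuming Ruby, where ^ and $ match at every line boundary. The names by_line
and by_string are made up for the example, and the non-capturing rewrite is
just the simplified pattern with (?:...) substituted throughout.

```ruby
# Assumption: Ruby 2.4+; variable names are illustrative.
by_line   = /^(i(pa(ki)?)?(pag)?|pa(ki)?(pag)?|pag)$/
by_string = /\A(?:i(?:pa(?:ki)?)?(?:pag)?|pa(?:ki)?(?:pag)?|pag)\z/

text = "junk\nipag\nmore junk"
puts by_line.match?(text)    # true: "ipag" is a complete line inside the text
puts by_string.match?(text)  # false: the text as a whole is not a valid prefix

# \Z accepts one trailing newline, \z does not.
puts(/\Apag\Z/.match?("pag\n"))  # true
puts(/\Apag\z/.match?("pag\n"))  # false

# Plain parens record submatches; (?:...) groups without recording anything.
p "ipakipag".match(by_line).captures    # ["ipakipag", "paki", "ki", "pag", nil, nil]
p "ipakipag".match(by_string).captures  # []
```

Whether the non-capturing form is measurably faster depends on the engine and
the input, so treat that as something to benchmark rather than assume.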