Asp Forum - Regex: Up for a challenge?

basi

12/21/2005 11:03:00 PM

Hello,
Just having an outstandingly hard time to even come close to being able
to translate the following string sequences in one or a series of
regular expressions. These are allowable prefix combinations in a
language I'm doing some text analysis on.

i pa(ki) pag
ma i ka pa(ki) pag
ma pa ka
ma(ki) pag ka
man
m(an) ag ka sin
m(an) ag kan(da)
m(an) ag si(pag)
m(an) ag pa

The first one allows the following choices:
i, pa, pag, ipa, ipag, ipapag, papag, ipaki, ipakipag, paki, pakipag

The syllable inside parentheses can only be selected if the syllable or
letter it is attached to is selected.

Just something to take our minds away from Eckel-ysteria.

Thanks again,
basi

6 Answers

Ross Bamford

12/21/2005 11:29:00 PM

On Wed, 21 Dec 2005 23:03:00 -0000, basi <basi_lio@hotmail.com> wrote:

> Hello,
> Just having an outstandingly hard time to even come close to being able
> to translate the following string sequences in one or a series of
> regular expressions. These are allowable prefix combinations in a
> language I'm doing some text analysis on.
>
> i pa(ki) pag
> ma i ka pa(ki) pag
> ma pa ka
> ma(ki) pag ka
> man
> m(an) ag ka sin
> m(an) ag kan(da)
> m(an) ag si(pag)
> m(an) ag pa
>
> The first one allows the following choices:
> i, pa, pag, ipa, ipag, ipapag, papag, ipaki, ipakipag, paki, pakipag
>
> The syllable inside parentheses can only be selected if the syllable or
> letter it is attached to is selected.
>

Maybe these are a bit naive, but they seem to (?) work for the tests I
could come up with...

1) /\Ai?(pa(ki)?)?(pag)?\Z/
2) /\A(ma)?(i)?(ka)?(pa(ki)?)?(pag)?\Z/
3) /(ma)?(pa)?(ka)?/
4) /\A(ma(ki)?)?(pag)?(ka)?\Z/
5) /man/
6) /\A(m(an)?)?(ag)?(ka)?(sin)?\Z/
7) /\A(m(an)?)?(ag)?(kan(da)?)?\Z/
8) /\A(m(an)?)?(ag)?(si(pag)?)?\Z/
9) /\A(m(an)?)?(ag)?(pa)?\Z/

--
Ross Bamford - rosco@roscopeco.remove.co.uk

Ross Bamford

12/21/2005 11:35:00 PM

On Wed, 21 Dec 2005 23:29:19 -0000, Ross Bamford
<rosco@roscopeco.remove.co.uk> wrote:

> Maybe these are a bit naive, but they seem to (?) work for the tests I
> could come up with...
>

Except of course, they match the empty string :P

--
Ross Bamford - rosco@roscopeco.remove.co.uk

Rich Morin

12/21/2005 11:43:00 PM

My first cut at the problem would be to

* write a generator that translates string sequences

i pa(ki) pag

into persistent lists of prefixes

i, pa, pag, ipa, ipag, ...

* write a method that uses these prefixes,
as is: /^i/, /^pa/, ...

As Ken Thompson said, "When in doubt, use brute force".

Once this worked, I'd check to see if it was "fast enough".
If not, I'd look into reworking the methods. The important
issues are (a) getting something working quickly and (b)
keeping the implementation details hidden. If speed is a
real issue, you may need to go to a fairly fancy solution,
so don't tie yourself to regexes!

-r
--
email: rdm@cfcl.com; phone: +1 650-873-7841
http://ww... - Canta Forda Computer Laboratory
http://ww.../Meta - The FreeBSD Browser, Meta Project, etc.

basi

12/22/2005 3:45:00 AM

Hi,
Speed is not an issue at this point. As you say, just get it working
for now. What would be a "fancy" solution?

Right now, my solution is bruter than brute force: Take the first 9
syllables (prefixes can pile up to syllables deep, which include
reduplicated syllables and an infix somewhere there. The data in my OP
does not indicate these, but reduplications and infixation are easy to
wedge into the solution). Match it with the prefix list. If no match,
take the first 8 syllables, and so forth. Shameless.

Thanks!
basi

basi

12/22/2005 3:51:00 AM

Hi,
Thanks for your suggestion. I'm checking it right now for under/over
generation!
basi

Mark J. Reed

12/22/2005 5:59:00 AM

"basi" <basi_lio@hotmail.com> writes:

>Hello,
>Just having an outstandingly hard time to even come close to being able
>to translate the following string sequences in one or a series of
>regular expressions. These are allowable prefix combinations in a
>language I'm doing some text analysis on.

>i pa(ki) pag
>The first one allows the following choices:
>, pag, ipa, ipag, ipapag, papag, ipaki, ipakipag, paki, pakipag

You need to use alternatives to handle the nonempty constraint.
For instance, the basic structure for the first sequence is this:

/^(strings with i|strings with pa(ki)|strings with pag)$/

For instance, this works:

/^(i(pa(ki)?)?(pag)?|i?pa(ki)?(pag)?|i?(pa(ki)?)?pag)$/

But it's redundant, since several strings will match more than one of the
alternatives. For instance, the first alt takes care of all the strings with
i, so there's no need for i? in the other two parts. Similarly, the first and
second parts together handle all the strings with pa (with or without i- and
with or without -ki), so there's no need to include them in the third alt.
This matches all the valid strings, and I haven't found an invalid string
that it matches - but it's almost 1 AM, so I could be missing something. :)

/^(i(pa(ki)?)?(pag)?|pa(ki)?(pag)?|pag)$/

If you need to worry about the difference between lines and strings, you should
use \A and \Z instead of ^ and $. It may be more efficient to use
non-capturing parens (?:...) instead of the plain ones, but I think it
makes it harder to read and type.

comp.lang.ruby

Regex: Up for a challenge?

basi

Ross Bamford

Ross Bamford

Rich Morin

basi

basi

Mark J. Reed

x Login to ForumsZone