Clifford Heath
1/29/2008 5:06:00 AM
Phrogz wrote:
> On Jan 28, 4:57 pm, Clifford Heath <n...@spam.please.net> wrote:
>> This rule is equivalent:
>> rule CFWS
>> ( FWS / comment )+
>> end
>> Just a bit simpler, eh?
> You know, I looked at that rule (and some like it) and said "Man, what
> were they smoking!?" But yeah, this was a quick translation over lunch
> and I explicitly didn't want to try translating anything.
>
> To be fair, that's not quite the same rule, right? Your rule would
> allow FWS FWS FWS, while the original requires a comment between them.
Well, yes-ish. If you look at the definition of FWS, it matches any non-zero
amount of whitespace, as long as it has only one CRLF (I missed this 2nd
restriction when checking equivalence). The easiest way to fix this is to
separate the CRLF from other whitespace, and ensure that there is comment
text (potentially containing whitespace) on every line except the first and
last, or something like that. An exercise for the reader ;-).
> I was happy to accept the no-left-recursion limitation of the grammar,
> because it's easy to spot and the downside is infinite recursion.
Yes, that's not something that's actually happened to me, so I've never
even needed to spot it :-). I ran into it big-time when I first used
TXL though!
> But what you describe here seems far more insidious. This limitation
> (which I assume is not in fact limited to PEG in general, but
> specifically to the packrat memoization technique that makes this
> solution perform reasonably)
Correct, I meant the packrat technique, not PEG.
> says to me: There are some cases where
> you can write something that looks correct and technically *is*
> correct, but that will silently fail on otherwise valid content.
There's a reason why Nathan points out that among a set of alternates,
the *first* one that succeeds is taken. That means, even if that makes a
higher-level rule fail, it won't try subsequent alternatives. Yes, it's
a trap, but once you get into the right thought pattern, it's not that
hard to avoid. You basically need to consider where you might mis-read
the text yourself, and introduce negative lookahead that prevents the
mis-reading.
Oh, and use TDD :-). You need that to ensure that whitespace is
skipped, or expected, everywhere it should, if for no other reason.
You just won't get it right with TT otherwise... but you won't with
any other tool either.
> Can you provide any guidelines or reading points on how to recognize a
> set of rules that may fail like this? Is it limited to use of the zero-
> or-more star symbol? Particularly likely to occur when any star or
> plus quantifier appears at the start of a rule?
Not just at the start of a rule, but anywhere in the rule. You need to
ensure that you insert a !wrong_path predicate to limit the repetition,
any time a repetition might chew too much. Likewise for alternation,
make sure you get the alternates in the right order, and dress any with
negative lookahead when an early choice might be wrong.
> Thanks again for your help. I'm still excited about treetop, but less
> so as I find that I actually have to know some theory about the
> language I'm writing in (*gasp*) instead of being able to naively
> throw together BNF-like rules and have it magically all work out. ;)
Have a play with ANTLRworks, which will analyze things and tell you when
you have a problem, and when you find that ANTLR's other restrictions
begin to frustrate the crap out of you (as happened to me), come back to
using Treetop. :-)
Does anyone else wish for the ability of Treetop to emit languages
other than Ruby? I'm probably going to need C# one day, and perhaps
Java.
Clifford Heath.