Eric Mahurin
9/28/2008 3:30:00 AM
On Sat, Sep 27, 2008 at 1:04 PM, <parrt@antlr.org> wrote:
> On Sep 26, 10:30 pm, Eric Mahurin <eric.mahu...@gmail.com> wrote:
> > On Fri, Sep 26, 2008 at 1:19 AM, <pa...@antlr.org> wrote:
> > > As for packrat etc... Turn on backtrack=true and ANTLR throttles up
> > > from LL(1) to LL(*) to a PEG as necessary. Memoization only occurs if
> > > you say so, either on grammar or on rule. I have plans to have ANTLR
> > > suggest where you should memoize.
> >
> > This sounds very good. I'd like to see how you did some of this.
>
> There is some really amazing jiggery-pokery in the grammar analyzer.
> Sometimes I look at it and say, "holy crap! it actually works!" ;)
>
> > My parser
> > generator does LL(1) and LL(*) where directed.
>
> Just to prevent confusion in the namespace, I assume you mean LL(k)
> not LL(*). LL(k) does more than a single symbol of lookahead but with
> a finite maximum determined at parser generation time. LL(*), on the
> other hand, can generate cyclic DFA that can look arbitrarily ahead to
> distinguish alternatives. As long as the lookahead language is
> regular, LL(*) can handle it. If you have a left common prefix that is
> recursive, naturally that means the lookahead language is not regular.
> At that point, we must backtrack. That is where we get into predicated
> LL(*), which knows how to backtrack and incorporate semantic
> predicates of its own. In fact, I implement syntactic predicates
> (backtracking) as a special form of semantic predicate. This has
> advantages, as you'll see next.
I do mean LL(*). I decided up front to make my parser mainly LL(1) for
speed and clarity, but to give a backdoor to accomplish LL(*) via
backtracking.
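A rough sketch of the cyclic-DFA idea behind LL(*), using an assumed two-alternative grammar (an assignment vs. an expression statement, sharing the regular left prefix `ID ('.' ID)*`). This is illustrative only, not anyone's generated code:

```python
# Sketch: an LL(*) prediction DFA over the remaining input. The while
# loop is the DFA's cycle, so the decision can scan arbitrarily far
# ahead as long as the lookahead language stays regular.
def predict(tokens):
    i = 0
    if i >= len(tokens) or tokens[i] != "ID":
        return None
    i += 1
    # cyclic state: consume the shared prefix  ('.' ID)*
    while i + 1 < len(tokens) and tokens[i] == ".":
        if tokens[i + 1] != "ID":
            return None
        i += 2                      # stay in the cycle
    # accept states: the first symbol past the prefix picks the alternative
    if i < len(tokens) and tokens[i] == "=":
        return "assignment"
    if i < len(tokens) and tokens[i] == ";":
        return "expression_stmt"
    return None
```

Once the prefix becomes recursive (say, nested parentheses), the lookahead language is no longer regular and a DFA like this can't decide; that's the point at which backtracking takes over.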
>
> > When you set "backtrack=true", does it generate backtracking code
> > only where the grammar really needs it (a lookahead token/character
> > could match two alternatives?) or for (almost) all alternatives?
> > Now I'm curious to see what kind of code ANTLR generates,
> > especially with selective memoization.
>
> Yes, ANTLR does the optimal thing. Imagine two alternatives that are
> distinguishable with a single symbol of lookahead in 90% of the cases.
> But in 9% of the cases it requires three tokens of lookahead, and in
> 1% of the cases it requires a full backtrack. ANTLR will generate a
> parsing decision that does exactly that. It will never default to
> using the maximum strength for all input sequences, even within the
> same parsing decision. Without a lot of fancy-pants grammar analysis
> and DFA generation, this can't be done.
Very nice...
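To make that concrete (with an assumed grammar and hand-written logic, not actual ANTLR output), such a decision pays only for the strength each input needs, escalating from one token to a few tokens to a full speculative parse:

```python
# Sketch of an adaptive parsing decision: cheap lookahead first,
# full backtracking only for the rare inputs that need it.
def predict_alternative(tokens, try_full_parse):
    # common case: one token of lookahead settles it
    if tokens[0] == "if":
        return 1
    if tokens[0] == "while":
        return 2
    # rarer case: three tokens of lookahead settle it
    if tokens[:3] == ["ID", "(", ")"]:
        return 3
    if tokens[:3] == ["ID", "(", "ID"]:
        return 4
    # rare case: only a speculative parse (backtrack) can decide
    return 5 if try_full_parse(tokens) else 6
```

The escalation order matters: most inputs never reach the expensive branch, which is why this beats a pure always-backtrack strategy in practice.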
> Pure packrat parsing is really easy. There is no analysis to be done
> at all; you just always backtrack. But PEGs have a number of
> disadvantages. One is that they cannot handle infinite streams like
> socket protocols. Also, if you are always backtracking, the error
> message is essentially "no, your input does not match the syntax."
> Finally, action execution is extremely difficult in the face of
> backtracking, particularly continuous backtracking. Consequently, it
> is my opinion that pure PEG-based parser generators are not viable
> commercially unless you don't care about invalid input, you don't
> care about infinite input streams, and you don't care about action
> execution (you only need an AST or parse tree from the parse). For
> some projects, that actually makes sense, but for most commercial
> applications you'll need something like ANTLR's approach, where it
> tries to avoid backtracking at all costs.
>
I completely agree and came to the exact same conclusions a long time ago
(after starting down the auto-backtrack route).
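For reference, a minimal sketch of the memoization that makes packrat parsing linear: caching each (rule, position) result so backtracking never re-parses the same span twice. The grammar and rule names here are assumptions for illustration, not anyone's generated code:

```python
# Sketch: packrat-style memoization via a per-rule cache keyed on
# input position. Each rule returns the end position on success, or
# None on failure; both outcomes are cached.
from functools import lru_cache

def make_parser(tokens):
    @lru_cache(maxsize=None)
    def expr(pos):
        # expr : term '+' expr | term ;
        end = term(pos)
        if end is not None and end < len(tokens) and tokens[end] == "+":
            rest = expr(end + 1)
            if rest is not None:
                return rest
        return end          # backtrack to the bare-term alternative

    @lru_cache(maxsize=None)
    def term(pos):
        if pos < len(tokens) and tokens[pos] == "NUM":
            return pos + 1
        return None

    return expr

parse = make_parser(("NUM", "+", "NUM"))
```

Note the cache is also exactly what breaks streaming: every token seen so far must stay addressable by position, so the input can never be discarded, which is one reason pure packrat can't handle infinite streams.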
>
> > As an ex-ANTLR guy, I've always found you to be nice. I learned
> > enough about parsing while using ANTLR that I wanted to go off and
> > design my own. I've counted a bunch of others that did the same.
> > That says a lot. It was much easier to wrap your head around
> > parsing when you saw the recursive descent LL code that ANTLR
> > generated, compared to the tables of *ACC.
>
> Hooray! Yeah, I applaud everybody's attempt to build their own parser
> generator. Besides being fun, it helps to push the state-of-the-art.
> Bryan Ford's packrat parsing really helped me improve ANTLR. He and I
> are scheduled to write an LL(*) paper sometime this year.
>
> > I finally got my parser generator set up for re-targeting. The
> > code generator is more generically a grammar tree traverser. There
> > are a ton of possibilities for a given traverser: generate code in
> > a specific language, parse while traversing (no code generation),
> > generate a different type of parser (LALR?), generate grammar
> > diagrams, etc.
>
> The question in terms of retargetability is whether or not you can
> change all of your internal data structures without affecting the
> code generation very much. ;) Be careful you're not cutting and
> pasting code-generation logic when you make a new target. If so,
> it's not really retargetable. :)
Good point. This is something I struggled with for a while. I decided
to go with a fairly generic layer to hold the grammar tree and put most
of the complexity in the code generation engine. This gives a lot of
flexibility to the engines, but makes them more complex. On the other
hand, the engines don't need to handle higher-level combinators built
on lower-level ones. For example, "one-or-more", "zero-or-more", and
repeat-by-a-count/range are all built with a recursion function that
detects left and right recursion (and builds loops).
Point taken, though. I could think about refactoring some logic between
the layers to minimize code duplication. Another option might be
helper methods/functions.
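As a sketch of that idea (the names are illustrative, not Eric's actual API): a right-recursive one-or-more rule like `elems : elem elems | elem` can be detected and emitted as a loop rather than real recursive calls:

```python
# Sketch: one-or-more built from a right-recursive rule, with the
# recursive tail call replaced by a loop. Matchers take (tokens, pos)
# and return the end position on success, or None on failure.
def one_or_more(match_elem):
    def parse(tokens, pos):
        end = match_elem(tokens, pos)
        if end is None:
            return None         # one-or-more requires at least one match
        while True:
            nxt = match_elem(tokens, end)
            if nxt is None:
                return end      # the loop stands in for the recursion
            end = nxt
    return parse

match_num = lambda toks, i: i + 1 if i < len(toks) and toks[i] == "NUM" else None
match_nums = one_or_more(match_num)
```

Zero-or-more and counted repetition fall out of the same shape by changing the failure handling on the first iteration or bounding the loop, which is presumably why a single recursion-detecting function can cover all three.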
> > Thanks for ANTLR!
>
> My pleasure. You know, I just realized it's been exactly 20 years
> since I started building yucc, the predecessor of ANTLR! Wow, one of
> these days, I had better get another interest! ;)
>
> Ter