Eric Mahurin
9/23/2008 1:58:00 PM
On Tue, Sep 23, 2008 at 4:20 AM, Clifford Heath <no@spam.please.net> wrote:
> Eric Mahurin wrote:
>
>> I still don't know enough about packrat parsing yet. Do you think it's
>> possible for this type of parser to reach the same performance as LL (or
>> LALR) parsers, or is there just extra overhead that you need at parse time?
>>
>
> Unless you can analyse the heck out of the source grammar, as ANTLR does,
> I think there's always going to be an overhead. Terence Parr's achievement
> is to figure out where backtracking might be needed and do it efficiently
> only when needed.
>
> Treetop could be improved with some local optimisations, like using Regexp,
> and perhaps looking at the rest of the rule to reduce memoization, for
> example, which might pick up a 10x improvement or more, but I expect
> there'll always be a large overhead over using a hand-crafted parser.
> There's a suggestion in the wind to implement "skip" rules, that produce
> no syntax nodes, for things like whitespace skipping.
>
> Currently, saying "space*" will generate one node for *each* space - Regexp
> would fix that. So even for a PEG parser, Treetop is unnecessarily slow.
>
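The per-space node cost described above can be sketched in plain Ruby (a hypothetical illustration, not Treetop's actual internals):

```ruby
require "strscan"

input = "    foo"

# "space*" as one node per space: one object allocated per character.
nodes = []
pos = 0
while input[pos] == " "
  nodes << input[pos]  # a real parser would allocate a syntax node here
  pos += 1
end

# The same match as a single Regexp scan: one object for the whole run.
run = StringScanner.new(input).scan(/ */)

puts nodes.length  # 4 objects
puts run.length    # 4 characters, but only one object allocated
```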
>> It does handle lexer-free parsers (but you
>> can have a lexer) and handles LL(*) parsing, but with a performance
>> penalty
>> - backtracking. From my understanding, packrat parsers shine in
>> backtracking by memoizing to maintain linear performance. I'm wondering
>> if
>> I could use this technique and still maintain my non-backtracking
>> performance.
>>
>
> The major part of the cost is the construction of so many objects.
> If you provide memoize-when-hinted, I think that'd be best.
> If you also provided a sweet metagrammar (I find your earlier Grammar
> examples make my eyes bleed), I'd be onto it like a shot. I need the
> prospect of non-Ruby code generation however...
>
I'm curious: could you give an example of what makes your eyes bleed? I
would like to improve the usability where possible. The Grammar layer
itself will always be just Ruby objects with a Ruby DSL. But I would like
to provide translators from various formats (BNF, PEG, regex, etc.) to
Grammar objects. I'm sure I could even give you a faster replacement for
Regexp (using ruby2cext). These translators would simply be Grammars that
build Grammars (or maybe the code, so you could see/edit it).
Also, with Grammar 0.8, I've completely separated the parser generation.
Grammar is just a light-weight layer used to hold the description of the
language you are parsing. The real work is in the "engine" that does parser
generation. There aren't many bounds as to what an "engine" could do with a
Grammar. An engine could:
- generate a parser in another target language (engine might allow an
"action" to be simply a code string)
- directly parse while traversing a Grammar
- generate another type of parser - LALR, packrat ???
- build a Graphviz diagram of a Grammar tree
- "inspect" a Grammar (Grammar#inspect isn't useful because every Grammar
level mainly just has a lambda)
- translate to another format - regex, BNF, PEG
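The grammar/engine split above could be illustrated like this (all class and method names here are invented for the sketch, not Grammar 0.8's real API): the grammar is plain data, and each "engine" traverses it differently.

```ruby
# A grammar tree as plain data structures.
Seq = Struct.new(:children)  # sequence of sub-grammars
Lit = Struct.new(:string)    # literal string match

# Engine 1: parse a string directly by walking the tree.
# Returns the position after the match, or nil on failure.
def parse(node, input, pos = 0)
  case node
  when Lit
    input[pos, node.string.length] == node.string ? pos + node.string.length : nil
  when Seq
    node.children.each { |c| pos = parse(c, input, pos) or return nil }
    pos
  end
end

# Engine 2: translate the same tree to a BNF-ish string.
def to_bnf(node)
  case node
  when Lit then node.string.inspect
  when Seq then node.children.map { |c| to_bnf(c) }.join(" ")
  end
end

g = Seq.new([Lit.new("foo"), Lit.new("bar")])
parse(g, "foobar")  # => 6 (all six characters consumed)
to_bnf(g)           # => "\"foo\" \"bar\""
```

The point is only that the tree itself carries no parsing strategy; each engine decides what to do with it.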
Right now I've implemented only a few engines (some built on each other) -
Ruby (generates very flat LL ruby parsers), RubyCall (puts method calls in
certain places), Ruby0 (parses directly instead of generating a parser),
Ruby2CExt (uses ruby2cext to generate a Ruby/C parser).
> Perhaps an option to "memoize everything" could gather statistics based
> on actual parser behaviour with real input, and produce the backtracking
> hints automatically? That'd be really sweet, because you wouldn't need to
> be a language wizard to work out where to put them.
Personally, I don't like the idea of automatic backtracking. Regexp does
this. The big problem I see with auto-backtracking is that when the parse
fails, you just get a simple "it failed" response. Or you could possibly
generate a tree of possible failing locations (and expectations) in the
input. Nothing as simple as "expected this, got that, at line 123". Ever
try to debug a large Regexp? If it doesn't parse your input, it just
returns false and you have nothing to go on. I see the auto-backtracking
feature of Regexp as the reason it can't do much more than that. If/when I
make a Regexp translator, it would generate a Grammar that can backtrack on
every alternative (except within the (?>...) no-backtrack construct) to be
compatible. I guess an "engine" could also backtrack by default.
In my very first incarnation of Grammar (called Syntax on RAA), I did do
auto-backtracking, but found the above problem. I'd rather it be the
exception than the rule.
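The error-reporting advantage of committing instead of backtracking can be sketched like this (the `expect` helper and its messages are invented for illustration, not from Grammar or Treetop): when a rule fails at a committed point, the parser knows exactly what it wanted and where.

```ruby
# Match `pattern` anchored at `pos`, or raise a precise error.
def expect(input, pos, pattern, what)
  m = pattern.match(input, pos)
  if m && m.begin(0) == pos
    [m[0], pos + m[0].length]
  else
    line = input[0...pos].count("\n") + 1
    got = input[pos, 10].to_s
    got = got.empty? ? "end of input" : got.inspect
    raise "expected #{what}, got #{got}, at line #{line}"
  end
end

input = "123\nabc"
_, pos = expect(input, 0, /\d+/, "a number")    # matches "123"
_, pos = expect(input, pos, /\n/, "a newline")  # matches "\n"
begin
  expect(input, pos, /\d+/, "a number")         # fails on "abc"
rescue => e
  puts e.message  # expected a number, got "abc", at line 2
end
```

A backtracking parser in the same situation would instead unwind and try other alternatives, losing (or blurring) the precise failure point.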
Does Treetop auto-backtrack? How do you deal with error reporting?
If backtracking isn't always on, do you memoize only when you might possibly
backtrack? In Grammar, I'm wondering if it is possible to take the
memoization performance hit only when backtracking is allowed.
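For reference, the packrat memoization under discussion can be sketched minimally like this (invented names, not Grammar's or Treetop's API): each (rule, position) result is cached, so re-trying a rule after backtracking is a hash lookup rather than a re-parse.

```ruby
class PackratParser
  def initialize(input)
    @input = input
    @memo = {}   # [rule_name, pos] => [matched_text, next_pos] or nil
    @calls = 0   # counts actual rule evaluations, to show the cache works
  end

  attr_reader :calls

  # Example rule: one or more digits starting at pos.
  def digits(pos)
    memoize(:digits, pos) do
      @calls += 1
      stop = pos
      stop += 1 while @input[stop] =~ /\d/
      stop > pos ? [@input[pos...stop], stop] : nil
    end
  end

  private

  # Cache both successes and failures (a naive ||= would re-run failures).
  def memoize(rule, pos)
    key = [rule, pos]
    @memo.fetch(key) { @memo[key] = yield }
  end
end

p = PackratParser.new("123+456")
p.digits(0)             # first call: evaluates the rule
text, _ = p.digits(0)   # second call: served from the memo table
puts text               # "123"
puts p.calls            # 1
```

The trade-off is exactly the one raised above: the memo table costs memory and a lookup on every rule application, which is wasted work wherever backtracking can never revisit that position.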
> I'm thinking I could also make an "engine" for Grammar that did
> packrat or even LALR parsing instead of LL parsing.
>
> Nathan was thinking about implementing Treetop using a VM. Truth is,
> if the VM had the right hints, it'd be awesome.
I could also make a Grammar engine for generating VM bytecode if that were
possible.
>> Also, do any of you have a JSON parser for ANTLR with a Ruby target? JSON
>> is what I've been benchmarking with (because of the ruby quiz) and I'd
>> like to compare against ANTLR.
>>
>
> Not I, but I'd be surprised if Google doesn't find one. It'd be pretty
> simple to throw one together anyhow.
I already found one on the ANTLR site, but not with a Ruby target. I didn't
want to go relearn ANTLR (v3 now) and figure out how to get it to generate
Ruby.
> Have you used ANTLRWorks, BTW? It's excellent!
Once. I'd like to make something like this for Grammar (at least an engine
that graphically displays a Grammar).
Eric