Eric Mahurin
9/23/2008 1:58:00 PM
On Tue, Sep 23, 2008 at 4:20 AM, Clifford Heath <no@spam.please.net> wrote:
> Eric Mahurin wrote:
>
>> I still don't know enough about packrat parsing yet. Do you think it's
>> possible for this type of parser to reach the same performance as LL (or
>> LALR) parsers, or is there just extra overhead that you need at parse time?
>>
>
> Unless you can analyse the heck out of the source grammar, as ANTLR does,
> I think there's always going to be an overhead. Terence Parr's achievement
> is to figure out where backtracking might be needed and do it efficiently
> only when needed.
>
> Treetop could be improved with some local optimisations, like using Regexp,
> and perhaps looking at the rest of the rule to reduce memoization, for
> example, which might pick up a 10x improvement or more, but I expect
> there'll always be a large overhead over using a hand-crafted parser.
> There's a suggestion in the wind to implement "skip" rules, that produce
> no syntax nodes, for things like whitespace skipping.
>
> Currently, saying "space*" will generate one node for *each* space - Regexp
> would fix that. So even for a PEG parser, Treetop is unnecessarily slow.
>
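The per-space node cost described above can be sketched in plain Ruby (a hypothetical illustration, not Treetop's actual internals):

```ruby
require "strscan"

input = "    foo"

# "space*" as one node per space: one object allocated per character.
nodes = []
pos = 0
while input[pos] == " "
  nodes << input[pos]  # a real parser would allocate a syntax node here
  pos += 1
end

# The same match as a single Regexp scan: one object for the whole run.
run = StringScanner.new(input).scan(/ */)

puts nodes.length  # 4 objects
puts run.length    # 4 characters, but only one object allocated
```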
>> It does handle lexer-free parsers (but you
>> can have a lexer) and handles LL(*) parsing, but with a performance
>> penalty
>> - backtracking. From my understanding, packrat parsers shine in
>> backtracking by memoizing to maintain linear performance. I'm wondering
>> if
>> I could use this technique and still maintain my non-backtracking
>> performance.
>>
>
> The major part of the cost is the construction of so many objects.
> If you provide memoize-when-hinted, I think that'd be best.
> If you also provided a sweet metagrammar (I find your earlier Grammar
> examples make my eyes bleed), I'd be onto it like a shot. I need the
> prospect of non-Ruby code generation however...
>
I'm curious: could you give an example of what makes your eyes bleed? I
would like to improve the usability where possible. The Grammar layer
itself will always be just Ruby objects with a Ruby DSL. But I would like
to provide translators from various formats (BNF, PEG, regex, etc.) to
Grammar objects. I'm sure I could even give you a faster replacement for
Regexp (using ruby2cext). These translators would simply be Grammars that
build Grammars (or maybe the code, so you could see/edit it).
Also, with Grammar 0.8, I've completely separated the parser generation.
Grammar is just a light-weight layer used to hold the description of the
language you are parsing. The real work is in the "engine" that does parser
generation. There aren't many bounds as to what an "engine" could do with a
Grammar. An engine could:
- generate a parser in another target language (engine might allow an
"action" to be simply a code string)
- directly parse while traversing a Grammar
- generate another type of parser - LALR, packrat ???
- build a Graphviz diagram of a Grammar tree
- "inspect" a Grammar (Grammar#inspect isn't useful because every Grammar
level mainly just has a lambda)
- translate to another format - regex, BNF, PEG
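The grammar/engine split above could be illustrated like this (all class and method names here are invented for the sketch, not Grammar 0.8's real API): the grammar is plain data, and each "engine" traverses it differently.

```ruby
# A grammar tree as plain data structures.
Seq = Struct.new(:children)  # sequence of sub-grammars
Lit = Struct.new(:string)    # literal string match

# Engine 1: parse a string directly by walking the tree.
# Returns the position after the match, or nil on failure.
def parse(node, input, pos = 0)
  case node
  when Lit
    input[pos, node.string.length] == node.string ? pos + node.string.length : nil
  when Seq
    node.children.each { |c| pos = parse(c, input, pos) or return nil }
    pos
  end
end

# Engine 2: translate the same tree to a BNF-ish string.
def to_bnf(node)
  case node
  when Lit then node.string.inspect
  when Seq then node.children.map { |c| to_bnf(c) }.join(" ")
  end
end

g = Seq.new([Lit.new("foo"), Lit.new("bar")])
parse(g, "foobar")  # => 6 (all six characters consumed)
to_bnf(g)           # => "\"foo\" \"bar\""
```

The point is only that the tree itself carries no parsing strategy; each engine decides what to do with it.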
Right now I've implemented only a few engines (some built on each other) -
Ruby (generates very flat LL ruby parsers), RubyCall (puts method calls in
certain places), Ruby0 (parses directly instead of generating a parser),
Ruby2CExt (uses ruby2cext to generate a Ruby/C parser).
> Perhaps an option to "memoize everything" could gather statistics based
> on actual parser behaviour with real input, and produce the backtracking
> hints automatically? That'd be really sweet, because you wouldn't need to
> be a language wizard to work out where to put them.
Personally, I don't like the idea of automatic backtracking. Regexp does
this. The big problem I see with auto-backtracking is that when the parse
fails, you just get a simple "it failed" response. Or you could possibly
generate a tree of possible failing locations (and expectations) in the
input. Nothing as simple as "expected this, got that, at line 123". Ever
try to debug a large Regexp? If it doesn't parse your input, it just
returns false and you have nothing to go on. I see the auto-backtracking
feature of Regexp as the reason it can't do much more than that. If/when I
make a Regexp translator, it would generate a Grammar that can backtrack on
every alternative (except within the (?>...) no-backtrack construct) to be
compatible. I guess an "engine" could also backtrack by default.
In my very first incarnation of Grammar (called Syntax on RAA), I did do
auto-backtracking, but found the above problem. I'd rather it be the
exception than the rule.
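The error-reporting advantage of committing instead of backtracking can be sketched like this (the `expect` helper and its messages are invented for illustration, not from Grammar or Treetop): when a rule fails at a committed point, the parser knows exactly what it wanted and where.

```ruby
# Match `pattern` anchored at `pos`, or raise a precise error.
def expect(input, pos, pattern, what)
  m = pattern.match(input, pos)
  if m && m.begin(0) == pos
    [m[0], pos + m[0].length]
  else
    line = input[0...pos].count("\n") + 1
    got = input[pos, 10].to_s
    got = got.empty? ? "end of input" : got.inspect
    raise "expected #{what}, got #{got}, at line #{line}"
  end
end

input = "123\nabc"
_, pos = expect(input, 0, /\d+/, "a number")    # matches "123"
_, pos = expect(input, pos, /\n/, "a newline")  # matches "\n"
begin
  expect(input, pos, /\d+/, "a number")         # fails on "abc"
rescue => e
  puts e.message  # expected a number, got "abc", at line 2
end
```

A backtracking parser in the same situation would instead unwind and try other alternatives, losing (or blurring) the precise failure point.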
Does Treetop auto-backtrack? How do you deal with error reporting?
If backtracking isn't always on, do you memoize only when you might possibly
backtrack? In Grammar, I'm wondering if it is possible to take the
memoization performance hit only when backtracking is allowed.
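For reference, the packrat memoization under discussion can be sketched minimally like this (invented names, not Grammar's or Treetop's API): each (rule, position) result is cached, so re-trying a rule after backtracking is a hash lookup rather than a re-parse.

```ruby
class PackratParser
  def initialize(input)
    @input = input
    @memo = {}   # [rule_name, pos] => [matched_text, next_pos] or nil
    @calls = 0   # counts actual rule evaluations, to show the cache works
  end

  attr_reader :calls

  # Example rule: one or more digits starting at pos.
  def digits(pos)
    memoize(:digits, pos) do
      @calls += 1
      stop = pos
      stop += 1 while @input[stop] =~ /\d/
      stop > pos ? [@input[pos...stop], stop] : nil
    end
  end

  private

  # Cache both successes and failures (a naive ||= would re-run failures).
  def memoize(rule, pos)
    key = [rule, pos]
    @memo.fetch(key) { @memo[key] = yield }
  end
end

p = PackratParser.new("123+456")
p.digits(0)             # first call: evaluates the rule
text, _ = p.digits(0)   # second call: served from the memo table
puts text               # "123"
puts p.calls            # 1
```

The trade-off is exactly the one raised above: the memo table costs memory and a lookup on every rule application, which is wasted work wherever backtracking can never revisit that position.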
> I'm thinking I could also make an "engine" for Grammar that did
> packrat or even LALR parsing instead of LL parsing.
>
> Nathan was thinking about implementing Treetop using a VM. Truth is,
> if the VM had the right hints, it'd be awesome.
I could also make a Grammar engine for generating VM bytecode if that were
possible.
>> Also, do any of you have a JSON parser for ANTLR with a Ruby target? JSON
>> is what I've been benchmarking with (because of the ruby quiz) and I'd
>> like to compare against ANTLR.
>>
>
> Not I, but I'd be surprised if Google doesn't find one. It'd be pretty
> simple to throw one together anyhow.
I already found one on the ANTLR site, but not with a Ruby target. I didn't
want to go relearn ANTLR (v3 now) and figure out how to get it to generate
Ruby.
> Have you used ANTLRWorks, BTW? It's excellent!
Once. I'd like to make something like this for Grammar (at least an engine
that graphically displays a Grammar).
Eric