Asp Forum - Treetop Email Parser

Phrogz

1/28/2008 10:23:00 PM

(I would post this to the treetop mailing list...except there doesn't
seem to be one.)

Warming up to treetop, at lunch today I ported the official RFC2822
email address specification to treetop. (As a grammar that allows
recursive comments, this is notoriously impossible to match exactly
correctly using regular expressions.)

C:\>irb
irb(main):001:0> require 'treetop'
irb(main):002:0> require 'email_address'
irb(main):003:0> @e = EmailAddressParser.new
irb(main):004:0> @e.parse( 'foo' )
=> nil
irb(main):005:0> @e.parse( 'foo@bar.com' ).pieces
=> ["foo", "bar.com"]
irb(main):006:0> @e.parse( '!@[69.46.18.236]' ).pieces
=> ["!", "[69.46.18.236]"]
irb(main):007:0> @e.parse( '"Gavin Kistner"@[69.46.18.236]' ).pieces
=> ["\"Gavin Kistner\"", "[69.46.18.236]"]

The only 'cheat' I did was to omit any of the branches of the grammar
that were marked as obsolete (and yet included?) in the spec.

I'm not 100% sure that this is working perfectly, however. From the
ABNF spec:
addr-spec = local-part "@" domain
domain = dot-atom / domain-literal / obs-domain
domain-literal = [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]
CFWS = *([FWS] comment) (([FWS] comment) / FWS)
comment = "(" *([FWS] ccontent) [FWS] ")"
ccontent = ctext / quoted-pair / comment
ctext = NO-WS-CTL / ; Non white space controls
%d33-39 / ; The rest of the US-ASCII
%d42-91 / ; characters not including "(",
%d93-126 ; ")", or "\"

From the above, I would think that the following was a legal email
address:
irb(main):008:0> @e.parse( 'foo@[0.0.0.0](ok)' )
=> nil

As shown, however, Treetop fails to parse it. Did I fail to port ABNF
to Treetop properly? Here are the relevant pieces:
rule addr_spec
local_part "@" domain
end
rule domain
dot_atom / domain_literal
end
rule domain_literal
CFWS? "[" (FWS? dcontent)* FWS? "]" CFWS?
end
rule CFWS
(FWS? comment)* ((FWS? comment) / FWS)
end
rule comment
"(" ( FWS? ccontent )* FWS? ")"
end
rule ccontent
ctext / quoted_pair / comment
end
rule ctext
NO_WS_CTL / [\x21-\x27\x2a-\x5b\x5d-\x7e]
end

Following is the full RFC2282 Treetop grammar, in case anyone wants to
play with it.

# http://tools.ietf.org/ht...
# All obsolete rules have been removed
grammar EmailAddress

rule addr_spec
local_part "@" domain {
def pieces
[ local_part.text_value, domain.text_value ]
end
}
end

rule local_part
dot_atom / quoted_string
end

rule domain
dot_atom / domain_literal
end

rule domain_literal
CFWS? "[" (FWS? dcontent)* FWS? "]" CFWS?
end

rule dcontent
dtext / quoted_pair
end

rule dtext
NO_WS_CTL / # Non white space controls
[\x21-\x5a\x5e-\x7e] # The rest of the US-ASCII characters
# not including "[", "]", or "\"
end

# Non-whitespace control characters
rule NO_WS_CTL
[\x01-\x08\x0b-\x0c\x0e-\x1f\x7f]
end

rule dot_atom
CFWS? dot_atom_text CFWS?
end

rule dot_atom_text
atext+ ( "." atext+ )*
end

# folding white space
rule FWS
(WSP* CRLF)? WSP+
end

rule CFWS
(FWS? comment)* ((FWS? comment) / FWS)
end

rule CRLF
"\r\n"
end

rule WSP
[ \t]
end
# Any character except controls, SP, and specials.
rule atext
ALPHA / DIGIT / [!#$\%&'*+\/=?^_`{|}~-]
end

rule ALPHA
[A-Za-z]
end

rule DIGIT
[0-9]
end

rule text
[\x01-\x09\x0b-\x0c\x0e-\x7f]
end

rule specials
[()<>\[\]:;@\\,.] / DQUOTE
end

rule DQUOTE
'"'
end

rule ccontent
ctext / quoted_pair / comment
end

rule quoted_pair
"\\" text
end

rule qtext
NO_WS_CTL / # Non white space controls
[0x21\x23-\x5b\x5d-\x7e] # The rest of the US-ASCII characters
# not including "\" or the quote
character
end

rule qcontent
qtext / quoted_pair
end

rule quoted_string
CFWS? DQUOTE (FWS? qcontent)* FWS? DQUOTE CFWS?
end

rule comment
"(" ( FWS? ccontent )* FWS? ")"
end

rule ctext
NO_WS_CTL / # Non white space controls
[\x21-\x27\x2a-\x5b\x5d-\x7e] # The rest of the US-ASCII
characters
# not including "(", ")", or "\"
end
end

13 Answers

Clifford Heath

1/28/2008 11:58:00 PM

Phrogz wrote:
> (I would post this to the treetop mailing list...except there doesn't
> seem to be one.)

It's fine here. A lot of folk here could benefit from using Treetop,
if only they could work out how to. Discussion here can help them.

Your problem lies in the direct translation of this rule, which is
poorly written in the RFC:

> CFWS = *([FWS] comment) (([FWS] comment) / FWS)

to:

> rule CFWS
> (FWS? comment)* ((FWS? comment) / FWS)
> end

The issue is that the first parenthesized group is a node that
will eat too much, succeed, then cause the second group to fail.
This is a feature of PEG parsers in general: once a node has
succeeded, it will never be retried even though there may be a
different way it can succeed. Any backtracking that comes to the
same point will remember that it succeeded here once before, and
will assume that the *same* success is correct.

You need to rewrite it to avoid this incorrect success. This rule
is equivalent:

rule CFWS
( FWS / comment )+
end

Just a bit simpler, eh?

Clifford Heath.

P.S. It'd be nice to know whether my other Treetop messages were useful to you...?

Phrogz

1/29/2008 3:41:00 AM

On Jan 28, 4:57 pm, Clifford Heath <n...@spam.please.net> wrote:
> Phrogz wrote:
> > (I would post this to the treetop mailing list...except there doesn't
> > seem to be one.)
>
> It's fine here. A lot of folk here could benefit from using Treetop,
> if only they could work out how to. Discussion here can help them.

Sure. I'd also like a place to send bug reports, however. So far, I've
been sending them to the only contact address I can find for the
entire project - Nathan's email as listed in the gemspec.

> Your problem lies in the direct translation of this rule, which is
> poorly written in the RFC:
>
> > CFWS = *([FWS] comment) (([FWS] comment) / FWS)
>
> to:
>
> > rule CFWS
> > (FWS? comment)* ((FWS? comment) / FWS)
> > end
>
> The issue is that the first parenthesized group is a node that
> will eat too much, succeed, then cause the second group to fail.
> This is a feature of PEG parsers in general: once a node has
> succeeded, it will never be retried even though there may be a
> different way it can succeed. Any backtracking that comes to the
> same point will remember that it succeeded here once before, and
> will assume that the *same* success is correct.
>
> You need to rewrite it to avoid this incorrect success. This rule
> is equivalent:
>
> rule CFWS
> ( FWS / comment )+
> end
>
> Just a bit simpler, eh?

You know, I looked at that rule (and some like it) and said "Man, what
were they smoking!?" But yeah, this was a quick translation over lunch
and I explicitly didn't want to try translating anything.

To be fair, that's not quite the same rule, right? Your rule would
allow FWS FWS FWS, while the original requires a comment between them.
(FWS itself only eats one CRLF at a time, so this change allows blank
lines where they were not allowed before. I think.)

I appreciate the help and insight. This insight scares me, I must
admit.

I was happy to accept the no-left-recursion limitation of the grammar,
because it's easy to spot and the downside is infinite recursion. If
it wasn't obvious when writing the grammar, it would be reasonably
obvious at runtime when your parse never finished.

But what you describe here seems far more insidious. This limitation
(which I assume is not in fact limited to PEG in general, but
specifically to the packrat memoization technique that makes this
solution perform reasonably) says to me: There are some cases where
you can write something that looks correct and technically *is*
correct, but that will silently fail on otherwise valid content.
Please (definitely) correct me if I'm wrong. I definitely don't want
to spread FUD.

Can you provide any guidelines or reading points on how to recognize a
set of rules that may fail like this? Is it limited to use of the zero-
or-more star symbol? Particularly likely to occur when any star or
plus quantifier appears at the start of a rule?

Thanks again for your help. I'm still excited about treetop, but less
so as I find that I actually have to know some theory about the
language I'm writing in (*gasp*) instead of being able to naively
throw together BNF-like rules and have it magically all work out. ;)

Phrogz

1/29/2008 4:46:00 AM

On Jan 28, 8:41 pm, Phrogz <phr...@mac.com> wrote:
> On Jan 28, 4:57 pm, Clifford Heath <n...@spam.please.net> wrote:
> > This is a feature of PEG parsers in general: once a node has
> > succeeded, it will never be retried even though there may be a
> > different way it can succeed. Any backtracking that comes to the
> > same point will remember that it succeeded here once before, and
> > will assume that the *same* success is correct.
....
> ... This limitation
> (which I assume is not in fact limited to PEG in general, but
> specifically to the packrat memoization technique that makes this
> solution perform reasonably) ...

My apologies for second guessing you. Having read the PEG paper[1], I
see now what you're saying. This PEG rule:
[aeiou]+ 'e'
behaves quite differently from this regular expression:
/[aeiou]+e/

I had thought - based on the use of the terms 'greedy' and
'backtracking' - that the above PEG rule would be able to match a
string like "aaae", just like the regexp. I see now that PEG in
general (packrat or not) are really, REALLY greedy. The above PEG rule
doesn't match that string because [aeiou]+ consumes the whole string,
and when it fails to find an 'e', it does NOT backtrack on the number
of repetitions of the character class.

Wow.

So...how would you write a PEG rule to match "a string of any number
of vowels, that ends with an 'e'"?

[1] http://pdos.csail.mit.edu/~baford/packrat/popl04/peg-...

Clifford Heath

1/29/2008 5:06:00 AM

Phrogz wrote:
> On Jan 28, 4:57 pm, Clifford Heath <n...@spam.please.net> wrote:
>> This rule is equivalent:
>> rule CFWS
>> ( FWS / comment )+
>> end
>> Just a bit simpler, eh?
> You know, I looked at that rule (and some like it) and said "Man, what
> were they smoking!?" But yeah, this was a quick translation over lunch
> and I explicitly didn't want to try translating anything.
>
> To be fair, that's not quite the same rule, right? Your rule would
> allow FWS FWS FWS, while the original requires a comment between them.

Well, yes-ish. If you look at the definition of FWS, it matches any non-zero
amount of whitespace, as long as it has only one CRLF (I missed this 2nd
restriction when checking equivalence). The easiest way to fix this is to
separate the CRLF from other whitespace, and ensure that there is comment
text (potentially containing whitespace) on every line except the first and
last, or something like that. An exercise for the reader ;-).

> I was happy to accept the no-left-recursion limitation of the grammar,
> because it's easy to spot and the downside is infinite recursion.

Yes, that's not something that's actually happened to me, so I've never
even needed to spot it :-). I ran into it big-time when I first used
TXL though!

> But what you describe here seems far more insidious. This limitation
> (which I assume is not in fact limited to PEG in general, but
> specifically to the packrat memoization technique that makes this
> solution perform reasonably)

Correct, I meant the packrat technique, not PEG.

> says to me: There are some cases where
> you can write something that looks correct and technically *is*
> correct, but that will silently fail on otherwise valid content.

There's a reason why Nathan points out that among a set of alternates,
the *first* one that succeeds is taken. That means, even if that makes a
higher-level rule fail, it won't try subsequent alternatives. Yes, it's
a trap, but once you get into the right thought pattern, it's not that
hard to avoid. You basically need to consider where you might mis-read
the text yourself, and introduce negative lookahead that prevents the
mis-reading.

Oh, and use TDD :-). You need that to ensure that whitespace is
skipped, or expected, everywhere it should, if for no other reason.
You just won't get it right with TT otherwise... but you won't with
any other tool either.

> Can you provide any guidelines or reading points on how to recognize a
> set of rules that may fail like this? Is it limited to use of the zero-
> or-more star symbol? Particularly likely to occur when any star or
> plus quantifier appears at the start of a rule?

Not just at the start of a rule, but anywhere in the rule. You need to
ensure that you insert a !wrong_path predicate to limit the repetition,
any time a repetition might chew too much. Likewise for alternation,
make sure you get the alternates in the right order, and dress any with
negative lookahead when an early choice might be wrong.

> Thanks again for your help. I'm still excited about treetop, but less
> so as I find that I actually have to know some theory about the
> language I'm writing in (*gasp*) instead of being able to naively
> throw together BNF-like rules and have it magically all work out. ;)

Have a play with ANTLRworks, which will analyze things and tell you when
you have a problem, and when you find that ANTLR's other restrictions
begin to frustrate the crap out of you (as happened to me), come back to
using Treetop. :-)

Does anyone else wish for the ability of Treetop to emit languages
other than Ruby? I'm probably going to need C# one day, and perhaps
Java.

Clifford Heath.

Clifford Heath

1/29/2008 5:16:00 AM

Phrogz wrote:
> On Jan 28, 8:41 pm, Phrogz <phr...@mac.com> wrote:
> My apologies for second guessing you. Having read the PEG paper[1], I
> see now what you're saying.

Well, though the authors of the packrat paper described a parser that
behaves that way, I don't think it's fair to limit the possible other
behaviours of any PEG parser generator. IOW, PEG >= packrat. ANTLR is
also LL(*), but avoids this problem by analyzing the expressions and
either rejecting or re-writing them.

> So...how would you write a PEG rule to match "a string of any number
> of vowels, that ends with an 'e'"?

Assuming some EOF token (not provided free by Treetop) you'd write:

(!('e' EOF) [aeiou])* 'e'

Of course, in many cases, EOF would be replaced by a white-space or
other rule.

Clifford Heath.

Vidar Hokstad

1/29/2008 3:48:00 PM

On Jan 28, 10:22 pm, Phrogz <phr...@mac.com> wrote:
> The only 'cheat' I did was to omit any of the branches of the grammar
> that were marked as obsolete (and yet included?) in the spec.

The "obsolete" productions were not obsolete in RFC 822, and hence
there may still be
a significant number of clients out there _capable_ of emitting e-mail
addresses using
that syntax.

However, in practice, most clients generate a very limited subset of
RFC 822 and/or
RFC 2822 compliant addresses - it's likely safe to ignore the obsolete
rules. 8-9
years ago, when I used to run a webmail system with 1-2 million
accounts, we didn't
even bother trying to parse the full RFC as we never saw e-mail coming
in using the more
"exotic" features, such as nested comments or quoted local parts.

Vidar

Matt Mower

1/29/2008 5:12:00 PM

On 1/29/08, Clifford Heath <no@spam.please.net> wrote:
> Does anyone else wish for the ability of Treetop to emit languages
> other than Ruby? I'm probably going to need C# one day, and perhaps
> Java.

Would I be alone in finding Objective-C support useful?

Regards,

Matt.

--
Matt Mower :: http://matt...

rubyfan

1/29/2008 6:00:00 PM

On 1/29/08, Matt Mower <matt.mower@gmail.com> wrote:
> On 1/29/08, Clifford Heath <no@spam.please.net> wrote:
> > Does anyone else wish for the ability of Treetop to emit languages
> > other than Ruby? I'm probably going to need C# one day, and perhaps
> > Java.
>
> Would I be alone in finding Objective-C support useful?
>

An Objective-C backend for TreeTop would rock!

That said, looking at the output from tt it seems that it's very much
oriented towards Ruby's dynamicity - extend is used extensively, for
example.

Not saying an Objective-C backend would be impossible, though....

Phil

Clifford Heath

1/29/2008 10:00:00 PM

Phil Tomson wrote:
> On 1/29/08, Matt Mower <matt.mower@gmail.com> wrote:
>> Would I be alone in finding Objective-C support useful?
> An Objective-C backend for TreeTop would rock!

Once the basic framework for multiple output languages was there,
it wouldn't be hard to add. At the moment there's a Ruby builder,
but the interface doesn't allow a simple substitution of another
builder. A fairly simple cleanup would fix that. I'd also like it
to emit more meaningful names, like "rulename_result" and
"rulename_sequence" instead of "r1" and "s0".

> That said, looking at the output from tt it seems that it's very much
> oriented towards Ruby's dynamicity - extend is used extensively, for
> example.
> Not saying an Objective-C backend would be impossible, though....

Not at all. The code blocks would have to be emitted into new classes
for Java. For C#, they'd be partial classes. I don't know Obj-C, but
I doubt it's harder than those two.

Clifford Heath.

Matt Mower

1/29/2008 11:43:00 PM

On 1/29/08, Clifford Heath <no@spam.please.net> wrote:
> Not at all. The code blocks would have to be emitted into new classes
> for Java. For C#, they'd be partial classes. I don't know Obj-C, but
> I doubt it's harder than those two.
>

No, probably easier.

Once basic support for other languages is there I would be glad to get
involved in generating the Obj-C code.

I already have a use in mind for Treetop + Objective-C + Nu.

M.

--
Matt Mower :: http://matt...

comp.lang.ruby

Treetop Email Parser

Phrogz

Clifford Heath

Phrogz

Phrogz

Clifford Heath

Clifford Heath

Vidar Hokstad

Matt Mower

rubyfan

Clifford Heath

Matt Mower

x Login to ForumsZone