
comp.lang.ruby

Parsers vs. Homemade "Parsing" via REs

Randy Kramer

2/27/2005 2:54:00 PM

I have the need to translate several megabytes of TWiki marked-up text to
HTML. In fact, it may not even be a one-time thing--I'm planning to build a
wiki-like thing, and will probably keep TWiki markup as the native language
for storage. Over time, I may extend the TWiki markup syntax.

(Aside: I'm aware that Ruby has two other markup languages sometimes used for
wikis (Red Cloth (?) and Text <something>?)--I may someday support those as
well, but I'm not immediately interested in converting the megabytes of data
I have and learning a different markup language.)

My impression is that some or many wikis (this is my impression of TWiki)
don't use a "real parser" (like YACC or whatever), but instead simply search
and replace in the text using (many) Regular Expressions. Conceptually, that
seems an easy approach (less learning on my part, but probably tedious
creation of many REs, or borrowing from TWiki's Perl).

I've never used a parser, and am not really that familiar with them, but I'm
wondering what the tradeoffs might be. I've heard that a parser may be
easier to modify to extend the syntax. (But, I have a feeling that the TWiki
markup language might not be "regular" enough to be parsed by something like
YACC.):

* If I did create the proper grammar rules, would parsing using something
like YACC be faster than a bunch of RE replacements?

* Any recommendations for a parser in Ruby? I think there are a couple,
I've been doing some Googling / reading and have come across references to
parse.rb and (iirc) something called Coco (??).

* Anybody know the approach followed by existing Ruby wikis like Instiki,
Ruwiki, etc.?

Other comments, hints, suggestions?

Randy Kramer



9 Answers

Austin Ziegler

2/27/2005 3:11:00 PM


On Sun, 27 Feb 2005 23:54:02 +0900, Randy Kramer
<rhkramer@gmail.com> wrote:
> * If I did create the proper grammar rules, would parsing using
> something like YACC be faster than a bunch of RE replacements?

> * Any recommendations for a parser in Ruby? I think there are a
> couple, I've been doing some Googling / reading and have come
> across references to parse.rb and (iirc) something called Coco
> (??).
>
> * Anybody know the approach followed by existing Ruby wikis like
> Instiki, Ruwiki, etc.?

Ruwiki uses regular expressions to convert. There are portions of this
that I will probably be moving toward a more traditional parser
-- particularly those things which aren't converting well to XHTML
(I am attempting to ensure XHTML compliance in Ruwiki's generated
output).

As I understand it, RedCloth (which supports both Markdown and
Textile in version 3) does everything through regexen as well.

The biggest problem with regexen is not speed -- they're FAST. The
biggest problem is interaction of regexen. Ruwiki solves this, in
part, by adding priorities to tokens.
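
To make the clash concrete (a made-up fragment, not Ruwiki's actual
code): say one rule turns "--" into an en dash and another turns a
line of "---" into a horizontal rule. Only one ordering gives the
intended result:

  text = "---\nranges like 1--10\n"

  dash  = [/--/,    '&ndash;']
  hrule = [/^---$/, '<hr />']

  # Wrong order: the en-dash rule eats the first two hyphens of "---",
  # so the horizontal-rule pattern can never match afterwards.
  [dash, hrule].inject(text) { |s, (re, rep)| s.gsub(re, rep) }
  #=> "&ndash;-\nranges like 1&ndash;10\n"

  # Giving the more specific rule higher priority fixes the clash.
  [hrule, dash].inject(text) { |s, (re, rep)| s.gsub(re, rep) }
  #=> "<hr />\nranges like 1&ndash;10\n"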

-austin
--
Austin Ziegler * halostatue@gmail.com
* Alternate: austin@halostatue.ca


Thomas Counsell

2/27/2005 4:18:00 PM


Instiki uses RedCloth, which converts a language called Textile. The
other markup languages I've seen around are BlueCloth, which converts
a markup language called Markdown, and the built-in RDoc.
AFAIK none are very close to TWiki markup.

Austin is right that RedCloth uses a bunch of regexps rather than a
parser, and that the problem with this is clashing between regexps
rather than speed.

The techniques I've seen in the RedCloth code to reduce clashing are:
1) Complicated regexps that try to be very, very careful about what
they match (but consequently can be vulnerable to stack overflows)
2) Being very careful about the order of the RE replacements
3) Doing the replacement in two stages: first an RE matches a symbol
(e.g. a URL) and replaces it in the text with a code (e.g. $1); then,
once all the REs have run, it goes back through the text replacing the
codes (e.g. $1) with the desired output (<a href=...), roughly as
sketched below.
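
Roughly like this -- invented for illustration, RedCloth's actual
placeholder scheme differs: URLs are pulled out first and hidden
behind opaque placeholders so the later rules cannot touch them, then
the placeholders are expanded at the end.

  def to_html(text)
    stash = []

    # Stage 1: replace each URL with a NUL-delimited placeholder so the
    # emphasis rule below cannot match inside it.
    text = text.gsub(%r{http://\S+}) do |url|
      stash << %(<a href="#{url}">#{url}</a>)
      "\x00#{stash.size - 1}\x00"
    end

    # The ordinary RE replacements run on what is left.
    text = text.gsub(/\*(.+?)\*/, '<strong>\1</strong>')

    # Stage 2: expand the placeholders back into the finished markup.
    text.gsub(/\x00(\d+)\x00/) { stash[$1.to_i] }
  end

  to_html("*really* see http://example.org/a*b*c")
  #=> "<strong>really</strong> see <a href=\"http://example.org/a*b*c\">...</a>"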

I had a go at re-implementing RedCloth in a more parser-like fashion
(although not using any external library) and found it much easier to
code and understand conceptually, but it worked out much, much slower.
With hindsight, I think that parsers will always tend to be slower
because in wiki markup (unlike in programming languages) most of the
text is not symbolically important, so it is quicker to look for the
symbols using a bunch of RE replacements than to consider each bit
of string and ask whether it is a symbol.

Having said that, if you choose a parser that spits out C code then it
may be quicker than RE subs in pure ruby. There do appear to be some
libraries for this, but I'm afraid I haven't any experience to pass on.

Hope that helps

Tom



Nikolai Weibull

2/27/2005 6:15:00 PM


* Randy Kramer (Feb 27, 2005 16:00):
> My impression is that some or many wikis (this is my impression of
> TWiki) don't use a "real parser" (like YACC or whatever), but instead
> simply search and replace in the text using (many) Regular
> Expressions. Conceptually, that seems an easy approach (less learning
> on my part, but probably tedious creation of many REs (or borrowing
> from TWiki's Perl).

Yes, this is a rather horrendous misuse of regular expressions. It
works, but it's very brittle. As Robert Klemme said, the biggest
problem isn't speed (although it is certainly an issue to factor in),
but rather that the interaction between the regular expressions may be
non-obvious. Using a proper grammar, you can be clear about how your
data is being parsed.

The place for regular expressions is not in describing grammars, but
in describing tokens.
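
For example (a hypothetical sketch using the standard library's
StringScanner, not any particular wiki's code), the regular
expressions describe only the individual tokens, and a separate layer
decides what the token sequence means:

  require 'strscan'

  # Each regexp describes one token; none of them tries to describe the
  # overall document structure.
  TOKENS = [
    [:url,   %r{http://\S+}],
    [:star,  /\*/],
    [:slash, %r{/}],
    [:word,  %r{[^\s*/]+}],
    [:space, /\s+/]
  ]

  def tokenize(src)
    s = StringScanner.new(src)
    tokens = []
    until s.eos?
      kind, _ = TOKENS.find { |_, re| s.scan(re) }
      raise "cannot tokenize at #{s.rest[0, 20].inspect}" unless kind
      tokens << [kind, s.matched]
    end
    tokens
  end

  tokenize("*hi* http://example.org/")
  #=> [[:star, "*"], [:word, "hi"], [:star, "*"], [:space, " "],
  #    [:url, "http://example.org/"]]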

> ... I have a feeling that the TWiki markup language might not be
> "regular" enough to be parsed by something like YACC ...

Eh, "regular" is a weird word to use. Regular expressions map onto
something called regular languages or regular sets. Regular languages
are more constrained in what constructions may be used in the language
than for a context-free language. Context-free languages are what are
usually described by a parser-generator like YACC or RACC (the Ruby
equivalent of the heavily C-bound YACC).

Most markup languages (and perhaps most non-natural languages) are
context-free in nature. Thus, using a parser-generator like RACC fits
well with what you are trying to do.

I couldn't say what kind of language TWiki markup is, though, as I am
not familiar with its syntax/grammar.

> * If I did create the proper grammar rules, would parsing using
> something like YACC be faster than a bunch of RE replacements?

Yes. For any non-trivial problem, this would be the case. Note,
however, that you still need to feed tokens to [YR]ACC, and this can be
slow if done wrong. Also, it may be appropriate to write your own
parser from scratch, if the grammar is simple enough. A
recursive-descent parser is easier to implement and understand than
a state-machine one (like those created by [YR]ACC). Of course, the
fact that they create the state machine for you is good, but it may
still be troublesome to understand how the grammar rules interact.
This becomes obvious in a recursive-descent parser.

> * Any recommendations for a parser in Ruby? I think there are a
> couple, I've been doing some Googling / reading and have come
> across references to parse.rb and (iirc) something called Coco
> (??).

Check out RACC.

nikolai

--
::: name: Nikolai Weibull :: aliases: pcp / lone-star / aka :::
::: born: Chicago, IL USA :: loc atm: Gothenburg, Sweden :::
::: page: www.pcppopper.org :: fun atm: gf,lps,ruby,lisp,war3 :::
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}


why the lucky stiff

2/27/2005 6:50:00 PM


Randy Kramer wrote:
> * If I did create the proper grammar rules, would parsing using something
> like YACC be faster than a bunch of RE replacements?
>
> * Any recommendations for a parser in Ruby? I think there are a couple,
> I've been doing some Googling / reading and have come across references to
> parse.rb and (iirc) something called Coco (??).

I've written probably twenty different parsing libraries -- including
six complete revisions of the YAML parser -- and I personally favor
writing the parser in C (with Bison and Re2c). I find the two as easy
to use as any Ruby equivalents and re2c's lexer lets me control the
buffer better -- in case I need to turn a token into multiple events or
backtrack.

I hope no one is using RedCloth's parser as an example. It's awful.
It's like when Ron Popeil injects garlic into a turkey using that freaky
flavor syringe and you're like, "Grrrosss, he killed the turkey,
polished it up and forced it to be a junkie." RedCloth is all the
regexps from textile.php and markdown.pl forced into a hot, enclosed area --

It's a breeding ground for moth and ringworm. I have two evil Bad Luke
hands afterwards, which first try to strangle me and, failing that, then
go after people on the street with Ron Popeil's patented rib cage shears.

_why



Thomas

2/28/2005 9:16:00 AM


> (Aside: I'm aware that Ruby has two other markup languages sometimes used for
> wikis (Red Cloth (?) and Text <something>?)--I may someday support those as
> well, but I'm not immediately interested in converting the megabytes of data
> I have and learning a different markup language.)
>
> My impression is that some or many wikis (this is my impression of TWiki)
> don't use a "real parser" (like YACC or whatever), but instead simply search
> and replace in the text using (many) Regular Expressions. Conceptually, that
> seems an easy approach (less learning on my part, but probably tedious
> creation of many REs (or borrowing from TWiki's Perl).

<plug>deplate (http://depl...)</plug> uses a syntax that has some
similarities with TWiki. It uses REs for tokenizing the text and
turning the input string into arrays of objects. Basically, deplate is
meant to be adaptable to different wiki-like markups, but I haven't had
the time yet to get there.

Cheers,
Thomas.



Rob .

2/28/2005 3:13:00 PM


My implementation of an "auto-indent and insert end" feature in a
jEdit Ruby Plugin[1] uses regular expressions to determine if an end
is required. I also use regexes to identify classes and methods for a
"file structure popup" feature.

In my experience it was easy to implement with regexes early on, but
the maintenance and enhancement have proved troublesome as I've come
across more obscure syntax situations. I'm now investigating using a
version of JRuby[2] that has been modified by the Eclipse Ruby
Development Tool[3] team.

Rob

[1]http://community.jedit.org/cgi-bin/TWiki/view/Main/...
[2]http://jruby.source...
[3]http://rubyeclipse.source...



Mark Probert

2/28/2005 6:41:00 PM


Hi ..

On Sunday 27 February 2005 06:54, Randy Kramer wrote:
> I have the need to translate several megabytes of TWiki marked up text to
> HTML. ...
>
> * If I did create the proper grammar rules, would parsing using
> something like YACC be faster than a bunch of RE replacements?
>
I think that your mileage may vary, depending on the nature of the text markup
and how complicated you want the HTML. If it is something like tag
replacement, then regexps are probably the way to go. They will be, on the
whole, faster to implement for such a "one-off" job.
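
For instance, a made-up example of the tag-replacement case (TWiki-ish
"---+" headings; illustrative only, real TWiki headings have more
levels and options):

  text = "---+ Introduction\nsome body text\n"
  text.gsub(/^---\+ (.*)$/) { "<h1>#{$1}</h1>" }
  #=> "<h1>Introduction</h1>\nsome body text\n"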

> * Any recommendations for a parser in Ruby? I think there are a couple,
> I've been doing some Googling / reading and have come across references to
> parse.rb and (iirc) something called Coco (??).
>
I am responsible for one of the Coco/R implementations (coco/rb). The
"something" is what's called an attributed LL(1) parser-generator, as
opposed to YACC, which is LALR. These are just different ways that the
parser attacks decomposition of the target file. You can find out more at

http://www.scifac.ru.ac.za/...

There are some basic tradeoffs. Coco is very fast, and the grammar and
scanner are all in one place (YACC uses LEX as a tokeniser, which is a
separate program and input file). YACC is somewhat more flexible in the
grammars it can handle. For example, it is not possible to write a Ruby
parser in Coco/R, though you can do it in YACC. However, for 99% of
little languages that is not such a big problem; just avoid constructs
like 'goto'.

Regards,

--
-mark. (probertm at acm dot org)


Aredridel

3/1/2005 6:58:00 AM


> My impression is that some or many wikis (this is my impression of TWiki)
> don't use a "real parser" (like YACC or whatever), but instead simply search
> and replace in the text using (many) Regular Expressions. Conceptually, that
> seems an easy approach (less learning on my part, but probably tedious
> creation of many REs (or borrowing from TWiki's Perl).

I've written a wiki, though in PHP, not Ruby (yet; I'm porting).

The parser is basically recursive-descent, hand-coded, with an
insanely complex regex tokenizer.

Parsing human text with a strict parser (like racc/yacc) is Really
Hard -- the LALR(1) algorithm has too many limits to do it justice.
Even fitting it into a GLR-shaped box is hard, though I'm trying to in
my rewrite.

Your best bet is probably to use a hand-coded recursive-descent
parser, where you drag in big blobs of text and divvy them up into
successively smaller pieces, calling the parser for each recursion,
with fewer and fewer options remaining. The parsing might go like so:

With the input paragraph:

_/hi/, there_

and the markup objects "Bold", "Italic", and "Plain",

your parser might match a single token: "bold text"

Then you strip the _ _, and you have

/hi/, there

Your parser would recognize two tokens: "italic text" (hi), and plain
(", there")

It'd strip each: plain(hi), and then nothing further would match.

Your structure in memory would look like this:

Bold(Italic("hi") . ", there")

Which is then pretty trivial to mark up as HTML.
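
In Ruby, a minimal illustrative sketch of that recursion (not the PHP
parser described above) might look like this, going straight to HTML
instead of building the intermediate objects:

  # Recursive descent over inline markup: _..._ is bold, /.../ is italic.
  # Each call handles one level of markup and recurses on the inside.
  def parse_inline(text)
    case text
    when /\A_(.*)_\z/m   then "<strong>#{parse_inline($1)}</strong>"
    when %r{\A/(.*)/\z}m then "<em>#{parse_inline($1)}</em>"
    else
      # Plain text may still contain marked-up spans; split them out.
      text.gsub(%r{_.*?_|/.*?/}) { |span| parse_inline(span) }
    end
  end

  parse_inline("_/hi/, there_")
  #=> "<strong><em>hi</em>, there</strong>"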

The funky cases come in things not easily tokenized: lists,
particularly, are a pain. My current parser does ugly things like
guessing whether something is list-like and handing the whole blob to
a listifier routine, which then throws out any plaintext bits for
rendering. Ugly, but works.

A GLR parser, on the other hand, instead of breaking things down,
builds them up -- with rules like "_", anything, "_" is bold, you'd
break your stream down into tokens like so:

Literal(_) Literal(/) Text(hi) Literal(/) Text(, there) Literal(_)

And then with the rules:

  "_", anything, "_"   -> bold
  "/", anything, "/"   -> italic
  italic*|bold*|plain* -> anything
  else                 -> plain

The parser could then reduce that to the same above structure. Problem
is, that takes many tokens of look-ahead, so the parser could get
slow.

An LALR(1) parser is just like the above, except that the rules can only
use one token of look-ahead to figure out what to do. Obviously, for
the enormous variety in human text, that's a huge limit.

Hope that makes some sense.

Ari

P.S. Why not search and replace? Assume "/" means italic and "-" means
bold. Then parse this:

http://www.amazon.com/exec/obidos/tg/browse/-/3348291/ref%3Dbr%5Fbx%5Fc%5F2%5F2/002-2563138-3700849/-/foo/002-25631...

And try not to get <a href='....<em> <strong>....' ...


Randy Kramer

3/2/2005 9:05:00 PM


Thanks to everyone who responded on this thread, and especially to
Aredridel--the examples here helped me understand at least two varieties of
parsers considerably better than I did.

There is a followup for Aredridel below.

I think I'm heading in the direction of writing something that might be
called a parser (or might not be). My basic aim is to minimize the number
of passes through the document. (Simple RE substitution requires a pass
through the document for each RE (IIUC).) I think I can do it with one or
two passes, with a rather elaborate case-type statement (or many ifs) on
examination of each "token" (word).

I've written some other questions to the list. I'm going through TWiki and
trying to develop a list of all the TWiki markup--there is a lot I don't use
(so far) and I may initially deal only with the more common stuff.

(In the back of my mind I am still reserving some last resort options--I can
always install an instance of TWiki locally, and "rig" a way to feed markup
to it and recover the HTML.)


On Tuesday 01 March 2005 01:57 am, Aredridel wrote:
> The funky cases come in things not easily tokenized: lists,
> particularly, are a pain. My current parser does ugly things like
> guess whether something is list-like, hands the whole blob to a
> listifier routine, which then throws out any plaintext bits for
> rendering. Ugly, but works.

How are your lists marked up? (Or are you trying to recognize them without a
specific markup, more of a natural language type thing?)

thanks again to all,
Randy Kramer


