vikkous
4/23/2005 9:36:00 PM
A lexer, or tokenizer (they mean the same thing) divides an input
source language into words. It also removes comments and finds the
boundaries of strings. Once this is done, it's much easier to correctly
process the language in a pre-processor or parser. Here's an example.
Given this ruby code:
8+(9 *5)
a correct lexing is something like:
["8","+","(","9","*","5",")"]
(For lexing purposes, punctuation and operators count as strings as
well.)
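To make the idea concrete, here's a toy sketch of what a lexer does (this is just an illustration, nothing like RubyLexer's real implementation): match numbers, operators, and punctuation, and throw away the whitespace.

```ruby
# Toy lexer sketch: split input into number, operator, and punctuation
# tokens, discarding whitespace. Real lexers handle far more (strings,
# comments, identifiers, ...), but the shape of the job is the same.
def toy_lex(src)
  src.scan(%r{\d+|[+\-*/()]|\s+}).reject { |tok| tok =~ /\A\s+\z/ }
end

toy_lex("8+(9 *5)")  # => ["8", "+", "(", "9", "*", "5", ")"]
```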
The output of RubyLexer is actually more complicated than that... for
one thing, there are tokens for whitespace as well. For another, the
individual tokens are not Strings, but Tokens (or subclasses of it, to
be precise), a class defined in RubyLexer. Tokens do respond to to_s in
the expected way, however. (Initially, I did want to have RubyLexer
just return Strings, but it turned out I needed to distinguish
different token types, and the best way to do that is with the type
system.)
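The idea looks roughly like this (the class names here are made up for illustration; they're not RubyLexer's actual API): each token type gets its own subclass, so you can tell tokens apart by class while still getting the text back via to_s.

```ruby
# Hypothetical sketch of distinguishing token types via the type system.
class ToyToken
  def initialize(text)
    @text = text
  end

  # Every token renders back to its source text.
  def to_s
    @text
  end
end

# Subclasses carry the type information; no extra "kind" field needed.
class ToyNumberToken < ToyToken; end
class ToyWhitespaceToken < ToyToken; end

tok = ToyNumberToken.new("8")
tok.to_s              # => "8"
tok.is_a?(ToyToken)   # => true
tok.is_a?(ToyWhitespaceToken)  # => false
```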
ParseTree is a parser, not a lexer. Parsing is the next step in a
compiler pipeline; it determines what order to evaluate the operations
in an expression and solves the difficult problems of precedence and
associativity. (Another way to think of parsers is as the bit that
figures out where the implicit parentheses are inserted into the source
code.) I think that the tool corresponding to RubyLexer is Ripper, but
I don't really know, so don't blame me if I'm wrong.
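Here's a toy sketch of that "inserting the implicit parentheses" view (my own illustration, not how ParseTree or Ripper work): a tiny precedence-climbing parser over an already-lexed token list, which just writes the parentheses out explicitly.

```ruby
# Toy precedence climbing: * and / bind tighter than + and -.
PREC = { "+" => 1, "-" => 1, "*" => 2, "/" => 2 }

# A primary is a number, or a parenthesized subexpression.
def parse_primary(tokens)
  tok = tokens.shift
  if tok == "("
    inner = parse_expr(tokens, 0)
    tokens.shift  # consume the ")"
    inner
  else
    tok
  end
end

# Consume operators at or above min_prec, grouping tighter ones first.
def parse_expr(tokens, min_prec)
  lhs = parse_primary(tokens)
  while (op = tokens.first) && PREC[op] && PREC[op] >= min_prec
    tokens.shift
    rhs = parse_expr(tokens, PREC[op] + 1)
    lhs = "(#{lhs} #{op} #{rhs})"
  end
  lhs
end

parse_expr(["8", "+", "9", "*", "5"], 0)  # => "(8 + (9 * 5))"
```

Note how the parser, not the lexer, is what decides that the 9 and 5 group together before the 8 gets added in.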
I have lots of plans, of course, but being only one little programmer
with lots of big ideas, who knows if I'll ever get to them...