
comp.lang.ruby

Lexing and Parsing in Ruby.

John Carter

11/19/2003 5:32:00 AM

2 Answers

Melanie Fielder

11/19/2003 7:52:00 AM


John Carter <john.carter@tait.co.nz> wrote in the news message
Pine.LNX.4.58.0311191821040.6241@parore.tait.co.nz...
> # Here is a lexical analyzer that breaks a
> # Graphviz "Dot" file into tokens...
> # digraph a_name {
> # a -> b;
> # c -> d -> e;
> # f -> {a;b;c}
> # d;
> # };"
> #
> # Yields "digraph","a_name","{","a","->","b",";",....]
> def tokenize(string)

Have you seen the LexerInRuby page? It's a nice example which shows how it
can be done:
http://www.rubygarden.org/ruby?L...

--
Simon Strandgaard


Robert Klemme

11/19/2003 6:25:00 PM



"John Carter" <john.carter@tait.co.nz> schrieb im Newsbeitrag
news:Pine.LNX.4.58.0311191821040.6241@parore.tait.co.nz...
> Ooh this is so,so sweet....
>
> If you know anything about lexical analyzers and parsers, read this
> carefully and realize I have done something very very very Good
> here...

Sorry, I don't see what's so very very very good about it. Will you
enlighten me?

> # Here is a lexical analyzer that breaks a
> # Graphviz "Dot" file into tokens...
> # digraph a_name {
> # a -> b;
> # c -> d -> e;
> # f -> {a;b;c}
> # d;
> # };"
> #
> # Yields "digraph","a_name","{","a","->","b",";",....]
> def tokenize(string)
>   regexp = %r{ ^ (\s+ | [a-zA-Z_][a-zA-Z0-9_]* | ([-+][0-9\.]+|[0-9\.]+) | -> | \{ | \} | \; ) }x
>   match_data = regexp.match( string )
>   while match_data
>     token = match_data[0]
>     yield token unless token =~ %r{ ^\s }x
>     match_data = regexp.match( match_data.post_match )
>   end
> end

The same tokenizing can be done more compactly with String#scan:

def tokenize(string)
  string.scan( /[a-z_][a-z_\d]* | [+-]?[.\d]+ | -> | [{};]/xi ) do |m|
    yield m
  end
end
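
A quick sanity check (sample input invented for illustration):

tokenize( "a -> b; c -> d;" ) { |t| print t, " " }
# prints: a -> b ; c -> d ;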

> # Now here is the trick...
> # If we map tokens to characters
> # then parsing becomes a matter of ....
> # Matching a regular Expression again!!

I don't know the Graphviz format, but this only works as a real parser
(i.e. one that verifies the input against a grammar) if that file format
does not use nesting constructs of arbitrary depth, i.e. if it has a
regular grammar (as opposed to a context-free one). That property of the
language does not change if you replace tokens with shorter tokens.
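
To see where that bites, take the token-string pattern from the quoted
parse further down: one extra level of nested braces already falls
outside it, because no regular expression can count arbitrary nesting
depth (token strings below are invented for illustration):

pattern = %r{ ^ d i \{ ((i ; | i (> i)+ ; | i > \{ i (; i)* \})*) \} ;? $ }x
pattern =~ "di{i>i;}"       # => 0, one level of braces matches
pattern =~ "di{i>{i>{i}}}"  # => nil, two levels of nesting are rejected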

> def parse(string)
>   # Build up a list of tokens in an array
>   token_list = []
>   # Make a one to one map between token type and the array
>   token_string = ''
>
>   tokenize(string) do |token|
>     token_list << token
>     if token == 'digraph'
>       token_string += 'd'
>     elsif token =~ %r{ [\{\};] }x
>       token_string += token
>     elsif token == '->'
>       token_string += '>'
>     else
>       token_string += 'i'
>     end
>   end

Btw, note that string << 'i' is more efficient than string += 'i'.
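
The difference: << appends to the existing String in place, while +=
allocates a new String on every call. A small demo (not from the
original post):

s = "di"
t = s
s << "{"   # mutates the same object: s and t are both "di{"
s += "i"   # binds s to a new object: s is "di{i", t is still "di{"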

And you can combine the two steps with a slightly modified regexp, since
you already know during scanning what you get:

def tokenize(string)
  tk = []
  tl = ""

  string.scan( /(digraph) | ([a-z_][a-z_\d]*) | ([+-]?[.\d]+) | (->) | ([{};])/xi ) do |di,name,num,arr,noise|
    if di
      tk << di
      tl << 'd'
    elsif name
      tk << name
      tl << 'i'
    elsif num
      tk << num
      tl << 'i'
    elsif arr
      tk << arr
      tl << '>'
    elsif noise
      tk << noise
      tl << noise
    else
      # never here
    end
  end

  [tk, tl]
end

Now you can do

irb(main):144:0> tokens, pattern = tokenize( "digraph a_name {
irb(main):145:1" a -> b;
irb(main):146:1" c -> d -> e;
irb(main):147:1" f -> {a;b;c}
irb(main):148:1" d;
irb(main):149:1" };" )
=> [["digraph", "a_name", "{", "a", "->", "b", "c", "->", "d", "->", "e",
"f", "
->", "{", "a", "b", "c", "}", "d", "}"], "di{i>ii>i>ii>{iii}i}"]
irb(main):150:0> tokens
=> ["digraph", "a_name", "{", "a", "->", "b", "c", "->", "d", "->", "e",
"f", "-
>", "{", "a", "b", "c", "}", "d", "}"]
irb(main):151:0> pattern
=> "di{i>ii>i>ii>{iii}i}"
irb(main):152:0>

Rest as before.

Cheers

robert


> # Now our example is a string "di{i>i;i>i>i;i>{i;i;i}i;}"
>   regexp = %r{ ^ d i \{ ((i ; | i (> i)+ ; | i > \{ i (; i)* \})*) \} ;? $ }x
>   match_data = regexp.match( token_string )
>   raise "Parse error '#{token_string}' doesn't match '#{regexp.source}'" unless match_data
>   return [token_list[1], match_data[1], token_list[match_data.begin(1)...match_data.end(1)]]
> end
>
> name, edge_string, edge_list = parse(io)
>
> # Now edge_string is i>i;i>i>i;i>{i;i;i}i;
> # name is "a_name"
> # edge_list is ["a","->","b",";","c",...]
>
>
> John Carter Phone : (64)(3) 358 6639
> Tait Electronics Fax : (64)(3) 359 4632
> PO Box 1645 Christchurch Email : john.carter@tait.co.nz
> New Zealand
>
> A Million Monkeys can inflict worse things than just Shakespeare on
> your system.
>
>