Asp Forum - Re: [ANN] cursor-0.6

Eric Mahurin

5/26/2005 2:33:00 PM

--- Caleb Clausen <vikkous@gmail.com> wrote:
> I am very interested in Cursor for use in Reg.

Thanks for the interest. If you want to help out, let me know.
I think the API is stabilizing. The only functional change in
my mind right now is having "position"s shift with insertions
and deletions before them. A lot of optimization is needed
(i.e. derived classes do one element at a time and #pos/#pos=
is typically O(n) instead of O(1)). I'm moving on to Grammar
right now.

> It seems like
> the sort
> of thing to allow Reg's Array-based backtracking pattern
> matcher to
> operate on Strings or Files or whatever else.
> (Did you know that I needed this by esp? :) Since you're got
> your own
> lexer/parser/pattern matcher thing, (grammar?) maybe you
> don't want to
> help the competition...

Competition is good for the consumer :) The competitors
enhance their own products and use each others ideas.

Since we are both doing parsing type things, I can see how this
might be useful to you.

> Anyway, cursor isn't quite functional enough for my purposes.
> Passing
> patterns for the length parameter to #get and #set is great,
> but there
> aren't enough pattern types. I think I need the equivalent of
> Cursor#===(regexp), which is to say, the ability to compare a
> Regexp
> to a position within a String. Or, if the backing store of
> the cursor
> is a file, you get the ability to compare a Regexp to a File.
> Someone
> was asking for that recently, it's not easy, but it would be
> great if
> your lib did it.... just the ability to regexp against the
> middle of a
> String would make me very happy, tho. I've hacked this up
> before, but
> I think that a reasonably efficient version would require a
> bit of c
> extension.

Funny you ask for that - cursor===regexp. Instead, my grammar
package will support grammar===cursor. I think the only case I
could fully implement cursor===regexp is when the data the
cursor is accessing is a string. I can't do it for the case
when the cursor is a IO (unless I read the whole file into a
String). Regexp is too tied to String. Also, right now,
nothing in Cursor is tied specifically to Strings (or
characters). I want to keep it that way - dealing with
elements and element sequences.

This === pattern matching would be better placed in the pattern
matching class if it is possible - regexp=== cursor,
grammar===cursor, or reg===cursor.

> Ideally, of course, I want to see the ability to compare
> against a
> Reg, particularly the Regs that match multiple array
> elements. If you
> made it work with whatever the equivalent in your system is,
> that
> would be almost as good.
>
> How does Grammar work? I would appreciate if you could write
> a couple
> of paragraphs of high-level overview of Cursor, Grammar,
> their design
> and capabilites and how they're intended to work together.

Cursor is a unification of many external iterator ideas -
iterators from C++, streams from Java, iterators from Java,
Ruby IO, and a text editor "cursor". Compared to most other
general external iterators, Cursor offers two interesting
features - ability to insert/delete and the ability to
save/restore a position.

Since I haven't released grammar yet, its predecessor will give
you an idea:

http://www.ruby-talk.org/cgi-bin/scat.rb/ruby/ruby-t...

In this, I'm using RandomAccessStream instead of Cursor. And
Syntax is equivalent to the Grammar I'm working on.

A Grammar is used to match what's next in Cursor. Like Cursor,
it can deal with any objects, not just characters and Strings.

I provide some operator overloading to make specifying a
Grammar like BNF - "|" (Grammar::Alteration), "+"
(Grammar::Sequence), "*" (Grammar::Repeat), etc. I also put
some of these in built-in classes (String, Range) to make it
even easier (i.e. "hi"|"hello" makes a Grammar that matches
either "hi" or "hello").

With this you can make a parser that works directly on the
Cursor (from an IO, String, etc) or you can split it up into
lexer and parser. In this case, the parser (a Grammar) would
deal with a Cursor that spits out tokens. You can make your
own lexer for that or use Grammar again. Using Grammar to make
a lexer you'd have first need the Grammar for a token which
should be a big Alteration (or a table lookup). This token
Grammar would also need to tag these appropriately to
distinguish token types for the parser. The lexer would then
be a Cursor which would returns matches from the token Grammar
against the original Cursor. Here is some psuedo code to give
an idea of what was just said:

token = (tokenGrammar===ioCursor)

tokenCursor = Cursor of tokens (lexer)

parsetree = (parserGrammar===tokenCursor)

Like the ideas?

__________________________________
Do you Yahoo!?
Yahoo! Small Business - Try our new Resources site
http://smallbusiness.yahoo.com/...

8 Answers

Caleb Clausen

5/27/2005 4:14:00 AM

Eric Mahurin wrote:
> Funny you ask for that - cursor===regexp. Instead, my grammar
> package will support grammar===cursor. I think the only case I
> could fully implement cursor===regexp is when the data the
> cursor is accessing is a string. I can't do it for the case
> when the cursor is a IO (unless I read the whole file into a
> String). Regexp is too tied to String. Also, right now,
> nothing in Cursor is tied specifically to Strings (or
> characters). I want to keep it that way - dealing with
> elements and element sequences.
>
> This === pattern matching would be better placed in the pattern
> matching class if it is possible - regexp=== cursor,
> grammar===cursor, or reg===cursor.

This is the right way, with the pattern on the left side of === and
the data on the right. (I've done it the other way before, but it's
yukky.) However, I still want a more sophisticated matcher built-in to
Cursor, else I'm not sure how to make use of Cursors.

Regexp#===(FileCursor) is just as hard as the other way around. How
would that be implemented?

> I provide some operator overloading to make specifying a
> Grammar like BNF - "|" (Grammar::Alteration), "+"
> (Grammar::Sequence), "*" (Grammar::Repeat), etc. I also put
> some of these in built-in classes (String, Range) to make it
> even easier (i.e. "hi"|"hello" makes a Grammar that matches
> either "hi" or "hello").

I've done much the same with Reg. However, I don't mix-in these
capabilites to any standard class except Regexp. (The user can always
request that they be mixed in, tho, so my equivalent to your
"hi"|"hello" is "hi".reg|"hello", which I admit is a bit ugly.)

Are you allowing Regexps in your language? Or is the idea that users
can build up Regexp-like patterns with the (equivalent) mechanisms
you're supplying? Regexps are one of the few things in ruby that are
actually fast; doing your own will be much slower...

I suppose you could translate a Regexp into an equivalent Grammar (or
Reg), but again, it will be slow. (And a lot of work.)

I wish you could tell me more about Grammar (or Syntax) internals...
Is it an LL or LR thang? How is matching actually done at runtime? Do
you compile to a parse table? Or 'interpret' the parsing at runtime,
as I am doing. My own system is Regexp-like, featuring an mostly
unoptimized DFA engine. Reg::Constant makes possible recursive
patterns, which makes the whole thing LL-equivalent (I think).

What happens when this code runs:
"foo"|"bar"|"baz"===cursor

Does cursor get scanned 3 times, for "foo", then "bar", then "baz"?
That's going to be slow... especially if cursor is a FileCursor.
Alternation doesn't need to be so slow... you can scan until you see
the first f or b, then attempt a match from there. This eliminates the
need to go through the cursor's data multiple times. But I don't see
support for this kind of thing in Cursor. I understand your need to
keep Cursor clean desire to get on to the fun stuff, but I beg you to
add this one additional feature: to read data from the cursor position
until you find a member of a set of items given to you by the user. I
think this capability makes possible an NFA engine atop Cursor, which
should be reasonably fast. And it ought to be easy enough for you.

In other words, I'd like to see Cursor#get(Set) or
Cursor#get(CharSet), where which one you pass depends on the type of
cursor. (This actually gets kind of complicated when it comes to
objects... in addition to a Set, you'd like to be able to take a Proc
too... or maybe just an object that responds to ===.

> With this you can make a parser that works directly on the
> Cursor (from an IO, String, etc)

Please tell me how this would work, because I have never been able to
envisage it properly, but I like the idea.

> token = (tokenGrammar===ioCursor)
>
> tokenCursor = Cursor of tokens (lexer)
>
> parsetree = (parserGrammar===tokenCursor)
>
> Like the ideas?

Well, yes, very much so. Actually, I'm suspecting esp again.

Have you given any thought to having a Cursor into the object graph,
or Cursors into graphs in general? This is related to something I want
to do with Reg, but I haven't figured it out yet...

Oh, and incidently, Grammar::Alteration should really be Grammar::Alternation.

Eric Mahurin

5/28/2005 7:51:00 PM

--- Caleb Clausen <vikkous@gmail.com> wrote:
> Eric Mahurin wrote:
> > Funny you ask for that - cursor===regexp. Instead, my
> grammar
> > package will support grammar===cursor. I think the only
> case I
> > could fully implement cursor===regexp is when the data the
> > cursor is accessing is a string. I can't do it for the
> case
> > when the cursor is a IO (unless I read the whole file into
> a
> > String). Regexp is too tied to String. Also, right now,
> > nothing in Cursor is tied specifically to Strings (or
> > characters). I want to keep it that way - dealing with
> > elements and element sequences.
> >
> > This === pattern matching would be better placed in the
> pattern
> > matching class if it is possible - regexp=== cursor,
> > grammar===cursor, or reg===cursor.
>
> This is the right way, with the pattern on the left side of
> === and
> the data on the right. (I've done it the other way before,
> but it's
> yukky.) However, I still want a more sophisticated matcher
> built-in to
> Cursor, else I'm not sure how to make use of Cursors.
>
> Regexp#===(FileCursor) is just as hard as the other way
> around. How
> would that be implemented?

I think you'd have to rewrite it from scratch in Ruby or do
stuff in C. Like I said, Regexp is too tied to String. If it
could handle String-like objects, you might be able to do it.
Or if it would have a callback when it reached the end of the
string (or the matched failed) that might help. Unfortunately,
there isn't enough control.

> > I provide some operator overloading to make specifying a
> > Grammar like BNF - "|" (Grammar::Alternation), "+"
> > (Grammar::Sequence), "*" (Grammar::Repeat), etc. I also
> put
> > some of these in built-in classes (String, Range) to make
> it
> > even easier (i.e. "hi"|"hello" makes a Grammar that matches
> > either "hi" or "hello").
>
> I've done much the same with Reg. However, I don't mix-in
> these
> capabilites to any standard class except Regexp. (The user
> can always
> request that they be mixed in, tho, so my equivalent to your
> "hi"|"hello" is "hi".reg|"hello", which I admit is a bit
> ugly.)
>
> Are you allowing Regexps in your language? Or is the idea
> that users
> can build up Regexp-like patterns with the (equivalent)
> mechanisms
> you're supplying? Regexps are one of the few things in ruby
> that are
> actually fast; doing your own will be much slower...

I have no plans to put Regexps in Grammar because I don't know
how to get them to work on anything but a String (my main
target would be parsing - an IO). The problem is you have to
everything you might want to match in a String before matching
with a Regexp. Think about a Regexp for a multi-line comment -
how many lines do you read into a String before attempting to
match? If you have any ideas of how to use them more generally
I'm listening. I would like to incorporate them because of the
speed issue you said.

With what I'm doing there is no reason that couldn't use
Regexps to cobble a lexer together that implements the Cursor
interface and then use Grammar as the parser.

> I suppose you could translate a Regexp into an equivalent
> Grammar (or
> Reg), but again, it will be slow. (And a lot of work.)
>
> I wish you could tell me more about Grammar (or Syntax)
> internals...
> Is it an LL or LR thang? How is matching actually done at
> runtime? Do
> you compile to a parse table? Or 'interpret' the parsing at
> runtime,
> as I am doing.

What I have now is quite simple right now. It is LL. But, it
doesn't do the conventional lookahead for determining which
choice to take on an alternation. Instead, it simply tries
each choice in order and rewinds when a choice fails. Later, I
might change this - doing LL(1) when there is no ambiguity.
But, right now it effectively has infinite lookahead. What I'm
providing here is just a set of convienences that allow you to
build a parser like you would do by hand. The behavior is very
predictable.

> My own system is Regexp-like, featuring an
> mostly
> unoptimized DFA engine. Reg::Constant makes possible
> recursive
> patterns, which makes the whole thing LL-equivalent (I
> think).
>
> What happens when this code runs:
> "foo"|"bar"|"baz"===cursor
>
> Does cursor get scanned 3 times, for "foo", then "bar", then
> "baz"?
> That's going to be slow... especially if cursor is a
> FileCursor.

It does get scanned 3 times in this case. But, you will not
want to use Cursor::IO directly. You'll use Cursor::Buffered
instead so that you only read the file once and from the buffer
the other 2 times.

My take right now is to leave the details of how the parsing is
up to whoever is specifying the grammar. I would rather not
have too much magic going on. So in the above example, you
would be better off saying this to specify the Grammar:

"foo"|("ba"+("r"|"z"))

> Alternation doesn't need to be so slow... you can scan until
> you see
> the first f or b, then attempt a match from there. This
> eliminates the
> need to go through the cursor's data multiple times. But I
> don't see
> support for this kind of thing in Cursor. I understand your
> need to
> keep Cursor clean desire to get on to the fun stuff, but I
> beg you to
> add this one additional feature: to read data from the cursor
> position
> until you find a member of a set of items given to you by the
> user. I
> think this capability makes possible an NFA engine atop
> Cursor, which
> should be reasonably fast. And it ought to be easy enough for
> you.

The main reason I added the various options to #get was that
you have the same options in IO (implemented in C), you you'll
get a speed advantage. get(nil): IO#getc, get(n): IO#read(n),
get(eol): IO#gets(eol). I don't think adding what you are
talking about would add any efficiency if it was placed in
Cursor as opposed to the pattern-matching class (Grammar or
Reg).

> In other words, I'd like to see Cursor#get(Set) or
> Cursor#get(CharSet), where which one you pass depends on the
> type of
> cursor. (This actually gets kind of complicated when it comes
> to
> objects... in addition to a Set, you'd like to be able to
> take a Proc
> too... or maybe just an object that responds to ===.

Maybe just a Proc to handle all of these cases - passing the
current string/array. But, I still don't think you'd get much
more efficiency compared to having the Grammar/Reg do this (one
element at a time).

> > With this you can make a parser that works directly on the
> > Cursor (from an IO, String, etc)
>
> Please tell me how this would work, because I have never been
> able to
> envisage it properly, but I like the idea.
>
> > token = (tokenGrammar===ioCursor)
> >
> > tokenCursor = Cursor of tokens (lexer)
> >
> > parsetree = (parserGrammar===tokenCursor)
> >
> > Like the ideas?
>
> Well, yes, very much so. Actually, I'm suspecting esp again.
>
> Have you given any thought to having a Cursor into the object
> graph,
> or Cursors into graphs in general? This is related to
> something I want
> to do with Reg, but I haven't figured it out yet...

Not really. Cursor represents a movable pointer into a
sequential (ordered) data structure. arrays, strings, linked
lists, files, buffers, and the like make sense with this. I
guess you could assign order to the nodes in the graph and put
a Cursor around it, but I don't know if it would be that
useful.

If you want to take this off-line, you are welcome to email me
directly.

I'll try to release something for Grammar in the next week or
two.

__________________________________
Do you Yahoo!?
Yahoo! Small Business - Try our new Resources site
http://smallbusiness.yahoo.com/...

Simon Strandgaard

5/28/2005 9:59:00 PM

On 5/28/05, Eric Mahurin <eric_mahurin@yahoo.com> wrote:
[snip]
> I think you'd have to rewrite it from scratch in Ruby or do
> stuff in C. Like I said, Regexp is too tied to String. If it
> could handle String-like objects, you might be able to do it.
> Or if it would have a callback when it reached the end of the
> string (or the matched failed) that might help. Unfortunately,
> there isn't enough control.
[snip]

I have made a regexp lib where you can supply your own
iterators if you have string like classes.

http://rubyforge.org/frs/download.php/1311/regexp-engin...

Maybe it has interest?

--
Simon Strandgaard

Eric Mahurin

5/29/2005 10:14:00 PM

--- Simon Strandgaard <neoneye@gmail.com> wrote:
> On 5/28/05, Eric Mahurin <eric_mahurin@yahoo.com> wrote:
> [snip]
> > I think you'd have to rewrite it from scratch in Ruby or do
> > stuff in C. Like I said, Regexp is too tied to String. If
> it
> > could handle String-like objects, you might be able to do
> it.
> > Or if it would have a callback when it reached the end of
> the
> > string (or the matched failed) that might help.
> Unfortunately,
> > there isn't enough control.
> [snip]
>
> I have made a regexp lib where you can supply your own
> iterators if you have string like classes.
>
>
http://rubyforge.org/frs/download.php/1311/regexp-engin...
>
> Maybe it has interest?

Possibly. From what I could tell this is pure ruby. So it
offers a lot more flexibility than the C Regexp. But, you
loose all of the speed advantage, right? For my purposes, that
was the only reason I wanted to incorporate Regexp - because
Grammar will be able to do Regexp like stuff - and more.

__________________________________
Do you Yahoo!?
Yahoo! Small Business - Try our new Resources site
http://smallbusiness.yahoo.com/...

Simon Strandgaard

5/30/2005 6:08:00 AM

On 5/30/05, Eric Mahurin <eric_mahurin@yahoo.com> wrote:
> --- Simon Strandgaard <neoneye@gmail.com> wrote:
[snip]
> > I have made a regexp lib where you can supply your own
> > iterators if you have string like classes.
> >
> >
> http://rubyforge.org/frs/download.php/1311/regexp-engin...
> >
> > Maybe it has interest?
>
> Possibly. From what I could tell this is pure ruby. So it
> offers a lot more flexibility than the C Regexp. But, you
> loose all of the speed advantage, right?

Yes, its far from the same speed as C.

> For my purposes, that
> was the only reason I wanted to incorporate Regexp - because
> Grammar will be able to do Regexp like stuff - and more.

A grammar class sounds interesting..
Any examples of usage ?

--
Simon Strandgaard

Eric Mahurin

5/30/2005 3:25:00 PM

--- Simon Strandgaard <neoneye@gmail.com> wrote:

> On 5/30/05, Eric Mahurin <eric_mahurin@yahoo.com> wrote:
> > --- Simon Strandgaard <neoneye@gmail.com> wrote:
> [snip]
> > > I have made a regexp lib where you can supply your own
> > > iterators if you have string like classes.
> > >
> > >
> >
>
http://rubyforge.org/frs/download.php/1311/regexp-engin...
> > >
> > > Maybe it has interest?
> >
> > Possibly. From what I could tell this is pure ruby. So it
> > offers a lot more flexibility than the C Regexp. But, you
> > lose all of the speed advantage, right?
>
> Yes, its far from the same speed as C.
>
>
> > For my purposes, that
> > was the only reason I wanted to incorporate Regexp -
> because
> > Grammar will be able to do Regexp like stuff - and more.
>
> A grammar class sounds interesting..
> Any examples of usage ?

I still don't have a first release of grammar. But, you can
look at the predecessor:

http://www.ruby-talk.org/cgi-bin/scat.rb/ruby/ruby-t...

I changed the name from syntax to grammar because there is
already a syntax package for syntax hilighting.

The usage for the Grammar classes will be very similar. Each
type of Grammar is a derived class - Sequence, Alternation,
Repeat, etc. === is used to match a Grammar against a Cursor.
And I also put stuff in String and Range to make creating a
Grammar even easier.

__________________________________
Do you Yahoo!?
Yahoo! Small Business - Try our new Resources site
http://smallbusiness.yahoo.com/...

Simon Strandgaard

5/30/2005 5:00:00 PM

On 5/30/05, Eric Mahurin <eric_mahurin@yahoo.com> wrote:
> --- Simon Strandgaard <neoneye@gmail.com> wrote:
[snip]
> > A grammar class sounds interesting..
> > Any examples of usage ?
>
> I still don't have a first release of grammar. But, you can
> look at the predecessor:
>
> http://www.ruby-talk.org/cgi-bin/scat.rb/ruby/ruby-t...
>
> I changed the name from syntax to grammar because there is
> already a syntax package for syntax hilighting.
>
> The usage for the Grammar classes will be very similar. Each
> type of Grammar is a derived class - Sequence, Alternation,
> Repeat, etc. === is used to match a Grammar against a Cursor.
> And I also put stuff in String and Range to make creating a
> Grammar even easier.

Have you any unittests for your syntax/grammar code ?

Your === operator reminds me of my own #match methods, in
this file.

http://rubyforge.org/cgi-bin/viewcvs.cgi/projects/regexp_engine/source/scanner_nodes.rb?rev=1.22&cvsroot=aeditor&content-type=text/vnd.view...

--
Simon Strandgaard

Eric Mahurin

5/30/2005 9:06:00 PM

--- Simon Strandgaard <neoneye@gmail.com> wrote:

> On 5/30/05, Eric Mahurin <eric_mahurin@yahoo.com> wrote:
> > --- Simon Strandgaard <neoneye@gmail.com> wrote:
> [snip]
> > > A grammar class sounds interesting..
> > > Any examples of usage ?
> >
> > I still don't have a first release of grammar. But, you
> can
> > look at the predecessor:
> >
> >
>
http://www.ruby-talk.org/cgi-bin/scat.rb/ruby/ruby-t...
> >
> > I changed the name from syntax to grammar because there is
> > already a syntax package for syntax hilighting.
> >
> > The usage for the Grammar classes will be very similar.
> Each
> > type of Grammar is a derived class - Sequence, Alternation,
> > Repeat, etc. === is used to match a Grammar against a
> Cursor.
> > And I also put stuff in String and Range to make creating a
> > Grammar even easier.
>
> Have you any unittests for your syntax/grammar code ?

Not yet.

> Your === operator reminds me of my own #match methods, in
> this file.
>
>
http://rubyforge.org/cgi-bin/viewcvs.cgi/projects/regexp_engine/source/scanner_nodes.rb?rev=1.22&cvsroot=aeditor&content-type=text/vnd.view...

Yes, this is quite similar. You, Caleb, and I look to have
very similar code and needs - external iterators and pattern
matches/parsers.

__________________________________
Do you Yahoo!?
Yahoo! Small Business - Try our new Resources site
http://smallbusiness.yahoo.com/...

comp.lang.ruby

Re: [ANN] cursor-0.6

Eric Mahurin

Caleb Clausen

Eric Mahurin

Simon Strandgaard

Eric Mahurin

Simon Strandgaard

Eric Mahurin

Simon Strandgaard

Eric Mahurin

x Login to ForumsZone