Asp Forum - Iterating through a string and removing leading characters

Randy Kramer

3/18/2005 5:46:00 PM

This is going to seem a little strange (for a number of reasons I might
mention below), but I would like to iterate through a string from beginning
to end, examining each character, then discarding (or moving) them
(elsewhere).

This is so that I can, upon finding certain character(s), run one of several
REs on the next part of the string. I'm looking to find a very efficient way
to do this.

Here are the approaches I'm considering so far. Any other suggestions?

* ?? (Is there any method which chops the first character from the string?
How efficient is that (especially compared to the method described in the
next bullet).
* Reverse the string, then use something like s[length (-1)] to examine
then chop to discard the last character. The problem with this approach is
that I then have to reverse the string again to have the REs work. (I
(briefly) thought about just setting up the REs in reverse, but I foresee a
lot of difficulty there--scanning through the string in reverse (from last
character to first, negates the advantage I was hoping to gain, that of
starting the RE match only from positions where the RE could possibly match.)

Barring anything better, the approach I may take is to iterate through the
string and, when I find a potential RE match, use s[n,len] to return a
partial string to be checked against the RE.

Aside: I'm trying to make a very efficient parser in Ruby for a wiki-like
thing I want to build, and am trying to avoid the approach of simply making
multiple scans through the document for each of a fairly large number of REs.

I might be guilty of premature optimization, but I prefer to think of it as
doing some proof of concept testing before committing to a design.

I have done some tests that show a 1 to 10% savings in time by taking a
similar approach for REs that could only match at the beginning of a string.
(At some point I'll "publish" those results on WikiLearn (or the next
incarnation thereof).) The next REs are considerably more complex as they
can match anywhere in the string--if the savings from the same approach for
them is only 1 to 10%, the complexity will not be worth it. If by some
chance it exceeds say 50%, I will seriously consider that complexity.

Randy Kramer

43 Answers

ES

3/18/2005 8:42:00 PM

In data 3/18/2005, "Randy Kramer" <rhkramer@gmail.com> ha scritto:

>This is going to seem a little strange (for a number of reasons I might
>mention below), but I would like to iterate through a string from beginning
>to end, examining each character, then discarding (or moving) them
>(elsewhere).
>
>This is so that I can, upon finding certain character(s), run one of several
>REs on the next part of the string. I'm looking to find a very efficient way
>to do this.
>
>Here are the approaches I'm considering so far. Any other suggestions?
>
> * ?? (Is there any method which chops the first character from the string?
>How efficient is that (especially compared to the method described in the
>next bullet).
> * Reverse the string, then use something like s[length (-1)] to examine
>then chop to discard the last character. The problem with this approach is
>that I then have to reverse the string again to have the REs work. (I
>(briefly) thought about just setting up the REs in reverse, but I foresee a
>lot of difficulty there--scanning through the string in reverse (from last
>character to first, negates the advantage I was hoping to gain, that of
>starting the RE match only from positions where the RE could possibly match.)
>
>Barring anything better, the approach I may take is to iterate through the
>string and, when I find a potential RE match, use s[n,len] to return a
>partial string to be checked against the RE.
>
>Aside: I'm trying to make a very efficient parser in Ruby for a wiki-like
>thing I want to build, and am trying to avoid the approach of simply making
>multiple scans through the document for each of a fairly large number of REs.
>
>I might be guilty of premature optimization, but I prefer to think of it as
>doing some proof of concept testing before committing to a design.
>
>I have done some tests that show a 1 to 10% savings in time by taking a
>similar approach for REs that could only match at the beginning of a string.
>(At some point I'll "publish" those results on WikiLearn (or the next
>incarnation thereof).) The next REs are considerably more complex as they
>can match anywhere in the string--if the savings from the same approach for
>them is only 1 to 10%, the complexity will not be worth it. If by some
>chance it exceeds say 50%, I will seriously consider that complexity.

I implemented a semi-stateful wiki parser in a somewhat similar manner
using StringScanner. It's fairly fast.

>Randy Kramer

E

Florian Gross

3/18/2005 9:22:00 PM

Randy Kramer wrote:

> * ?? (Is there any method which chops the first character from the string?
> How efficient is that (especially compared to the method described in the
> next bullet).

str.slice!(0)

I think this is actually fairly efficient, but I suppose there's a
memmove involved at the C level.

Perhaps you could set the characters you don't need to '\0' and remove
all of them after the iteration has ended with a Regexp?

Christian Neukirchen

3/19/2005 11:04:00 AM

Florian Gross <flgr@ccan.de> writes:

> Randy Kramer wrote:
>
>> * ?? (Is there any method which chops the first character from
>> the string? How efficient is that (especially compared to the
>> method described in the next bullet).
>
> str.slice!(0)
>
> I think this is actually fairly efficient, but I suppose there's a
> memmove involved at the C level.
>
> Perhaps you could set the characters you don't need to '\0' and remove
> all of them after the iteration has ended with a Regexp?

Better use String#delete! for that, no?

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneuk...

Robert Klemme

3/19/2005 12:55:00 PM

"Christian Neukirchen" <chneukirchen@gmail.com> schrieb im Newsbeitrag
news:m2vf7n28y0.fsf@lilith.local...
> Florian Gross <flgr@ccan.de> writes:
>
>> Randy Kramer wrote:
>>
>>> * ?? (Is there any method which chops the first character from
>>> the string? How efficient is that (especially compared to the
>>> method described in the next bullet).
>>
>> str.slice!(0)
>>
>> I think this is actually fairly efficient, but I suppose there's a
>> memmove involved at the C level.
>>
>> Perhaps you could set the characters you don't need to '\0' and remove
>> all of them after the iteration has ended with a Regexp?
>
> Better use String#delete! for that, no?

For removing \0 chars? Yepp, that seems like an efficient way to do it.
For direct removing slice! and String#[]= are most efficient:

>> Benchmark.bm do
?> end
user system total real
=> true
>> Benchmark.bm do |r|
?> r.report("slice!") do
?> 10000.times { "aaaaaaaaaaaaaa".slice!(0,5) }
>> end
>> r.report("[]=") do
?> 10000.times { "aaaaaaaaaaaaaa"[0,5]="" }
>> end
>> end
user system total real
slice! 0.140000 0.000000 0.140000 ( 0.130000)
[]= 0.063000 0.000000 0.063000 ( 0.065000)
=> true
>>

Kind regards

robert

Robert Klemme

3/19/2005 1:04:00 PM

"Randy Kramer" <rhkramer@gmail.com> schrieb im Newsbeitrag
news:200503181244.16009.rhkramer@gmail.com...
> This is going to seem a little strange (for a number of reasons I might
> mention below), but I would like to iterate through a string from
> beginning
> to end, examining each character, then discarding (or moving) them
> (elsewhere).
>
> This is so that I can, upon finding certain character(s), run one of
> several
> REs on the next part of the string. I'm looking to find a very efficient
> way
> to do this.
>
> Here are the approaches I'm considering so far. Any other suggestions?
>
> * ?? (Is there any method which chops the first character from the
> string?
> How efficient is that (especially compared to the method described in the
> next bullet).
> * Reverse the string, then use something like s[length (-1)] to examine
> then chop to discard the last character. The problem with this approach
> is
> that I then have to reverse the string again to have the REs work. (I
> (briefly) thought about just setting up the REs in reverse, but I foresee
> a
> lot of difficulty there--scanning through the string in reverse (from last
> character to first, negates the advantage I was hoping to gain, that of
> starting the RE match only from positions where the RE could possibly
> match.)
>
> Barring anything better, the approach I may take is to iterate through the
> string and, when I find a potential RE match, use s[n,len] to return a
> partial string to be checked against the RE.

I'd bet that this approach is slower than a pure regexp based approach. If
you cannot stuck all exact regexps into one (see below) then maybe some form
of stripped regexps might help. For example:

rx1 = /ab+/
rx2 = /cd+/

rx_all = /(ab+)|(cd+)/

rx_stripped = /[ab](\w+)/
# then, use these on the second part
rx_stripped_1 = /^b+/
rx_stripped_2 = /^d+/

This is just a simple example for demonstration. For these simple regexps
rx_all is the most efficient one I'm sure.

> Aside: I'm trying to make a very efficient parser in Ruby for a wiki-like
> thing I want to build, and am trying to avoid the approach of simply
> making
> multiple scans through the document for each of a fairly large number of
> REs.

What does "fairly large" mean? I would try to start with stucking *all*
these regexps into one - if the rx engine does not choke on that regexp I'd
assume that this is the most efficient way to do it, as then you have the
best ratio of machine code to ruby interpretation. Maybe you just show us
all these regexps so we can better understand the problem.

> I might be guilty of premature optimization, but I prefer to think of it
> as
> doing some proof of concept testing before committing to a design.
>
> I have done some tests that show a 1 to 10% savings in time by taking a
> similar approach for REs that could only match at the beginning of a
> string.
> (At some point I'll "publish" those results on WikiLearn (or the next
> incarnation thereof).) The next REs are considerably more complex as they
> can match anywhere in the string--if the savings from the same approach
> for
> them is only 1 to 10%, the complexity will not be worth it. If by some
> chance it exceeds say 50%, I will seriously consider that complexity.

Now I'm getting really curios. Care to post some more details?

Kind regards

robert

Christian Neukirchen

3/19/2005 1:20:00 PM

"Robert Klemme" <bob.news@gmx.net> writes:

> "Randy Kramer" <rhkramer@gmail.com> schrieb im Newsbeitrag
> news:200503181244.16009.rhkramer@gmail.com...
>> This is going to seem a little strange (for a number of reasons I might
>> mention below), but I would like to iterate through a string from
>> beginning
>> to end, examining each character, then discarding (or moving) them
>> (elsewhere).
>>
>> This is so that I can, upon finding certain character(s), run one of
>> several
>> REs on the next part of the string. I'm looking to find a very
>> efficient way
>> to do this.

>> Barring anything better, the approach I may take is to iterate through the
>> string and, when I find a potential RE match, use s[n,len] to return a
>> partial string to be checked against the RE.
>
> I'd bet that this approach is slower than a pure regexp based
> approach. If you cannot stuck all exact regexps into one (see below)
> then maybe some form of stripped regexps might help. For example:
>
> rx1 = /ab+/
> rx2 = /cd+/
>
> rx_all = /(ab+)|(cd+)/
>
> rx_stripped = /[ab](\w+)/
> # then, use these on the second part
> rx_stripped_1 = /^b+/
> rx_stripped_2 = /^d+/

This reminds me of my old idea for matching against multiple regexp
by automatically combining them into larger, alternating regexps
thereby reducing the number of matches from n to log2(n)...
Anyone willing to code that?

To match t against A, B, C, D, E, F, G and H, you first match t
against A|B|C|D|E|F|G|H, then against A|B|C|D, then against A|B (or
E|F), then against A, C, E or G. Would be of great use for strscan,
too.

> robert
--
Christian Neukirchen <chneukirchen@gmail.com> http://chneuk...

Randy Kramer

3/19/2005 2:02:00 PM

Thanks to all who replied so far. I also want to look into the StringScanner
approach (I'll reply separately with some questions about that), and I can't
believe I couldn't find the ways to delete the first character of a string.
Guess I am a newbie!

On Saturday 19 March 2005 08:04 am, Robert Klemme wrote:
> I'd bet that this approach is slower than a pure regexp based approach.

So far, you're very right--my approach took about 30 times as long as the pure
regexp approach, although my Ruby code might not be very efficient. (In case
nobody noticed, I'm very much a newbie to Ruby.)

> If
> you cannot stuck all exact regexps into one (see below) then maybe some
> form of stripped regexps might help. For example:

This sounds like its worth a try, but:
1) I haven't created all the necessary REs yet
2) Question below (for clarification)

> rx1 = /ab+/
> rx2 = /cd+/
>
> rx_all = /(ab+)|(cd+)/
>
> rx_stripped = /[ab](\w+)/

Question: IIUC, the [ab] above should be [ac]?

> # then, use these on the second part
> rx_stripped_1 = /^b+/
> rx_stripped_2 = /^d+/
>
> This is just a simple example for demonstration. For these simple regexps
> rx_all is the most efficient one I'm sure.

> What does "fairly large" mean? I would try to start with stucking *all*
> these regexps into one - if the rx engine does not choke on that regexp I'd
> assume that this is the most efficient way to do it, as then you have the
> best ratio of machine code to ruby interpretation. Maybe you just show us
> all these regexps so we can better understand the problem.

It's hard even to guess, I intended to combine several REs into one anyway
when they had a lot of commonality. For example, the TWiki markup for
headings (which I'm planning to use) is like this:

---* Level 1
---** Level 2
---*** Level 3
---**** Level 4
---***** Level 5
---****** Level 6

I've planned to use one RE for all the above, then determine the level from
the length of the match (like level = len - 3).

Likewise, "inline" markup is *for bold*, _for italic,_ __for bold italic__,
and so forth. I'd try to have one RE looking for words preceded by _, *, or
__, and another with words ending with the same. (And might combine words
marked with % for %TWikiVariables% as well.

With "optimizations" like this, I'd guess on the order of 15 or so regexps.

> Now I'm getting really curios. Care to post some more details?

I presume you mean on the 1 to 10% savings? I planned to do that, I'll try to
put something on WikiLearn this weekend then post something here.

Randy Kramer

Randy Kramer

3/19/2005 6:25:00 PM

On Friday 18 March 2005 03:42 pm, ES wrote:
> I implemented a semi-stateful wiki parser in a somewhat similar manner
> using StringScanner. It's fairly fast.

StringScanner looks interesting, and I'm starting to dig into it. Care to
provide any more information? Do you have a working wiki?

Randy Kramer

ES

3/19/2005 6:33:00 PM

In data 3/19/2005, "Randy Kramer" <rhkramer@gmail.com> ha scritto:

>On Friday 18 March 2005 03:42 pm, ES wrote:
>> I implemented a semi-stateful wiki parser in a somewhat similar manner
>> using StringScanner. It's fairly fast.
>
>StringScanner looks interesting, and I'm starting to dig into it. Care to
>provide any more information? Do you have a working wiki?

I do. I was waiting to complete the actual application it is
to feature in before releasing it but if you want to see the
parsing code, I'll do a round or two of refactoring and post
it on my website, work permitting.

>Randy Kramer

E

Randy Kramer

3/19/2005 8:55:00 PM

On Saturday 19 March 2005 07:59 am, Robert Klemme wrote:
> "Christian Neukirchen" <chneukirchen@gmail.com> schrieb im Newsbeitrag
> news:m2vf7n28y0.fsf@lilith.local...
> > Better use String#delete! for that, no?

> For direct removing slice! and String#[]= are most efficient:

What is the significance to the # in String#delete! and String#[]= above? (I
can't find anything with the # in it like that in "Ruby In a Nutshell"--is it
something new, or just some shorthand/jargon? (Or did I look in the wrong
places ;-)

> >> Benchmark.bm do

Ahh, that's useful, I was using Time::now to time my functions.

Randy Kramer

comp.lang.ruby

Iterating through a string and removing leading characters

Randy Kramer

ES

Florian Gross

Christian Neukirchen

Robert Klemme

Robert Klemme

Christian Neukirchen

Randy Kramer

Randy Kramer

ES

Randy Kramer

x Login to ForumsZone