Wincent Colaiuta
2/8/2008 9:58:00 AM
On 6 feb, 20:43, Phlip <phlip2...@gmail.com> wrote:
> Wincent Colaiuta wrote:
> > I'm working on a fast wikitext-to-HTML translator written in C and am
> > trying to optimize it.
>
> > Profiling shows that a fair portion of the time is being spent
> > instantiating temporary strings and much of that time is spent in
> > memcpy.
>
> Shouldn't you just focus on streaming? I always figured that getting faster than
> fgetc() and fputc() would requiring reinventing all the memory-mapped files and
> DMA that back those up...
I'm not sure what you mean. Can you clarify a bit? The translator
consists of an extremely faster Ragel tokenizer/lexer/scanner that
emits tokens one at a time, and a parser that decides what to do with
them. The tokens themselves don't contain copies of the corresponding
substrings, but pointers into the input string. The parser is not a
recursive descent parser but an on-the-fly translator that decides
what to emit as it sees each token (thus avoiding function call
overhead; we don't need the same level of structured parsing that a
recursive descent parser would give us because we're only transforming
wikitext into HTML, not parsing a programming language; wikitext can
be imperfect and have syntax errors whereas a strictly defined
programming language must not).
But when it comes time to emit the output you generally have to
instantiate some strings. As an example, consider parsing the
wikitext, ''foo'', which should be translated into <em>foo</em>. At
the Ruby level you want a simple interface, where you pass in a
standard String object and get back a standard String object:
parser.parse "''foo''"
=> "<em>foo</em>"
So under the covers in the C extension this is what happens:
0. Initialize empty output buffer, currently just an empty Ruby
string
1. Provide tokenizer with a pointer to the start of the input string
(no copies are made)
2. Ask tokenizer for the next token
3. Get an "EM" token (no copy of token text made, just a pointer
into the input string
4. Parser appends "<em>" to output buffer using rb_str_cat (avoiding
instantiating of Ruby String)
5. Ask tokenizer for next token
6. Get a "PRINTABLE" token (no copy of token text made, just a
pointer into the input string)
7. Parser appends "foo" to output buffer using rb_str_cat (again
avoiding instantiating a Ruby String)
8. Ask tokenizer for next token
7. Get an "EM" token
9. Parser appends "</em> to output buffer using rb_str_cat
10. Return finished String to caller
So in a general sense it is a "streaming" transformer (meaning that it
transforms the input on the fly without doing any lookahead except in
some special circumstances where it has to ask the lexer for another
token in order to decide what to emit), but I gather from your comment
that you're referring to something else.
So in looking at the code there are now very few String/Object
instantiations that I can avoid because I am already doing as much as
I can at the raw byte level (with rb_str_cat and memcpy itself).
Nevertheless it looks like those few sites are now the lowest-hanging-
fruit reported in profiling and in order to make it faster that's what
I have to attack. So I sent my original email in this thread because
there are still _some_ temporary instantiations that I could avoid on
occasions where the parser calls out to other functions (which are
actually mostly inlined), and being able to pass in a buffer without
having Ruby memcpy it would be a great help.
Cheers,
Wincent