David G. Andersen
10/26/2004 2:38:00 AM
On Tue, Oct 26, 2004 at 10:06:55AM +0900, David G. Andersen scribed:
>
> Ahh, thanks. So the problem is really in GzipReader's each_line
> handling.
> [...]
> But gzreader_gets... is a dog. It does a memcmp()
> on each byte of the input string to test it against
I've attached a patch that reduces some of the overhead
for files with longer lines (though it doesn't fix all of the
slowdowns). Some benchmarks, with Ruby 1.8.1 on FreeBSD,
grabbing data out of the gzipped file with file.gets():
"tarfile" - compressed JDK. Line length is long (random data...)
"words" - /usr/share/dict/words gzipped. Lines are very short.
"logfile" - logfile from one of my experiments. Lines are
between 15 and 120 bytes long.
          popen    GzReader-orig    GzReader-patched
          -----    -------------    ----------------
tarfile   2.06     5.65             2.95
words     0.914    2.4              2.22
logfile   1.18     3.65             2.27
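For anyone who wants to try this at home, here's a minimal sketch of the
GzReader side of the measurement. It uses a synthetic in-memory "logfile"
(my actual test files aren't attached), but it exercises the same gets/each_line
path the patch touches:

```ruby
require 'zlib'
require 'stringio'
require 'benchmark'

# Build a small gzipped "logfile" in memory: 1000 modest-length lines.
data = (1..1000).map { |i| "entry %04d: some log text of modest length" % i }.join("\n") + "\n"
io = StringIO.new
gz = Zlib::GzipWriter.new(io)
gz.write(data)
gz.close

lines = nil
t = Benchmark.realtime do
  reader = Zlib::GzipReader.new(StringIO.new(io.string))
  lines = reader.readlines   # internally hits the gets line-splitting loop
  reader.close
end
puts lines.size   # => 1000
```

Swap in File.open on a real .gz file (and `IO.popen("zcat ...")` for the popen
column) to get numbers comparable to the table above.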
The patch is tiny and non-intrusive, which is a bonus, though its
performance improvement is not spectacular for short lines. Helps
with gzipped logfiles, at least, but someone with more {time,
knowledge of ruby's internals} might want to go in and overhaul
things for real.
-Dave
--- orig-zlib.c	Mon Oct 25 22:01:18 2004
+++ zlib.c	Mon Oct 25 22:33:26 2004
@@ -2470,7 +2470,7 @@
 {
     struct gzfile *gz = get_gzfile(obj);
     VALUE rs, dst;
-    char *rsptr, *p;
+    char *rsptr, *p, *res;
     long rslen, n;
     int rspara;
@@ -2520,8 +2520,15 @@
             gzfile_read_more(gz);
             p = RSTRING(gz->z.buf)->ptr + n - rslen;
         }
-        if (memcmp(p, rsptr, rslen) == 0) break;
-        p++, n++;
+        res = memchr(p, rsptr[0], (gz->z.buf_filled - n + 1));
+        if (!res) {
+            n = gz->z.buf_filled + 1;
+        } else {
+            n += (long)(res - p);
+            p = res;
+            if (rslen == 1 || memcmp(p, rsptr, rslen) == 0) break;
+            p++, n++;
+        }
     }
     gz->lineno++;
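For the curious, here's the idea behind the patch transliterated into (modern)
Ruby for illustration: instead of memcmp()ing the separator against every
offset, jump straight to the next occurrence of its first byte (memchr in C,
String#index here) and only do the full comparison there. The method names are
made up for the sketch:

```ruby
# Byte-at-a-time scan, like the original gzreader_gets loop:
# compare the full separator at every offset.
def find_sep_bytewise(buf, rs)
  n = 0
  n += 1 until n > buf.length - rs.length || buf[n, rs.length] == rs
  n <= buf.length - rs.length ? n : nil
end

# The patched approach: jump to each candidate occurrence of the
# separator's first byte, then compare the full separator only there.
def find_sep_skipping(buf, rs)
  first = rs[0]
  n = 0
  while (n = buf.index(first, n))
    return n if rs.length == 1 || buf[n, rs.length] == rs
    n += 1
  end
  nil
end

buf = "a" * 50 + "\r\n" + "tail"
p find_sep_bytewise(buf, "\r\n")   # => 50
p find_sep_skipping(buf, "\r\n")   # => 50
```

The skipping version does one cheap search per candidate byte instead of one
full comparison per offset, which is why the win grows with line length (and
mostly vanishes for /usr/share/dict/words).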