Asp Forum - Zero-copy String Handling

Curt Sampson

12/10/2007 3:52:00 PM

I'm writing a C extension that involves fast scanning through and
parsing of tab-delimited files. Basically, I mmap the file, figure out
where the row and column boundaries are, and for each row end up with
an array of strings (pointer and length) for each row that I then pass
on to other C or Ruby code. The array and its strings are not supposed
to be modified by the callees, only read, and I can also live with the
callees being required to make their own copies of the strings and
arrays if they need to keep the data accessable after the call, if I can
figure out some way to enforce that.

It appears to me that this means I don't really have any need to
copy the data; I ought to just be able to set up a bunch of (likely
frozen) String objects and then tweak the ptr and len on them and pass
them around, avoiding any allocations or data copies. From a bit of
experimentation, I can see that dropping several calls to rb_str_new for
each row results in an enormous speed increase--about ten-fold--in how
fast I can scan through the file.

Does anybody have any suggestions on a reasonably safe way to do this?

cjs
--
Curt Sampson <cjs@starling-software.com> +81 90 7737 2974
Mobile sites and software consulting: http://www.starling-so...

10 Answers

Eric I.

12/10/2007 4:15:00 PM

On Dec 10, 10:52 am, Curt Sampson <c...@cynic.net> wrote:
> It appears to me that this means I don't really have any need to
> copy the data; I ought to just be able to set up a bunch of (likely
> frozen) String objects and then tweak the ptr and len on them and pass
> them around, avoiding any allocations or data copies. From a bit of
> experimentation, I can see that dropping several calls to rb_str_new for
> each row results in an enormous speed increase--about ten-fold--in how
> fast I can scan through the file.
>
> Does anybody have any suggestions on a reasonably safe way to do this?

I've read (although I've never verified this by looking at the source
code) that Ruby's strings are copy-on-write. So if you have code like
this:

s1 = "This is some string."
s2 = s1[5, 7] # => "is some"
s3 = s1[5..11] # => "is some"

no copies are made and data is shared ... until you modify the data,
in which case the copy is made.

If you wanted to enforce the read-only aspect, perhaps you could
create a subclass of String where all the methods and operators that
modify the character data are re-implemented to raise exceptions.

Eric

====

Interested in hands-on, on-site Ruby training? See http://Lea...
for information about a well-reviewed class.

Robert Klemme

12/10/2007 5:40:00 PM

On 10.12.2007 16:52, Curt Sampson wrote:
> I'm writing a C extension that involves fast scanning through and
> parsing of tab-delimited files. Basically, I mmap the file, figure out
> where the row and column boundaries are, and for each row end up with
> an array of strings (pointer and length) for each row that I then pass
> on to other C or Ruby code. The array and its strings are not supposed
> to be modified by the callees, only read, and I can also live with the
> callees being required to make their own copies of the strings and
> arrays if they need to keep the data accessable after the call, if I can
> figure out some way to enforce that.
>
> It appears to me that this means I don't really have any need to
> copy the data; I ought to just be able to set up a bunch of (likely
> frozen) String objects and then tweak the ptr and len on them and pass
> them around, avoiding any allocations or data copies. From a bit of
> experimentation, I can see that dropping several calls to rb_str_new for
> each row results in an enormous speed increase--about ten-fold--in how
> fast I can scan through the file.
>
> Does anybody have any suggestions on a reasonably safe way to do this?

This is what I'd do: create a single string per line and use substring
(aka #[]) to create strings that represent the portion needed; byte
buffer will be shared then. You don't even need to freeze them because
of copy on write.

Kind regards

robert

ara.t.howard

12/10/2007 7:13:00 PM

On Dec 10, 2007, at 8:52 AM, Curt Sampson wrote:

> Does anybody have any suggestions on a reasonably safe way to do this?

this code shares memory between an mmap and narray

http://codeforp...lib/...

look at na_str.c, it uses rb_str_new4 to avoid copying.

a @ http://codeforp...
--
share your knowledge. it's a way to achieve immortality.
h.h. the 14th dalai lama

Curt Sampson

12/10/2007 11:39:00 PM

On 2007-12-11 02:44 +0900 (Tue), Robert Klemme wrote:

> This is what I'd do: create a single string per line and use substring
> (aka #[]) to create strings that represent the portion needed; byte
> buffer will be shared then. You don't even need to freeze them because
> of copy on write.

This was attractive for a couple of seconds, until I realized that not
only does it still add a copy of the entire row of data (albeit as one
large allocation rather than many small ones), but it also doesn't
reduce my object creation load at all. I seem to recall last time I was
playing around with this sort of thing and using a profiler, GC was an
enormous cost for me. This probably isn't surprising given the nature of
the problem; a typical file might be ten million rows of fifty elements
each, which would be 500 million object creations and collections.

cjs
--
Curt Sampson <cjs@starling-software.com> +81 90 7737 2974
Mobile sites and software consulting: http://www.starling-so...

Tim Hunter

12/10/2007 11:50:00 PM

Curt Sampson wrote:
> On 2007-12-11 02:44 +0900 (Tue), Robert Klemme wrote:
>
>> This is what I'd do: create a single string per line and use substring
>> (aka #[]) to create strings that represent the portion needed; byte
>> buffer will be shared then. You don't even need to freeze them because
>> of copy on write.
>
> This was attractive for a couple of seconds, until I realized that not
> only does it still add a copy of the entire row of data (albeit as one
> large allocation rather than many small ones), but it also doesn't
> reduce my object creation load at all. I seem to recall last time I was
> playing around with this sort of thing and using a profiler, GC was an
> enormous cost for me. This probably isn't surprising given the nature of
> the problem; a typical file might be ten million rows of fifty elements
> each, which would be 500 million object creations and collections.
>
> cjs

For a problem of this scale, it seems like it would make sense to use a
custom class that had some of the methods of String - enough for the
callees to treat it like a String - but not in fact String. Give it a
to_s method to convert to a real String when it's really necessary.

--
RMagick: http://rmagick.ruby...

Curt Sampson

12/11/2007 1:41:00 AM

On 2007-12-11 08:50 +0900 (Tue), Tim Hunter wrote:

> For a problem of this scale, it seems like it would make sense to use a
> custom class that had some of the methods of String - enough for the
> callees to treat it like a String - but not in fact String. Give it a .to_s
> method to convert to a real String when it's really necessary.

That's a good idea. In fact, I could use a struct of the same shape as
RString, and just invoke methods in string.c to do the real work for the
methods I decide to implement. I could also leave the objects existant
after the backing pages have been unmapped (as I do now with the String
objects), but tweak them so that they throw an exception when you try
to use them after that, as opposed to the current situation where the
interpreter dumps core.

Thanks for the idea.

cjs
--
Curt Sampson <cjs@starling-software.com> +81 90 7737 2974
Mobile sites and software consulting: http://www.starling-so...

Tim Hunter

12/11/2007 1:56:00 AM

Curt Sampson wrote:
> On 2007-12-11 08:50 +0900 (Tue), Tim Hunter wrote:
>
>> For a problem of this scale, it seems like it would make sense to use a
>> custom class that had some of the methods of String - enough for the
>> callees to treat it like a String - but not in fact String. Give it a .to_s
>> method to convert to a real String when it's really necessary.
>
> That's a good idea. In fact, I could use a struct of the same shape as
> RString, and just invoke methods in string.c to do the real work for the
> methods I decide to implement. I could also leave the objects existant
> after the backing pages have been unmapped (as I do now with the String
> objects), but tweak them so that they throw an exception when you try
> to use them after that, as opposed to the current situation where the
> interpreter dumps core.
>
> Thanks for the idea.
>
> cjs

Works for me. RMagick 2 has the ability to "destroy" an image. A
destroyed image is like a closed file. The object still exists but any
attempt to use it raises a DestroyedImage exception.

--
RMagick: http://rmagick.ruby...

Robert Klemme

12/11/2007 8:23:00 AM

2007/12/11, Tim Hunter <TimHunter@nc.rr.com>:
> Curt Sampson wrote:
> > On 2007-12-11 02:44 +0900 (Tue), Robert Klemme wrote:
> >
> >> This is what I'd do: create a single string per line and use substring
> >> (aka #[]) to create strings that represent the portion needed; byte
> >> buffer will be shared then. You don't even need to freeze them because
> >> of copy on write.
> >
> > This was attractive for a couple of seconds, until I realized that not
> > only does it still add a copy of the entire row of data (albeit as one
> > large allocation rather than many small ones), but it also doesn't
> > reduce my object creation load at all. I seem to recall last time I was
> > playing around with this sort of thing and using a profiler, GC was an
> > enormous cost for me. This probably isn't surprising given the nature of
> > the problem; a typical file might be ten million rows of fifty elements
> > each, which would be 500 million object creations and collections.
> >
> > cjs
>
> For a problem of this scale, it seems like it would make sense to use a
> custom class that had some of the methods of String - enough for the
> callees to treat it like a String - but not in fact String. Give it a
> .to_s method to convert to a real String when it's really necessary.

But you still have the GC overhead - regardless whether you create
String or Object.

Another idea would be to change the interface. "Pseudo" code:

io.each do |line|
line.freeze
some_smart_parsing do |line, start, end|
# line does not change, start and end are integer indexes
end
end

I.e., that way you would create a single String per line only while
allowing the caller to create substring instances if needed. If the
client code needs to do that anyway you could as well create those
String instances yourself because it does not make any difference. If
not, you save factor of 50 creations (ints are treated differently).
But I'd say chances are that client code will do more complex
manipulations and in that case the question is whether you have the
right (i.e. fast enough) tool at all. Because these blocks *will* do
some calculation and those will likely also create objects etc. I'd
say you either have to live with the overhead or use a different tool
altogether.

Here's another variant: if you mmap the file and it fits into mem you
could as well do

some_smart_parsing do |s, start, end, field_index|
# s is the whole file and does not change, start and end are integer indexes
end

The field index could be a flag indicating first or not first field in
a record. But this interface starts to get contrived and you still
have the overhead in the block.

Kind regards

robert

--
use.inject do |as, often| as.you_can - without end

Curt Sampson

12/11/2007 3:42:00 PM

On 2007-12-11 17:22 +0900 (Tue), Robert Klemme wrote:

> 2007/12/11, Tim Hunter <TimHunter@nc.rr.com>:
>
> > For a problem of this scale, it seems like it would make sense to use a
> > custom class that had some of the methods of String - enough for the
> > callees to treat it like a String - but not in fact String. Give it a
> > .to_s method to convert to a real String when it's really necessary.
>
> But you still have the GC overhead - regardless whether you create
> String or Object.

Not necessarially. If you set the contract, as I have, that the object
is valid only for the duration of execution of the callback to which
it's passed, and that validity expires once the callback returns, you
can allocate 50 (or whatever) of these read-only pseudo-String objects
at the start of reading the file, and re-use them for each row. The key
is that anything that the callback wants to save for use after this
particular call has exited will have to be copied.

> I.e., that way you would create a single String per line only while
> allowing the caller to create substring instances if needed. If the
> client code needs to do that anyway you could as well create those
> String instances yourself because it does not make any difference. If
> not, you save factor of 50 creations

When you're talking about processing twenty and thirty million row
files, even with the factor of 50 savings you've still got a problem.
(Though I should benchmark this to see how it compares. A quick
comparison on my machine:

a = "aaaa"
n = 10_000_000
n10 = 100_000_000
n.times { } # 2 seconds
n.times { rand(n10) } # 16 seconds
n.times { a + 'b' } # 24 seconds
n.times { rand(n10}.to_s } # 27 seconds

[Oops, looks like "a + 'b'" isn't a no-copy operation after the first
one.])

I should perhaps explain that the common case is that the strings
are examined, but not modified. Particularly common is a relational
restriction: abort processing that row (for that particular chain of
Conditions and Results) if a data item doesn't match a particular value,
or fall in between some particular values, or whatever. Actually needing
to copy a value is relatively rare. More frequent would be incrementing
a counter if a value matches some condition, or adding the value to a
total.

I should also explain that having these pseudo-strings as actual Ruby
objects immediately after parsing is partly a convenience; in something
that needed to be really fast, the initial Conditions, at least, and
later ones and Results, if they match frequently enough, would be in C,
and I suppose I could delay creating the String objects until I had to
call a Ruby routine.

But as it stands, even the callbacks to Ruby are surprisingly fast, if
they don't do a lot. On my machine, for an 800,000 line file (entirely
cached in memory), a scan running only C code takes 1.8 seconds of
CPU. Doing an rb_str_new on each line brings it up to 2.3 seconds;
doing instead a call into a Ruby function that does a simple string
comparison against an instance variable takes 3.4 seconds.

> But I'd say chances are that client code will do more complex
> manipulations....

As above; in many cases the intial comparisons to discard the line can
be done by simple C code.

> Here's another variant: if you mmap the file and it fits into mem...

Sometimes not my case, unfortunately; I have to be able to deal with
files several gigabytes long on a 32-bit machine.

Thanks for all of the useful advice.

cjs
--
Curt Sampson <cjs@starling-software.com> +81 90 7737 2974
Mobile sites and software consulting: http://www.starling-so...

Adam Shelly

12/11/2007 7:06:00 PM

On 12/11/07, Curt Sampson <cjs@cynic.net> wrote:
> On 2007-12-11 17:22 +0900 (Tue), Robert Klemme wrote:
> > 2007/12/11, Tim Hunter <TimHunter@nc.rr.com>:
> >
> > > For a problem of this scale, it seems like it would make sense to use a
> > > custom class that had some of the methods of String - enough for the
> > > callees to treat it like a String - but not in fact String. Give it a
> > > .to_s method to convert to a real String when it's really necessary.
> >
>
> I should perhaps explain that the common case is that the strings
> are examined, but not modified. Particularly common is a relational
> restriction: abort processing that row (for that particular chain of
> Conditions and Results) if a data item doesn't match a particular value,
> or fall in between some particular values, or whatever. Actually needing
> to copy a value is relatively rare. More frequent would be incrementing
> a counter if a value matches some condition, or adding the value to a
> total.
>
> I should also explain that having these pseudo-strings as actual Ruby
> objects immediately after parsing is partly a convenience; in something
> that needed to be really fast, the initial Conditions, at least, and
> later ones and Results, if they match frequently enough, would be in C,
> and I suppose I could delay creating the String objects until I had to
> call a Ruby routine.
>
Would the immutable Rope implementation from RubyQuiz 137 help here?

comp.lang.ruby

Zero-copy String Handling

Curt Sampson

Eric I.

Robert Klemme

ara.t.howard

Curt Sampson

Tim Hunter

Curt Sampson

Tim Hunter

Robert Klemme

Curt Sampson

Adam Shelly

x Login to ForumsZone