Robert Klemme
12/11/2007 8:23:00 AM
2007/12/11, Tim Hunter <TimHunter@nc.rr.com>:
> Curt Sampson wrote:
> > On 2007-12-11 02:44 +0900 (Tue), Robert Klemme wrote:
> >
> >> This is what I'd do: create a single string per line and use substring
> >> (aka #[]) to create strings that represent the portion needed; byte
> >> buffer will be shared then. You don't even need to freeze them because
> >> of copy on write.
> >
> > This was attractive for a couple of seconds, until I realized that not
> > only does it still add a copy of the entire row of data (albeit as one
> > large allocation rather than many small ones), but it also doesn't
> > reduce my object creation load at all. I seem to recall last time I was
> > playing around with this sort of thing and using a profiler, GC was an
> > enormous cost for me. This probably isn't surprising given the nature of
> > the problem; a typical file might be ten million rows of fifty elements
> > each, which would be 500 million object creations and collections.
> >
> > cjs
>
> For a problem of this scale, it seems like it would make sense to use a
> custom class that had some of the methods of String - enough for the
> callees to treat it like a String - but not in fact String. Give it a
> .to_s method to convert to a real String when it's really necessary.
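For illustration, a minimal sketch of such a class might look like this
(all names are made up):

  # Holds a shared line plus two Fixnum offsets; a real String is
  # created only when #to_s is called.
  class FieldSlice
    def initialize(line, start, length)
      @line, @start, @length = line, start, length
    end

    attr_reader :length

    def ==(other)
      to_s == other.to_s
    end

    # convert to a real String only when it's really necessary
    def to_s
      @line[@start, @length]
    end
  end

  line = "foo,bar,baz"
  FieldSlice.new(line, 4, 3).to_s # => "bar"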
But you still have the GC overhead - regardless of whether you create
a String or some other Object.
Another idea would be to change the interface. "Pseudo" code:
io.each do |line|
  line.freeze
  some_smart_parsing(line) do |l, start, stop|
    # l is the frozen line and does not change; start and stop are
    # integer indexes into it
  end
end
I.e., that way you would create only a single String per line while
allowing the caller to create substring instances if needed. If the
client code needs to do that anyway, you could just as well create
those String instances yourself because it makes no difference. If
not, you save a factor of 50 in object creations (the integer indexes
are Fixnums, which are immediate values and create no GC load).
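For example, a (completely made up) some_smart_parsing for comma
separated fields might look like this:

  # yields Fixnum offsets into the frozen line instead of creating
  # a String per field
  def some_smart_parsing(line, sep = ",")
    start = 0
    while (stop = line.index(sep, start))
      yield line, start, stop
      start = stop + 1
    end
    yield line, start, line.length # last field
  end

  ARGF.each do |line|
    line.chomp!
    line.freeze
    some_smart_parsing(line) do |l, start, stop|
      # the caller creates a real substring only when it needs one
      puts l[start...stop] unless start == stop
    end
  end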
But I'd say chances are that client code will do more complex
manipulations, and in that case the question is whether you have the
right (i.e. fast enough) tool at all. Those blocks *will* do some
calculation, and that will likely create objects as well. I'd say you
either have to live with the overhead or use a different tool
altogether.
Here's another variant: if the file fits into memory and you mmap it,
you could just as well do
some_smart_parsing do |s, start, stop, field_index|
  # s is the whole file and does not change; start and stop are
  # integer indexes into it
end
The field index could be a flag indicating whether this is the first
field of a record, so the client can detect record boundaries. But
this interface starts to get contrived, and you still have the
overhead in the block.
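Roughly like this (File.read stands in for mmap here - a real mmap
extension with a String-like interface would avoid copying the whole
file into a Ruby String, but the parsing would look the same; the
file name is made up):

  # scans the whole file String, yielding offsets plus a flag for the
  # first field of each record
  def some_smart_parsing(s, sep = ",", rs = "\n")
    start = 0
    first = true
    while start < s.length
      stop = s.index(sep, start)
      eol  = s.index(rs, start) || s.length
      if stop.nil? || eol < stop # last field of the record
        yield s, start, eol, first
        start = eol + 1
        first = true
      else
        yield s, start, stop, first
        start = stop + 1
        first = false
      end
    end
  end

  s = File.read("data.csv").freeze
  some_smart_parsing(s) do |str, start, stop, first_field|
    print "\n" if first_field
    print str[start...stop], " "
  end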
Kind regards
robert
--
use.inject do |as, often| as.you_can - without end