comp.lang.ruby

Faster Marshaling?

Greg Willits

7/27/2008 2:58:00 AM

Exploring options... wondering if there's anything that can replace
marshaling that's similar in usage (dump & load to/from disk file), but
faster than the native implementation in Ruby 1.8.6

I can explain some details if necessary, but in short:

- I need to marshal,

- I need to swap data sets often enough that performance
will be a problem (currently it can take several seconds to restore
some marshaled data -- way too long)

- the scaling is such that more RAM per box is costly enough to pay for
development of a more RAM efficient design

- faster Marshal performance is worth asking about, to see how much
it'll get me.

I'm hoping there's something that's as close to a memory space dump &
restore as possible -- no need to "reconstruct" data piece by piece
which Ruby seems to be doing now. It takes < 250ms to load an 11MB raw
data file via readlines, and 2 seconds to load a 9MB sized Marshal file,
so clearly Ruby is busy rebuilding stuff rather than just pumping a RAM
block with a binary image.
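
For reference, the kind of quick timing I'm doing looks roughly like
this (file names here are just placeholders):

  require 'benchmark'

  # raw tab file via readlines: < 250ms for ~11MB
  puts Benchmark.measure { lines = File.readlines("data.tab") }

  # marshaled version of the same data: ~2 seconds for a 9MB file
  puts Benchmark.measure { data = Marshal.load(File.read("data.marshal")) }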

TIA for any ideas.

-- gw
--
Posted via http://www.ruby-....

12 Answers

Eric Hodel

7/27/2008 3:24:00 AM


On Jul 26, 2008, at 19:58 PM, Greg Willits wrote:
> Exploring options... wondering if there's anything that can replace
> marshaling that's similar in usage (dump & load to/from disk file),
> but
> faster than the native implementation in Ruby 1.8.6
>
> I can explain some details if necessary, but in short:
>
> - I need to marshal,
>
> - I need to swap data sets often enough that performance
> will be a problem (currently it can take several seconds to restore
> some marshaled data -- way too long)
>
> - the scaling is such that more RAM per box is costly enough to pay
> for
> development of a more RAM efficient design
>
> - faster Marshal performance is worth asking about, to see how much
> it'll get me.
>
> I'm hoping there's something that's as close to a memory space dump &
> restore as possible -- no need to "reconstruct" data piece by piece
> which Ruby seems to be doing now. It takes < 250ms to load an 11MB raw
> data file via readlines, and 2 seconds to load a 9MB sized Marshal
> file,

readlines? not read? readlines should be used for text, not binary
data. Also, supplying an IO to Marshal.load instead of a pre-read
String adds about 30% overhead for constant calls to getc.
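
That is, slurp the file into a String first and hand that to
Marshal.load, rather than handing it the open File (path is made up):

  # slower: Marshal.load pulls bytes off the IO one call at a time
  data = File.open("data.marshal", "rb") { |f| Marshal.load(f) }

  # faster: read the whole file, then load from the String
  data = Marshal.load(File.open("data.marshal", "rb") { |f| f.read })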

9MB seems like a lot of data to load, how many objects are in the
dump? Do you really need to load a set of objects that large?

> so clearly Ruby is busy rebuilding stuff rather than just pumping a
> RAM
> block with a binary image.

Ruby is going to need to call allocate for each object in order to
register with the GC and build the proper object graph. I doubt
there's a way around this without extensive modification to ruby.

David Masover

7/27/2008 3:29:00 AM


On Saturday 26 July 2008 21:58:22 Greg Willits wrote:

> - I need to swap data sets often enough that performance
> will be a problem (currently it can take several seconds to restore
> some marshaled data -- way too long)

Why do you need to do this yourself?

> - the scaling is such that more RAM per box is costly enough to pay for
> development of a more RAM efficient design

What about more swap per box?

It might be slower, maybe not, but it seems like the easiest thing to try.

Another possibility would be to use something like ActiveRecord -- though you
probably want something much more lightweight (suggestions? I keep forgetting
what's out there...) -- after all, you probably aren't operating on the whole
dataset at once, so what you really want is something reasonably fast at
loading/saving individual objects.

Greg Willits

7/27/2008 5:07:00 AM


Eric Hodel wrote:
> On Jul 26, 2008, at 19:58 PM, Greg Willits wrote:
>> I'm hoping there's something that's as close to a memory space dump &
>> restore as possible -- no need to "reconstruct" data piece by piece
>> which Ruby seems to be doing now. It takes < 250ms to load an 11MB raw
>> data file via readlines, and 2 seconds to load a 9MB sized Marshal
>> file,

> Ruby is going to need to call allocate for each object in order to
> register with the GC and build the proper object graph. I doubt
> there's a way around this without extensive modification to ruby.

Hmm, makes sense of course. Was just hoping someone had a clever
replacement.

I'll just have to try clever code that minimizes the frequency of
re-loads.

If you're curious about the back story, I've explained it more below.


> readlines? not read? readlines should be used for text, not binary
> data. Also, supplying an IO to Marshal.load instead of a pre-read
> String adds about 30% overhead for constant calls to getc.

Wasn't using it on binary data -- was just making a note that an 11MB
tab file (about 45,000 lines) took all of 250ms (actually 90ms on my
server drives) to read into an array using readlines. Whereas loading a
marshaled version of that same data (reorganized, and saved as an array
of hashes) from a file that happened to be 9MB took almost 2 seconds --
so there's clearly a lot of overhead in restoring a marshaled object.
That was my point.

> 9MB seems like a lot of data to load, how many objects are in the
> dump? Do you really need to load a set of objects that large?

Yes, and that's not the largest, but it's an average. Range is 1 MB to
30 MB of raw data per file. A few are 100+ MB, one is 360 MB on its
own, but it's an exception.

This is a data aggregation framework. One generic framework will run as
multiple app-specific instances where each application has a data set of
4-8GB of raw text data (from 200-400 files). That raw data is loaded,
reorganized into standardized structures, and one or more indexes
generated per original file.

One application instance per server. The server is used as a workgroup
intranet web server by day (along with its redundant twin), and as an
aggregator by night.

That 9MB Marshaled file is the result of one data source of 45,000 lines
being re-arranged, each data element cleansed and transformed, and then
stored as an array of hashes. An index is stored as a separate Marshaled
file so it can be loaded independently.
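
In code terms it's basically this shape (names here are simplified):

  rows  = []   # array of hashes, one per cleansed/transformed source line
  index = {}   # lookup key => position in rows

  File.open("source_042.data",  "wb") { |f| f.write(Marshal.dump(rows))  }
  File.open("source_042.index", "wb") { |f| f.write(Marshal.dump(index)) }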

Those 300 or so original files, having been processed and indexed, are
now searched and combined in a complex aggregation (sadly, not just
simple mergers) which nets a couple dozen tab files for LOAD DATA into a
database for the web app.

Based on a first version of this animal, spikes on faster hardware,
accounting for new algorithms and growth in data sizes, this process
will take several hours even on a new Intel server with everything
loaded into RAM. And that's before we start to add a number of new
tasks to the application.

Of course, we're looking at ways to split the processing to take
advantage of multiple cores, but that just adds more demand on memory
(DRb way too slow by a couple orders of magnitude to consider using as a
common "memory" space" for all cores).

The aggregation is complex enough that in a perfect world, I'd have the
entire data set in RAM all at once, because any one final data table
pulls its data from numerous sources and alternate sources if the
primary doesn't have it, on a field-by-field basis. field1 comes from
sourceX, field2 from sourceA, and sourceB if A doesn't have it. It gets
hairy :-)

Unlike a massive web application, where any one transaction can take as
long as 1 second or even 2 to complete and you throw more machines at it
to handle increases in requests, this is a task trying to get tens of
millions of field transformations and millions of hash reads completed
linearly as quickly as possible. So the overhead of DRb and similar
approaches isn't acceptable.


David Masover wrote:
>> - I need to swap data sets often enough that performance
>> will be a problem (currently it can take several seconds to restore
>> some marshaled data -- way too long)
>
> Why do you need to do this yourself?

As a test, I took that one 9MB sample file mentioned above, and loaded
it as 6 unique objects to see how long that would take, and how much RAM
would get used -- Ruby ballooned into using 500MB of RAM. In theory I
would like to have every one of those 300 files in memory, but
logistically I can easily get away with 50 to 100 at once. But if Ruby
is going to balloon that massively, I won't even get close to 50 such
data sets in RAM at once. So, I "need" to be able to swap data sets in &
out of RAM as needed (hopefully with an algorithm that minimizes the
swapping by processing batches which all reference the same loaded data
sets).
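
The swapping would be something along the lines of a small cache that
evicts the least recently used data set once a budget is hit (just a
sketch; the file naming is made up):

  class DataSetCache
    def initialize(max_sets)
      @max_sets = max_sets
      @sets     = {}   # name => unmarshaled data
      @order    = []   # least recently used first
    end

    def fetch(name)
      if @sets.key?(name)
        @order.delete(name)
      else
        @sets.delete(@order.shift) if @sets.size >= @max_sets
        @sets[name] = Marshal.load(File.read("#{name}.data"))
      end
      @order << name
      @sets[name]
    end
  end

  cache = DataSetCache.new(50)
  rows  = cache.fetch("source_042")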


>> - the scaling is such that more RAM per box is costly enough to pay for
>> development of a more RAM efficient design
>
> What about more swap per box? It might be slower, maybe not, but it seems
> like the easiest thing to try.

More "swap"? You mean virtual memory? I may be wrong, but I am assuming
regardless of how effective VM is, I can easily saturate real RAM, and
it's been my experience that systems just don't like all of their real
RAM full.

Unless there's some Ruby commands to tell it to specifically push
objects into the OS's VM, I think I am stuck having to manage RAM
consumption on my own. ??


> Another possibility would be to use something like ActiveRecord --

Using the db especially through AR would be glacial. We have a db-based
process now, and need something faster.

-- gw

--
Posted via http://www.ruby-....

David Masover

7/27/2008 5:18:00 AM


On Sunday 27 July 2008 00:07:10 Greg Willits wrote:

> >> - the scaling is such that more RAM per box is costly enough to pay for
> >> development of a more RAM efficient design
> >
> > What about more swap per box? It might be slower, maybe not, but it seems
> > like the easiest thing to try.
>
> More "swap"? You mean virtual memory? I may be wrong, but I am assuming
> regardless of how effective VM is, I can easily saturate real RAM, and
> it's been my experience that systems just don't like all of their real
> RAM full.

In general, yes. However, if this is all the system is doing, I'm suggesting
that it may be useful -- assuming there isn't something else that makes this
impractical, like garbage collection pulling everything out of RAM to see if
it can be collected. (I don't know enough about how Ruby garbage collection
works to know if this is a problem.)

But then, given the sheer size problem you mentioned earlier, it probably
wouldn't work well.

> > Another possibility would be to use something like ActiveRecord --
>
> Using the db especially through AR would be glacial. We have a db-based
> process now, and need something faster.

I specifically mean something already designed for this purpose -- not
necessarily a traditional database. Something like berkdb, or "stone" (I
think that's what it was called) -- or splitting it into a bunch of files, on
a decent filesystem.
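
By "a bunch of files" I mean something as simple as one marshal file per
record or data set, keyed by id, letting the filesystem cache do its
thing (just a sketch; paths and names are made up):

  # assumes the directory already exists
  def save_record(dir, id, obj)
    File.open(File.join(dir, "#{id}.obj"), "wb") { |f| f.write(Marshal.dump(obj)) }
  end

  def load_record(dir, id)
    Marshal.load(File.read(File.join(dir, "#{id}.obj")))
  end

  save_record("records", 42, { "field1" => "abc" })
  load_record("records", 42)   # => { "field1" => "abc" }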

Greg Willits

7/27/2008 5:34:00 AM


David Masover wrote:
> On Sunday 27 July 2008 00:07:10 Greg Willits wrote:

>> > Another possibility would be to use something like ActiveRecord --
>>
>> Using the db especially through AR would be glacial. We have a db-based
>> process now, and need something faster.
>
> I specifically mean something already designed for this purpose -- not
> necessarily a traditional database. Something like berkdb, or "stone" (I
> think that's what it was called) -- or splitting it into a bunch of
> files, on a decent filesystem.

Berkeley DB has been sucked up by Oracle, and I don't think it ever ran
on OS X anyway.

We have talked about skipping Marshaling and going straight to standard
text files on disk and then using read commands that point to a specific
file line.

We haven't spiked that yet, but I don't see it being significantly
faster than using a local db (especially since db cache might be
useful), but it's something we'll probably at least investigate just to
prove its comparative performance. It might be faster just because we
can keep all indexes in RAM, get some 15,000 rpm drives, and probably
implement some caching to reduce disk reads.

So, yeah, maybe that or even sqlite might be suitable if the RAM thing
just gets too obnoxious to solve. Something that would prove to be
faster than MySQL.

-- gw

--
Posted via http://www.ruby-....

David Masover

7/27/2008 5:44:00 AM


On Sunday 27 July 2008 00:33:56 Greg Willits wrote:

> We have talked about skipping Marshaling and going straight to standard
> text files on disk and then using read commands that point to a specific
> file line.

If the files aren't changing, you probably want to seek to a specific byte
offset in the file, rather than a line -- the latter requires you to read
through the entire file up to that line.
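
i.e., build a small index of byte offsets once, then seek straight to
the record you want (sketch only; names are made up):

  # build the offset index once per file
  offsets = []
  File.open("data.tab", "rb") do |f|
    until f.eof?
      offsets << f.pos
      f.gets
    end
  end

  # later: jump straight to line n without scanning the whole file
  def line_at(path, offsets, n)
    File.open(path, "rb") do |f|
      f.seek(offsets[n])
      f.gets
    end
  end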

> We haven't spiked that yet, but I don't see it being significantly
> faster than using a local db (especially since db cache might be
> useful),

More useful than the FS cache?

> So, yeah, maybe that or even sqlite might be suitable if the RAM thing
> just gets too obnoxious to solve. Something that would prove to be
> faster than MySQL.

For what it's worth, ActiveRecord does work on SQLite. So does Sequel, and I
bet DataMapper does, too.

I mentioned BerkDB because I assumed it would be faster than SQLite -- but
that's a completely uninformed guess.

Joel VanderWerf

7/27/2008 6:52:00 AM


Greg Willits wrote:
>>> - the scaling is such that more RAM per box is costly enough to pay for
>>> development of a more RAM efficient design
>> What about more swap per box? It might be slower, maybe not, but it seems
>> like the easiest thing to try.
>
> More "swap"? You mean virtual memory? I may be wrong, but I am assuming
> regardless of how effective VM is, I can easily saturate real RAM, and
> it's been my experience that systems just don't like all of their real
> RAM full.

More swap might help, if you assign one ruby process per data set. Then
switching data sets means just letting the vm swap in a different
process, if it needs to.
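
Sketch of that idea -- one resident process per data set, with the
request loop left hypothetical:

  Dir["*.data"].each do |path|
    fork do
      rows = Marshal.load(File.read(path))  # loaded once, stays resident
      serve_queries(rows)                   # hypothetical loop answering requests for this set
    end
  end
  Process.waitall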

--
vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Robert Klemme

7/27/2008 10:22:00 AM


On 27.07.2008 07:44, David Masover wrote:
> On Sunday 27 July 2008 00:33:56 Greg Willits wrote:
>
>> We have talked about skipping Marshaling and going straight to standard
>> text files on disk and then using read commands that point to a specific
>> file line.
>
> If the files aren't changing, you probably want to seek to a specific byte
> offset in the file, rather than a line -- the latter requires you to read
> through the entire file up to that line.

Array#pack and String#unpack come to mind. But IMHO this is still
inferior to using a relational database because in the end it comes down
to reimplementing the same mechanisms that are present there already.
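
For fixed-width records that might look something like this (the field
layout here is invented):

  FORMAT      = "NGA20"    # 32-bit id, double, 20-byte string
  RECORD_SIZE = 4 + 8 + 20

  record = [42, 3.14, "hello"].pack(FORMAT)
  File.open("records.bin", "wb") { |f| f.write(record) }

  # read record i by seeking to a fixed offset
  i = 0
  id, value, name = File.open("records.bin", "rb") do |f|
    f.seek(i * RECORD_SIZE)
    f.read(RECORD_SIZE).unpack(FORMAT)
  end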

> For what it's worth, ActiveRecord does work on SQLite. So does Sequel, and I
> bet DataMapper does, too.

But keep in mind that AR and the like introduce some overhead of
themselves. It might be faster to just use plain SQL to get at the data.

But given the problem description I would definitely go for a
relational or other database system. There is no point in reinventing the
wheel (aka fast indexing of large data volumes on disk) yourself. You
might even check RAA for an implementation of B-trees.

Kind regards

robert

Robert Klemme

7/27/2008 10:32:00 AM


On 27.07.2008 12:21, Robert Klemme wrote:

> But given the problem description I would definitely go for a
> relational or other database system. There is no point in reinventing the
> wheel (aka fast indexing of large data volumes on disk) yourself. You
> might even check RAA for an implementation of B-trees.

Just after sending I remembered a thread in another newsgroup. The
problem sounds a bit related to yours and eventually the guy ended up
using CDB:

http://cr.yp.t...

There's even a Ruby binding:

http://raa.ruby-lang.org/pr...

His summary is here, the problem description is at the beginning of the
thread:

http://groups.google.com/group/comp.unix.programmer/msg/420c2c...

Kind regards

robert

James Gray

7/27/2008 2:33:00 PM


On Jul 27, 2008, at 12:33 AM, Greg Willits wrote:

> So, yeah, maybe that or even sqlite might be suitable if the RAM thing
> just gets too obnoxious to solve.

I would be shocked if SQLite can't be made to solve the problem well
with the right planning. That little database is always surprising
me. Don't forget to look into the following two features as it sounds
like they may be helpful in this case:

* In memory databases
* Attaching multiple SQLite files to perform queries across them
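
With the sqlite3 gem that looks roughly like this (table and file names
are invented):

  require 'sqlite3'

  db = SQLite3::Database.new(":memory:")   # in-memory database

  # attach per-source files and query across them in one statement
  db.execute("ATTACH DATABASE 'source_a.db' AS a")
  db.execute("ATTACH DATABASE 'source_b.db' AS b")

  rows = db.execute(
    "SELECT a.items.field1, b.items.field2
       FROM a.items JOIN b.items USING (id)")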

James Edward Gray II