Gregory Brown
8/12/2007 2:58:00 PM
On 8/12/07, Lionel Bouton <lionel-subscription@bouton.name> wrote:
> John Joyce wrote the following on 12.08.2007 07:47 :
> > The OP DID mention [long running Rails processes] !
> > OP Quote:
> > "I immediately thought of Ruby because of my experience with long running
> > Rails processes. It might be the occasion to have my gut feelings
> > checked by people that really know the inner workings of Ruby 1.8 too."
> >
>
> To be more accurate on my experience: I have a Rails application with
> the usual CRUD behaviour which isn't especially memory intensive
> (processes happily sit around 30-40MB, which seems the minimum for a
> Rails application). But there are some actions that process incoming CSV
> files with a rather bad memory behaviour. The process size jumps as soon
> as these actions are called with a size roughly proportional to the CSV
> line count.
Are you loading the CSVs entirely into memory or processing them line
by line? With a large CSV, if you process it line by line (even if
you're going to ultimately store it), you're less likely to hit the
same kind of memory spike you'd get loading it entirely into memory.
This is a contrived example, but in Ruport, this code, which
ultimately breaks a table of records down into a simple array of
arrays, takes a lot of memory (110MB for 50,000 lines):
a = Table("hygjan2007.csv").map { |r| r.to_a }
Whereas this one, which does the conversion on the fly, takes a whole lot less (67MB):
>> a = []
=> []
>> Table("hygjan2007.csv", :records => true) do |t,r|
?> a << r.to_a
>> end
Of course, this isn't a practical Ruport example; I'm intentionally
ballooning the memory with an unnecessary record conversion just to
end up with primitive objects for the final storage (an array of
arrays).
Now, since the end results of those two code samples are the same, I'm
pretty sure that when it came time to free up that memory in a crunch,
they'd compress down to the same size. But if you don't want the spike
in the first place, introducing row processing for your CSVs instead
of slurping is a good thing anywhere you can manage it.
If you're already doing that, this advice isn't very helpful to you,
but may be to others.
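For the non-Ruport case, here's a rough sketch of the same
slurp-vs-stream distinction using Ruby's standard CSV library. (Note
I'm writing against the modern stdlib CSV API rather than 1.8's, and
the sample data is made up.)

```ruby
require 'csv'

# Tiny sample standing in for a large CSV file (hypothetical data).
data = "name,qty\napples,3\npears,2\napples,1\n"

# Slurping: CSV.parse builds every row in memory at once, so with a
# big file the peak footprint grows with the line count.
all_rows = CSV.parse(data, headers: true)

# Streaming: CSV.parse with a block (or CSV.foreach on a file) yields
# one row at a time, so you only hold on to what you choose to keep --
# here, a small hash of running totals instead of every row.
totals = Hash.new(0)
CSV.parse(data, headers: true) do |row|
  totals[row["name"]] += row["qty"].to_i
end
```

Either way you get the same answer; the difference is only in how much
sits in memory while you're getting it.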
-greg