nate
11/9/2007 10:09:00 PM
M. Edward (Ed) Borasky wrote:
> Processes in the "D" state for significant periods of time tell me that
> there *is* an I/O problem. Are these Ruby processes? Are other
> processes/kernel threads, for example, "kswapd" or "bdflush" also in a
> "D" state?
Normally I would agree, and no, there are not; running iostat on the box
also shows the disks 99.99% or 100% idle, and swap usage is 0 as well.
Which is what gets me so dumbfounded. The I/O subsystems on these
boxes are fast as hell (waaaaaay overkill), and the apps hardly use
disk at all short of writing log files (and sometimes Ruby session
files).
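For anyone else chasing something similar: a quick way to see exactly
which processes are stuck in uninterruptible sleep (assuming a procps
ps that supports -eo format strings, as these 2.6 boxes have) is
something like:

    # show state, pid, kernel wait channel, and command for D-state processes
    ps -eo state,pid,wchan:30,cmd | awk '$1 == "D"'

The wchan column is handy because it shows which kernel function each
process is actually blocked in.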
> If you haven't already done so, install the "sysstat" RPM. That will
> give you a marvelous tool called "iostat". Run "iostat -x -t" with
> ten-second samples and collect the logs in a file. When your system
> starts getting "D" state processes, look at the time stamp and compare
> it with the "iostat" logs. If you see one or more disks with a
> utilization of 100 percent, you're waiting for disk.
Agree there too, and already done; sorry if I didn't clarify that in
the original email! Top shows a lot of I/O wait, as does sar (as far
as CPU I/O wait goes), but it seems to be just an artifact of the
defunct processes, as there are no other signs that I/O is being used.
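For reference, I've been collecting the samples with an invocation
along these lines (the log path is just an example):

    # extended per-device stats with timestamps, 10-second samples,
    # appended to a log we can line up against the D-state sightings
    iostat -x -t 10 >> /var/log/iostat.log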
> Another thing to look at is those memory leaks. I don't remember what
> Fedora Core 4 used for a kernel, but CentOS 4.5 is a 2.6.9 kernel IIRC.
> The memory leaks may be starving the system of memory. "vmstat" will
> tell you that -- if you see lots of swap in and swap out, your memory
> leaks are making your system thrash.
Agree there too, and there are no signs of swapping; the systems have
8GB of memory and seem to run in the 3-5GB range. The biggest app does
have a memory leak, and the system auto-restarts it once the memory
usage for the app exceeds 5GB.
> Well ... I wouldn't say grasping at straws so much as trying to many
> things at once to really isolate the problem. You've got way too many
> variables and factors at play here. If you've got one system /
> configuration that's stable, clone it and quit futzing with it for a
> week or so. :) Then make one change at a time, so you know what broke it!
Yeah, that's a good point. However, honestly, even having been
managing stuff for as long as I have, this sort of change seemed
fairly trivial; I mean, the OS is almost the same, same major version
of Apache, same version of FastCGI, etc. And it worked fine in test
environments for a month. The developers are pushing hard to
upgrade to Ruby 1.8.6 in the next couple of weeks, so that'll be
another change. But in general, yeah, I do like the approach of
changing one thing at a time; I was just getting so frustrated with
still having such an obsolete platform in our environment (when
most everything else has been moved off), and with the failure to
deploy 64-bit Ruby (due to load issues), I just wanted to get it
done.
Worst case, I can roll back to the other OS in a couple of hours;
however, rebuilding another version of Ruby and the associated
gems/addons, then packaging them as RPMs and adjusting CFEngine
to push them out, is about a 16-24 hour job, not something I
look forward to repeating very often (I just did it for Ruby 1.8.6).
> Incidentally, memory leaks are defects -- period. Either the
> infrastructure or the application code is broken. If it's your
> application code, get your developers busy fixing it. If it's open
> source code, bring it to the attention of the relevant community.
Agree again there, good points all around. The developers have been
attentive to the memory leaks but so far haven't done a whole lot.
They do get the emails about the auto-restarts of the apps when
they exhaust their memory. One developer here brought up a
big memory leak fix in the hash(?) routines in Ruby 1.8.6, which
is part of the push to upgrade, to see if that helps. While
investigating this problem I dug into the version of Ruby 1.8.5
we're running, and it turns out Red Hat backported that particular
fix to 1.8.5; so far the memory usage curves are similar to 1.8.4,
so it seems that particular leak didn't impact us as much as we
might have hoped.
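Verifying the backport was straightforward, for anyone curious; the
package changelog usually records it:

    # scan the packaged Ruby's changelog for the relevant patch notes
    rpm -q --changelog ruby | less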
We've also spent some time with a memory leak profiler (I forget the
name offhand, but it runs under 1.8.6); to my knowledge we haven't
had a lot of success getting it working (though I'm not on the dev
side of things, so they may have made progress since I last got an
update).
> I think that's *exactly* what you need to do -- roll back to a stable
> configuration and introduce changes one at a time.
That may be what I end up doing; really, there isn't much that has
changed:
- Apache: same version (2 minor revs earlier)
- OS: updated (a fairly significant forklift)
- Ruby: updated
- FastCGI: same version (compiled against the new OS)
So I could go the route of running Ruby 1.8.5 on the older OS, but
I do dread the amount of work that entails, especially when we're
being pushed to 1.8.6 in a matter of 2, maybe 3 weeks.
Appreciate the thoughts; they mostly confirm what I already
suspected. I was hoping for a miracle that would save me a lot more
work, and I knew it probably wasn't going to happen, but you
(hopefully) can't blame me for hoping :)
nate