Robert Klemme
11/17/2007 8:20:00 PM
On 17.11.2007 15:13, Jano Svitok wrote:
> On Nov 17, 2007 7:58 AM, Ray Chen <ray.c.chen@gmail.com> wrote:
>> My process memory usage has been increasing steadily, and some probing
>> pointed me to REXML. I created a test that consisted of feeding 10 xml
>> files ranging in size from 15kB to 270kB to REXML::Document.new(). The
>> files are fed smallest to largest. I would think that memory usage
>> should return back to ~8 MB since the REXML::Document should go out of
>> scope, and everything should get garbage-collected.
>>
>> Is there something wrong with my understanding of Ruby or does REXML
>> hold onto memory?
>
> You can do marginally better by replacing
>> #create the string
>> f = File.open("/tmp/#{i}.xml", 'r')
>> str = ''
>>
>> while line = f.gets
>> str << line
>> end
>> f.close
>
> with
>
> str = File.read("/tmp/#{i}.xml")
There is an even better method for reading XML documents:
doc = File.open("/tmp/#{i}.xml", 'rb') {|io| REXML::Document.new io}
No need to read the whole file into a large string before it is parsed
as XML.
> NB: Your version would be better written (with regard to
> exception safety etc.) as:
>
> str = ''
> File.open("/tmp/#{i}.xml", 'r') do |f|
> while line = f.gets
> str << line
> end
> end
If I were doing the reading myself, I'd choose #read over #gets. The
reason is that line reading is a form of parsing the input, and that
should be left to the XML parser.
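To illustrate the #read versus #gets point, here is a minimal sketch.
The helper names (slurp_with_gets, slurp_with_read, compare_readers)
and the SAMPLE_XML content are made up for this example; the temp file
only stands in for one of the /tmp/#{i}.xml files. Both helpers return
the same string, but the #gets variant makes Ruby look for line ends
and build one String per line first.

```ruby
require 'tempfile'

# Hypothetical sample content standing in for one of the test files.
SAMPLE_XML = "<root>\n  <item id=\"1\"/>\n  <item id=\"2\"/>\n</root>\n"

# Line-by-line reading: Ruby scans for line terminators and creates
# one String per line before it is appended to the buffer.
def slurp_with_gets(path)
  str = ''.dup
  File.open(path, 'r') do |f|
    while (line = f.gets)
      str << line
    end
  end
  str
end

# Whole-file reading: no line splitting, the XML parser does all parsing.
def slurp_with_read(path)
  File.read(path)
end

# Writes the sample to a temp file and reads it back both ways.
def compare_readers
  Tempfile.create(['sample', '.xml']) do |tmp|
    tmp.write(SAMPLE_XML)
    tmp.flush
    [slurp_with_gets(tmp.path), slurp_with_read(tmp.path)]
  end
end
```

The result is identical either way, so the choice only affects how much
intermediate work and garbage is created while reading.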
>> #construct the xml
>> xml = REXML::Document.new(str)
>> xml = nil
>>
>> return nil
>> end
>
> As Robert said, there are more things happening. One of them is that
> ruby allocates memory in increasing heap blocks.
> If anything used is still inside the block, the block won't be
> released to system.
>
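A rough way to see the difference between "freed inside Ruby" and
"released to the system" is to count live object slots around a burst
of allocations. This is only a sketch and assumes a current MRI, where
ObjectSpace.count_objects is available (it is not in 1.8.5). The names
live_objects and allocate_garbage are made up, and the strings merely
stand in for the nodes of a parsed document.

```ruby
# Number of occupied object slots in Ruby's heap after a full GC run.
def live_objects
  GC.start
  counts = ObjectSpace.count_objects
  counts[:TOTAL] - counts[:FREE]
end

# Creates n short-lived strings and drops every reference to them.
def allocate_garbage(n)
  Array.new(n) { |i| "node #{i}" }
  nil
end

before = live_objects
allocate_garbage(200_000)
after = live_objects

# Slots the interpreter currently holds, whether occupied or free.
heap_slots = ObjectSpace.count_objects[:TOTAL]

puts "live before: #{before}, live after: #{after}, heap slots: #{heap_slots}"
```

The live count usually drops back close to its starting value, while
the number of heap slots tends to stay at its enlarged size. That
retained heap is what shows up as process memory that never shrinks.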
> I tried to reuse one string as a buffer for the file, but it didn't
> help [see IO#read(length, buffer)]. Another thing I tried was to
> send the file itself to REXML::Document.new, but it was even worse [
> File.open(...) {|f| REXML::Document.new(f) }].
Really? Interesting. This is the form I would prefer, for the simple
reason that at no point in time are there two copies of the file in
memory. A quick test confirms that a process using this idiom ends up
with slightly more total memory than one that reads the file into a
string first:
$ ruby -r rexml/document -e 'd=File.open("Anwendungsdaten/Skype/shared.xml","rb") {|io| REXML::Document.new io};sleep 10'
-> 4924 kB
$ ruby -r rexml/document -e 'd=REXML::Document.new(File.read("Anwendungsdaten/Skype/shared.xml"));sleep 10'
-> 4876 kB
$ du -k Anwendungsdaten/Skype/shared.xml
28 Anwendungsdaten/Skype/shared.xml
This is Ruby 1.8.5 on Cygwin on Win XP SP2. I'd probably still stick
with the former approach, since it seems more reasonable to let the
parser read from the IO rather than from a string, and the difference
(48 kB here) is small.
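For anyone who wants to try both idioms side by side, here is a small
sketch. The helper names (parse_from_io, parse_from_string,
with_sample_file) and the SAMPLE document are made up for illustration;
the file in my test above was a 28 kB Skype config, not this snippet.
It assumes the rexml library is installed, which on recent Rubies means
the bundled gem.

```ruby
require 'rexml/document'
require 'tempfile'

# Hypothetical tiny document used only for this comparison.
SAMPLE = '<config><lib name="a"/><lib name="b"/></config>'

# Idiom 1: hand the open IO to REXML and let the parser do the reading.
def parse_from_io(path)
  File.open(path, 'rb') { |io| REXML::Document.new(io) }
end

# Idiom 2: read the whole file into a String first, then parse it.
def parse_from_string(path)
  REXML::Document.new(File.read(path))
end

# Writes SAMPLE to a temp file and yields its path to the block.
def with_sample_file
  Tempfile.create(['sample', '.xml']) do |tmp|
    tmp.write(SAMPLE)
    tmp.flush
    yield tmp.path
  end
end

with_sample_file do |path|
  from_io  = parse_from_io(path)
  from_str = parse_from_string(path)
  puts from_io.root.name                # root element name via the IO idiom
  puts from_str.root.elements.size      # child element count via the String idiom
end
```

Both calls build the complete document before returning, so closing
the file right after REXML::Document.new is safe. Which idiom uses less
memory has to be measured per platform and Ruby version.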
> This is on win xp sp2.
>
> You can find some tools on the net that show what consumes the
> memory, but most of them are in the hacks category (no offense!).
> On Windows there is the Ruby Memory Validator, which does a
> similar job.
Thanks for the hint.
Cheers
robert