
comp.lang.python

Please help with MemoryError

Jeremy

2/11/2010 11:39:00 PM

I have been using Python for several years now and have never run into
memory errors…

until now.

My Python program now consumes over 2 GB of memory and then I get a
MemoryError. I know I am reading lots of files into memory, but not
2GB worth. I thought I didn't have to worry about memory allocation
in Python because of the garbage collector. On this note I have a few
questions. FYI I am using Python 2.6.4 on my Mac.

1. When I pass a variable to the constructor of a class does it
copy that variable or is it just a reference/pointer? I was under the
impression that it was just a pointer to the data.
2. When do I need to manually allocate/deallocate memory and when
can I trust Python to take care of it?
3. Any good practice suggestions?

Thanks,
Jeremy
34 Answers

Alf P. Steinbach

2/12/2010 12:04:00 AM


* Jeremy:
> I have been using Python for several years now and have never run into
> memory errors…
>
> until now.
>
> My Python program now consumes over 2 GB of memory and then I get a
> MemoryError. I know I am reading lots of files into memory, but not
> 2GB worth. I thought I didn't have to worry about memory allocation
> in Python because of the garbage collector. On this note I have a few
> questions. FYI I am using Python 2.6.4 on my Mac.
>
> 1. When I pass a variable to the constructor of a class does it
> copy that variable or is it just a reference/pointer? I was under the
> impression that it was just a pointer to the data.

Uhm, I discovered that "pointer" is apparently a Loaded Word in the Python
community. At least in some sub-community, so, best avoided. But essentially
you're just passing a reference to an object. The object is not copied.
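A quick sketch to see this for yourself (the Holder class here is made up for illustration; `is` and id() confirm no copy happens):

```python
class Holder(object):
    def __init__(self, data):
        # No copy is made; self.data is just another reference
        # to the very same object the caller passed in.
        self.data = data

big = list(range(1000))
h = Holder(big)

# Both names refer to one and the same object:
print(h.data is big)          # True
print(id(h.data) == id(big))  # True

# So a mutation through one reference is visible through the other.
big.append(-1)
print(h.data[-1])             # -1
```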


> 2. When do I need to manually allocate/deallocate memory and when
> can I trust Python to take care of it?

Python takes care of deallocation of objects that are /no longer referenced/.


> 3. Any good practice suggestions?

You need to get rid of references to objects before Python will garbage collect
them.

Typically, in a language like Python (or Java, C#...) memory leaks are caused by
keeping object references in singletons or globals, e.g. for purposes of event
notifications. For example, you may have some dictionary somewhere.

Such references from singletons/globals need to be removed.

You do not, however, need to be concerned about circular references, at least
unless you need some immediate deallocation.

For although circular references will prevent the objects involved from being
immediately deallocated, the general garbage collector will take care of them later.
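You can watch the cycle collector do its job with the gc and weakref modules (the Node class is a made-up example; the weakref is only there so we can observe deallocation without keeping the object alive):

```python
import gc
import weakref

class Node(object):
    pass

a = Node()
b = Node()
a.partner = b
b.partner = a           # a reference cycle: a -> b -> a

probe = weakref.ref(a)  # observes a without keeping it alive

del a, b                # no names refer to the cycle any more...
# ...but reference counting alone can't reclaim it, because each
# object still holds a reference to the other. The cycle collector can:
gc.collect()
print(probe() is None)  # True: the cycle was collected
```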



Cheers & hth.,

- Alf

Jonathan Gardner

2/12/2010 12:37:00 AM


On Feb 11, 3:39 pm, Jeremy <jlcon...@gmail.com> wrote:
> I have been using Python for several years now and have never run into
> memory errors…
>
> until now.
>

Yes, Python does a good job of making memory errors the least of your
worries as a programmer. Maybe it's doing too good of a job...

> My Python program now consumes over 2 GB of memory and then I get a
> MemoryError.  I know I am reading lots of files into memory, but not
> 2GB worth.

Do a quick calculation: How much are you leaving around after you read in
a file? Do you create an object for each line? What does that object
have associated with it? You may find that you have some strange
O(N^2) behavior regarding memory here. Oftentimes people forget that
you have to evaluate how your algorithm will run in time *and* memory.

> I thought I didn't have to worry about memory allocation
> in Python because of the garbage collector.

While it's not the #1 concern, you still have to keep track of how you
are using memory and try not to be wasteful. Use good algorithms, let
things fall out of scope, etc...

> 1.    When I pass a variable to the constructor of a class does it
> copy that variable or is it just a reference/pointer?  I was under the
> impression that it was just a pointer to the data.

For objects, until you make a copy, there is no copy made. That's the
general rule and even though it isn't always correct, it is correct
enough.

> 2.    When do I need to manually allocate/deallocate memory and when
> can I trust Python to take care of it?

Let things fall out of scope. If you're concerned, use del. Try to
avoid using the global namespace for everything, and try to keep your
lists and dicts small.

> 3.    Any good practice suggestions?
>

Don't read in the entire file and then process it. Try to do line-by-
line processing.
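A sketch of the difference (the file name and contents here are made up for illustration; both functions compute the same answer):

```python
import os
import tempfile

# Hypothetical input: a file with one integer per line.
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
with open(path, "w") as f:
    f.write("\n".join(str(n) for n in range(1000)) + "\n")

# Memory-hungry: readlines() pulls every line into a list at once.
def total_all_at_once(path):
    with open(path) as f:
        lines = f.readlines()      # whole file resident in memory
    return sum(int(line) for line in lines)

# Frugal: a file object is its own iterator, so only one line
# needs to be live at a time.
def total_line_by_line(path):
    total = 0
    with open(path) as f:
        for line in f:
            total += int(line)
    return total

print(total_line_by_line(path))    # 499500 -- same answer, less memory
```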

Figure out what your algorithm is doing in terms of time *and* memory.
You likely have some O(N^2) or worse in memory usage.

Don't use Python variables to store data long-term. Instead, set up a
database or a file and use that. I'd first look at using a file, then
using SQLite, and then a full-fledged database like PostgreSQL.
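For the SQLite step, the standard library already ships sqlite3. A minimal sketch (the table and rows are invented for illustration; ":memory:" is used here only so the example is self-contained -- in practice you'd pass a filename so the data lives on disk instead of in Python's heap):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use e.g. "cache.db" for real work
conn.execute("CREATE TABLE lines (n INTEGER, text TEXT)")

rows = [(1, "alpha"), (2, "beta"), (3, "gamma")]
conn.executemany("INSERT INTO lines VALUES (?, ?)", rows)
conn.commit()

# Query on demand instead of holding a giant dict in memory:
(count,) = conn.execute("SELECT COUNT(*) FROM lines").fetchone()
print(count)  # 3
conn.close()
```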

Don't write processes that sit around for a long time unless you also
evaluate whether that process grows in size as it runs. If it does,
you need to figure out why and stop that memory leak.

Simpler code uses less memory. Not just because it is smaller, but
because you are not copying and moving data all over the place. See
what you can do to simplify your code. Maybe you'll expose the nasty
O(N^2) behavior.

Steven D'Aprano

2/12/2010 1:50:00 AM


On Thu, 11 Feb 2010 15:39:09 -0800, Jeremy wrote:

> My Python program now consumes over 2 GB of memory and then I get a
> MemoryError. I know I am reading lots of files into memory, but not 2GB
> worth.

Are you sure?

Keep in mind that Python has a comparatively high overhead due to its
object-oriented nature. If you have a list of characters:

['a', 'b', 'c', 'd']

there is the (small) overhead of the list structure itself, but each
individual character is not a single byte, but a relatively large object:

>>> sys.getsizeof('a')
32

So if you read (say) a 500MB file into a single giant string, you will
have 500MB plus the overhead of a single string object (which is
negligible). But if you read it into a list of 500 million single
characters, you will have the overhead of a single list, plus 500 million
strings, and that's *not* negligible: 32 bytes each instead of 1.

So try to avoid breaking a single huge string into vast numbers of tiny
strings all at once.
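You can measure the difference with sys.getsizeof (exact numbers vary by Python version and platform, so take the printed sizes as illustrative only):

```python
import sys

text = "x" * 1000            # one 1000-character string
chars = list(text)           # 1000 one-character strings

one_big = sys.getsizeof(text)
many_small = sys.getsizeof(chars) + sum(sys.getsizeof(c) for c in chars)

print(one_big)     # roughly 1000 bytes plus a small fixed header
print(many_small)  # tens of kilobytes: per-object overhead dominates
```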



> I thought I didn't have to worry about memory allocation in
> Python because of the garbage collector.

You don't have to worry about explicitly allocating memory, and you
almost never have to worry about explicitly freeing memory (unless you
are making objects that, directly or indirectly, contain themselves --
see below); but unless you have an infinite amount of RAM available of
course you can run out of memory if you use it all up :)


> On this note I have a few
> questions. FYI I am using Python 2.6.4 on my Mac.
>
> 1. When I pass a variable to the constructor of a class does it copy
> that variable or is it just a reference/pointer? I was under the
> impression that it was just a pointer to the data.

Python's calling model is the same whether you pass to a class
constructor or any other function or method:

x = ["some", "data"]
obj = f(x)

The function f (which might be a class constructor) sees the exact same
list as you assigned to x -- the list is not copied first. However,
there's no promise made about what f does with that list -- it might copy
the list, or make one or more additional lists:

def f(a_list):
    another_copy = a_list[:]
    another_list = map(int, a_list)


> 2. When do I need
> to manually allocate/deallocate memory and when can I trust Python to
> take care of it?

You never need to manually allocate memory.

You *may* need to deallocate memory if you make "reference loops", where
one object refers to itself:

l = [] # make an empty list
l.append(l) # add the list l to itself

Python can break such simple reference loops itself, but for more
complicated ones, you may need to break them yourself:

a = []
b = {2: a}
c = (None, b)
d = [1, 'z', c]
a.append(d) # a reference loop

Python will deallocate objects when they are no longer in use. They are
always considered in use any time you have them assigned to a name, or in
a list or dict or other structure which is in use.

You can explicitly remove a name with the del command. For example:

x = ['my', 'data']
del x

After deleting the name x, the list object itself is no longer in use
anywhere and Python will deallocate it. But consider:

x = ['my', 'data']
y = x # y now refers to THE SAME list object
del x

Although you have deleted the name x, the list object is still bound to
the name y, and so Python will *not* deallocate the list.

Likewise:

x = ['my', 'data']
y = [None, 1, x, 'hello world']
del x

Although now the list isn't bound to a name, it is inside another list,
and so Python will not deallocate it.
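A weak reference lets you watch this happening without itself keeping the object alive (the Payload class is invented for illustration; the immediacy of the final deallocation is a CPython reference-counting detail):

```python
import weakref

class Payload(object):
    pass

x = Payload()
probe = weakref.ref(x)        # observes the object, doesn't keep it alive

y = x                         # a second reference to the same object
del x
alive = probe() is not None
print(alive)                  # True: y still keeps the object in use

del y
print(probe() is None)        # True: last reference gone, deallocated
```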



> 3. Any good practice suggestions?

Write small functions. Any temporary objects created by the function will
be automatically deallocated when the function returns.

Avoid global variables. They are a good way to inadvertently end up with
multiple long-lasting copies of data.

Try to keep data in one big piece rather than lots of little pieces.

But contradicting the above, if the one big piece is too big, it will be
hard for the operating system to swap it in and out of virtual memory,
causing thrashing, which is *really* slow. So aim for big, but not huge.

(By "big" I mean megabyte-sized; by "huge" I mean hundreds of megabytes.)

If possible, avoid reading the entire file in at once, and instead
process it line-by-line.


Hope this helps,



--
Steven

Tim Chase

2/12/2010 2:37:00 AM


Jonathan Gardner wrote:
> Don't use Python variables to store data long-term. Instead, set up a
> database or a file and use that. I'd first look at using a file, then
> using SQLite, and then a full-fledged database like PostgreSQL.

Just to add to the mix, I'd put the "anydbm" module on the
gradient between "using a file" and "using sqlite". It's a nice
intermediate step between rolling your own file formats for data
on disk, and having to write SQL since access is entirely like
you'd do with a regular Python dictionary.
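A minimal sketch of that usage (in Python 3 the module was renamed to plain "dbm"; the file name here is invented, and note the strings-only restriction means bytes in Python 3):

```python
import dbm    # this is "anydbm" in Python 2
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cache")

d = dbm.open(path, "c")      # "c": create the file if it doesn't exist
d[b"greeting"] = b"hello"    # keys and values must be strings/bytes
d.close()

d = dbm.open(path, "r")
value = d[b"greeting"]
print(value)                 # b'hello'
d.close()
```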

-tkc


aahz

2/12/2010 4:58:00 AM


In article <mailman.2418.1265942230.28905.python-list@python.org>,
Tim Chase <python.list@tim.thechases.com> wrote:
>
>Just to add to the mix, I'd put the "anydbm" module on the gradient
>between "using a file" and "using sqlite". It's a nice intermediate
>step between rolling your own file formats for data on disk, and having
>to write SQL since access is entirely like you'd do with a regular
>Python dictionary.

Not quite. One critical difference between dbm and dicts is the need to
remember to "save" changes by setting the key's value again.
--
Aahz (aahz@pythoncraft.com) <*> http://www.python...

"At Resolver we've found it useful to short-circuit any doubt and just
refer to comments in code as 'lies'. :-)"

Tim Chase

2/12/2010 12:16:00 PM


Aahz wrote:
> Tim Chase <python.list@tim.thechases.com> wrote:
>> Just to add to the mix, I'd put the "anydbm" module on the gradient
>> between "using a file" and "using sqlite". It's a nice intermediate
>> step between rolling your own file formats for data on disk, and having
>> to write SQL since access is entirely like you'd do with a regular
>> Python dictionary.
>
> Not quite. One critical difference between dbm and dicts is the need to
> remember to "save" changes by setting the key's value again.

Could you give an example of this? I'm not sure I understand
what you're saying. I've used anydbm a bunch of times and other
than wrapping access in

d = anydbm.open(DB_NAME, "c")
# use d as a dict here
d.close()

and I've never hit any "need to remember to save changes by
setting the key's value again". The only gotcha I've hit is the
anydbm requirement that all keys/values be strings. Slightly
annoying at times, but strings are my most frequent use case anyway.

-tkc



aahz

2/12/2010 2:21:00 PM


In article <mailman.2426.1265976954.28905.python-list@python.org>,
Tim Chase <python.list@tim.thechases.com> wrote:
>Aahz wrote:
>> Tim Chase <python.list@tim.thechases.com> wrote:
>>>
>>> Just to add to the mix, I'd put the "anydbm" module on the gradient
>>> between "using a file" and "using sqlite". It's a nice intermediate
>>> step between rolling your own file formats for data on disk, and having
>>> to write SQL since access is entirely like you'd do with a regular
>>> Python dictionary.
>>
>> Not quite. One critical difference between dbm and dicts is the need to
>> remember to "save" changes by setting the key's value again.
>
>Could you give an example of this? I'm not sure I understand
>what you're saying. I've used anydbm a bunch of times and other
>than wrapping access in
>
> d = anydbm.open(DB_NAME, "c")
> # use d as a dict here
> d.close()
>
>and I've never hit any "need to remember to save changes by
>setting the key's value again". The only gotcha I've hit is the
>anydbm requirement that all keys/values be strings. Slightly
>annoying at times, but strings are my most frequent use case anyway.

Well, you're more likely to hit this by wrapping dbm with shelve (because
it's a little more obvious when you're using pickle directly), but
consider this:

d = anydbm.open(DB_NAME, "c")
x = MyClass()
d['foo'] = x
x.bar = 123

Your dbm does NOT have the change to x.bar recorded; you must do this
again:

d['foo'] = x

With a dict, you have Python's reference semantics.
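The shelve module's own mitigation for this gotcha is the writeback flag: it caches fetched objects and writes them all back when you sync or close, at the cost of extra memory. A sketch (file name invented; list values used instead of a class to keep the example self-contained):

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "store")

# Default behaviour: mutating a fetched value is silently lost.
s = shelve.open(path)
s["nums"] = [1, 2]
s["nums"].append(3)            # appends to a temporary copy, never saved
s.close()

s = shelve.open(path)
first = s["nums"]
print(first)                   # [1, 2] -- the append vanished
s.close()

# writeback=True caches live objects and flushes them on close:
s = shelve.open(path, writeback=True)
s["nums"].append(3)
s.close()

s = shelve.open(path)
final = s["nums"]
print(final)                   # [1, 2, 3]
s.close()
```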
--
Aahz (aahz@pythoncraft.com) <*> http://www.python...

"At Resolver we've found it useful to short-circuit any doubt and just
refer to comments in code as 'lies'. :-)"

Jeremy

2/12/2010 2:46:00 PM


On Feb 11, 6:50 pm, Steven D'Aprano <st...@REMOVE-THIS-
cybersource.com.au> wrote:
> [snip full explanation quoted above]

Wow, what a great bunch of responses. Thank you very much. If I
understand correctly the suggestions seem to be:
1. Write algorithms to read a file one line at a time instead of
reading the whole thing
2. Use lots of little functions so that memory can fall out of
scope.

You also confirmed what I thought was true that all variables are
passed "by reference" so I don't need to worry about the data being
copied (unless I do that explicitly).

Thanks!
Jeremy

Tim Chase

2/12/2010 2:54:00 PM


Aahz wrote:
>>> Not quite. One critical difference between dbm and dicts
>>> is the need to remember to "save" changes by setting the
>>> key's valud again.
>>
>> Could you give an example of this? I'm not sure I
>> understand what you're saying.
>
> Well, you're more likely to hit this by wrapping dbm with shelve (because
> it's a little more obvious when you're using pickle directly), but
> consider this:
>
> d = anydbm.open(DB_NAME, "c")
> x = MyClass()
> d['foo'] = x
> x.bar = 123
>
> Your dbm does NOT have the change to x.bar recorded, you must do this
> again:
>
> d['foo'] = x
>
> With a dict, you have Python's reference semantics.

Ah, that makes sense...fallout of the "dbm only does string
keys/values". I try to adhere to the "only use strings" rule, so I'm
more cognizant of when I marshal complex data-types in or out of
those strings. But I can see where it could bite a person.

Thanks,

-tkc




Steven D'Aprano

2/12/2010 5:15:00 PM


On Fri, 12 Feb 2010 06:45:31 -0800, Jeremy wrote:

> You also confirmed what I thought was true that all variables are passed
> "by reference" so I don't need to worry about the data being copied
> (unless I do that explicitly).

No, but yes.

No, variables are not passed by reference, but yes, you don't have to
worry about them being copied.

You have probably been mislead into thinking that there are only two
calling conventions possible, "pass by value" and "pass by reference".
That is incorrect. There are many different calling conventions, and
different groups use the same names to mean radically different things.

If a language passes variables by reference, you can write a "swap"
function like this:

def swap(a, b):
    a, b = b, a

x = 1
y = 2
swap(x, y)
assert (x == 2) and (y==1)

But this does not work in Python, and cannot work without trickery. So
Python absolutely is not "pass by reference".

On the other hand, if a variable is passed by value, then a copy is made
and you can do this:

def append1(alist):
    alist.append(1) # modify the copy
    return alist

x = []
newlist = append1(x)
assert x == [] # The old value still exists.

But this also doesn't work in Python! So Python isn't "pass by value"
either.

What Python does is called "pass by sharing", or sometimes "pass by
object reference". It is exactly the same as what (e.g.) Ruby and Java
do, except that confusingly the Ruby people call it "pass by reference"
and the Java people call it "pass by value", thus guaranteeing the
maximum amount of confusion possible.
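The two halves of "pass by sharing" can be demonstrated in a few lines (function names here are made up for illustration):

```python
def rebind(a_list):
    a_list = [99]        # rebinds the local name only: caller unaffected

def mutate(a_list):
    a_list.append(99)    # mutates the shared object: caller sees it

x = [1]
rebind(x)
print(x)                 # [1]     -- so not pass by reference
mutate(x)
print(x)                 # [1, 99] -- so not pass by value either
```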


More here:
http://effbot.org/zone/call-by-...
http://en.wikipedia.org/wiki/Evaluatio...




--
Steven