Too many objects: Reducing memory overhead from Python instances

Every time you create an instance of a class in Python, you use up some memory, including overhead that might actually be larger than the data you care about. Create a million objects, and you have a million times the overhead.

And that overhead can add up, either preventing you from running your program, or increasing the amount of money you spend on provisioning hardware.

So let’s see how big that overhead really is (sneak preview: it’s large!) and what you can do about it.

Pay no attention to the dictionary behind the curtain

In Python, behind the scenes every instance of a normal class stores its attributes in a dictionary.

Thus memory usage for a normal object has three sources:

  1. The normal overhead of any Python object, be it instance, integer, or what have you, plus the overhead of an empty dictionary.
  2. The overhead of storing entries in the dictionary.
  3. The actual data being added as attributes.

For example:

from random import random

class Point:
    def __init__(self, x):
        self.x = x

objects = []
for _ in range(1000000):
    r = random()
    point = Point(r)
    objects.append(point)

We can visualize the peak memory use, and see the memory usage of those three categories, plus a fourth additional one:

  1. Point objects in general: 30% of memory.
  2. Adding an attribute to Point’s dictionary: 55% of memory.
  3. The floating point numbers: 11% of memory.
  4. The list storing the Point objects: 4% of memory.

Basically, memory usage is at least 10x as high as the actual information we care about (item 3, the random floating point numbers).
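To get a feel for where that overhead lives, here's a rough sketch using the standard library's sys.getsizeof. Exact numbers vary across Python versions and platforms, and getsizeof doesn't count objects referenced by the one you measure, but the proportions are telling:

```python
import sys

class Point:
    def __init__(self, x):
        self.x = x

p = Point(1.0)

# The instance itself, not counting its attribute dictionary:
print(sys.getsizeof(p))
# The per-instance dictionary holding {"x": 1.0}:
print(sys.getsizeof(p.__dict__))
# The float we actually care about is small by comparison:
print(sys.getsizeof(p.x))
```

Multiply the first two numbers by a million, and you can see how instance overhead comes to dominate the 8 bytes of actual information in each float.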

Solution #1: Goodbye, dictionary!

Having a dictionary for every object makes sense if you want to add arbitrary attributes to any given object. Most of the time we don’t want to do that: there are a certain set of attributes we know a class will have, and that’s it.

Enter __slots__. By setting this attribute on a class to a sequence of strings naming the allowed attributes:

  1. Only those attributes will be allowed.
  2. More importantly for our purposes, Python won’t create a dictionary for every object.

All we have to do is add one line of code:

from random import random

class Point:
    __slots__ = ["x"]  # <-- allowed attributes
    def __init__(self, x):
        self.x = x

objects = []
for _ in range(1000000):
    r = random()
    point = Point(r)
    objects.append(point)

Now, we can measure memory use:

The overhead of the per-object dictionary is now gone, and memory usage has been reduced by 60%, from 207MB to 86MB. Not bad for one line of code!
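The difference is easy to confirm directly. In this sketch (again, sizes vary by Python version), a slotted instance has no __dict__ at all, and setting an attribute not listed in __slots__ fails:

```python
import sys

class PlainPoint:
    def __init__(self, x):
        self.x = x

class SlottedPoint:
    __slots__ = ["x"]  # <-- allowed attributes
    def __init__(self, x):
        self.x = x

plain = PlainPoint(1.0)
slotted = SlottedPoint(1.0)

# Slotted instances have no per-instance dictionary at all:
print(hasattr(slotted, "__dict__"))  # False

# So the per-instance footprint is smaller:
print(sys.getsizeof(plain) + sys.getsizeof(plain.__dict__))
print(sys.getsizeof(slotted))

# The trade-off: only attributes listed in __slots__ are allowed.
try:
    slotted.y = 2.0
except AttributeError as e:
    print(e)
```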

Solution #2: Get rid of objects

Another approach to the problem is to note that storing a list of a million identical objects is rather wasteful, especially if operations will happen on groups of objects. So instead of creating an object per point, why not just create a list per attribute?

from random import random

points = {
    "x": [],
    # "y": [],
    # "z": [],
    # etc.
}

for _ in range(1000000):
    r = random()
    points["x"].append(r)
Memory usage is now reduced to 30MB, down 85% from the original 207MB:
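A nice side benefit of this columnar layout is that operations on "all the points" become simple operations on a list. A minimal sketch:

```python
from random import random

# One list per attribute, instead of one object per point:
points = {"x": [random() for _ in range(1_000_000)]}

# Whole-dataset operations are just list operations:
mean_x = sum(points["x"]) / len(points["x"])
print(mean_x)
```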

Bonus, even-better solution: Pandas or NumPy

At this point most of the remaining memory is due to the overhead of having a Python object per floating point number.

So you can reduce memory usage even further, to about 8MB, by using a Pandas DataFrame to store the information: it will use NumPy arrays to efficiently store the numbers internally.
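Here's a minimal sketch of why: a NumPy float64 array (which is what a Pandas DataFrame uses internally for a numeric column) stores a million numbers in one contiguous buffer, with no per-number Python object:

```python
import numpy as np

# One float64 array, not a million Python float objects:
xs = np.random.random(1_000_000)

# 8 bytes per number, and essentially nothing else:
print(xs.nbytes)  # 8000000 bytes, i.e. about 8MB
```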

If you’re not interested in using Pandas, there’s also a NumPy feature called “structured arrays” that lets you give names to fields.
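A short sketch of a structured array, assuming (for illustration) two float64 fields named "x" and "y":

```python
import numpy as np

# Each element is a record with two named float64 fields:
points = np.zeros(1_000_000, dtype=[("x", "f8"), ("y", "f8")])
points["x"] = np.random.random(1_000_000)

# Fields are accessed by name, like object attributes:
print(points["x"].mean())
print(points.nbytes)  # 16000000: 16 bytes per point, no per-point objects
```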

Note: Whether or not any particular tool or technique will help depends on where the actual memory bottlenecks are in your software.

Need to identify the memory and performance bottlenecks in your own Python data processing code? Try the Sciagraph profiler, with support for profiling in both development and production, on macOS and Linux, and with built-in Jupyter support.

A memory profile created by Sciagraph, showing a list comprehension is responsible for most memory usage
A performance timeline created by Sciagraph, showing both CPU and I/O as bottlenecks

Other approaches

In general, storing too many Python objects at once will waste memory. As always, solutions can involve compression, batching, or indexing:

  • The solutions I’ve covered in this article focus on compression: the same information stored with less overhead.
  • If you don’t need to store all the data in memory at once, you can process data in batches, for example by returning data via a generator.
  • Finally, you can try to load only the data you actually care about by using indexing.
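As an example of the batching approach, here's a hedged sketch of a generator that yields fixed-size batches, so only one batch needs to live in memory at a time (the names are illustrative, not from any particular library):

```python
from random import random

def generate_batches(total, batch_size=10_000):
    """Yield lists of random values in fixed-size batches."""
    batch = []
    for _ in range(total):
        batch.append(random())
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch, if any
        yield batch

count = 0
for batch in generate_batches(100_000):
    count += len(batch)  # process the batch, then let it be garbage-collected
print(count)  # 100000
```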

Learn even more techniques for reducing memory usage—read the rest of the Larger-than-memory datasets guide for Python.