Measuring the memory usage of a Pandas DataFrame
How much memory is your Pandas DataFrame or Series using? Pandas provides an API for measuring this information, but a variety of implementation details mean the results can be confusing or misleading.
Consider the following example:
>>> import pandas as pd
>>> series = pd.Series(["abcdefhjiklmnopqrstuvwxyz" * 10
...                     for i in range(1_000_000)])
>>> series.memory_usage()
8000128
>>> series.memory_usage(deep=True)
307000128
So which is correct: is memory usage 8MB or 300MB? Neither!
In this special case, it’s actually 67MB, at least with the default Python interpreter. This is partially because I cheated, and often 300MB will actually be closer to the truth.
What’s going on? Let’s find out!
The easy case: numbers and other fixed-size objects
Most Pandas columns are stored as NumPy arrays, and for types like integers or floats the values are stored inside the array itself. For example, if you have an array with 1,000,000 64-bit integers, each integer will always use 8 bytes of memory. The array in total will therefore use 8,000,000 bytes of RAM, plus some minor bookkeeping overhead:
>>> import numpy as np
>>> series = pd.Series([123] * 1_000_000, dtype=np.int64)
>>> series.memory_usage()
8000128
>>> series.memory_usage(deep=True)
8000128
We’ll get to the deep option in just a little bit, but notice it makes no difference in this case.
For any type that has a fixed size in memory–integers, floats, categoricals, and so on–both memory_usage() variants should give the same answer, and a pretty accurate one.
The hard case: arbitrarily-sized objects
Different Python strings use different amounts of memory: the string "abc" will use far less memory than a string containing the complete works of William Shakespeare.
More generally, storing arbitrary Python objects requires arbitrary amounts of memory.
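You can see this directly with sys.getsizeof(), which reports the footprint of an individual object (a minimal sketch; exact byte counts vary slightly by Python version and build):

```python
import sys

# A short string: small fixed header plus the characters themselves.
short = "abc"

# A much longer string: same header, many more character bytes.
long_ = "abc" * 10_000

# The footprint grows with the string's length.
assert sys.getsizeof(long_) > sys.getsizeof(short)
print(sys.getsizeof(short), sys.getsizeof(long_))
```

For ASCII text in CPython, the per-string cost is roughly a fixed header plus one byte per character.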
So how can these differently sized strings be represented in NumPy, and by extension in Pandas?
Instead of storing the actual strings, NumPy stores an array of pointers to those objects; each pointer takes 8 bytes on modern computers.
The pointers point to an address in memory where the string is actually stored.
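One consequence of the pointer representation: the shallow measurement depends only on how many strings the column holds, not on how long they are. A sketch, using object dtype explicitly so the behavior doesn't depend on your pandas version's default string storage:

```python
import pandas as pd

# Two columns with the same number of strings,
# but very different string lengths.
short_strings = pd.Series(["a"] * 100_000, dtype=object)
long_strings = pd.Series(["a" * 1_000] * 100_000, dtype=object)

# Shallow measurement counts only the 8-byte pointers in the array,
# so both report the same size.
assert short_strings.memory_usage() == long_strings.memory_usage()

# Deep measurement follows the pointers to the string objects,
# so the longer strings show up.
assert long_strings.memory_usage(deep=True) > short_strings.memory_usage(deep=True)
```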
And that brings us to the deep option.
By default, Pandas returns the memory used just by the NumPy array it’s using to store the data.
For strings, this is just 8 multiplied by the number of strings in the column, since NumPy is just storing 64-bit pointers.
However, that’s not all the memory being used: there’s also the memory being used by the strings themselves.
With deep=False, the memory used by the strings is not counted; with deep=True, it is.
We can use sys.getsizeof() to get the memory usage of an individual object, and we can use this to verify what deep=True is measuring:
>>> import sys
>>> import pandas as pd
>>> series = pd.Series(["abcdefhjiklmnopqrstuvwxyz" * 10
...                     for i in range(1_000_000)])
>>> series.memory_usage()
8000128
>>> series.memory_usage(deep=True)
307000128
>>> sum([sys.getsizeof(s) for s in series]) + series.memory_usage()
317000128
Not exactly the same, but close enough.
In general, you’ll want to use memory_usage(deep=True), since it will give you a more accurate answer.
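The same applies to a whole DataFrame: memory_usage(deep=True) returns a per-column breakdown (plus an entry for the index), which you can sum for a total. A sketch with made-up column names, again pinning object dtype for the string column:

```python
import pandas as pd

# A hypothetical DataFrame mixing a fixed-size and an object column;
# the column names here are just for illustration.
df = pd.DataFrame({
    "id": pd.Series(range(100_000)),                           # int64
    "label": pd.Series(["some-label"] * 100_000, dtype=object),
})

# Per-column byte counts, plus an "Index" entry for the index itself.
print(df.memory_usage(deep=True))

# Sum for a total that includes the strings' contents.
total = int(df.memory_usage(deep=True).sum())

# The shallow total undercounts the object column.
assert total > int(df.memory_usage(deep=False).sum())
```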
Strings are special
We have determined that memory_usage(deep=True) is the correct number, so we’re done, right?
Not quite yet.
Let’s try measuring the memory usage of the above code with the Fil memory profiler, to see what was actually allocated, rather than Pandas’ estimate.
A bunch of memory is used just by importing Pandas and its dependencies, but if we focus on the memory usage of creating the series, on the right, we see it’s using… 67MB?
Shouldn’t it be using 317MB?
Some of the memory is temporary objects that get deallocated, like the initial list that gets converted into a NumPy array, so the actual pd.Series itself is even smaller.
What’s going on? It turns out the default implementation of Python has a memory optimization for strings. If you repeatedly create the same string multiple times, Python will sometimes cache–or “intern”–it in memory and reuse it for later string objects. This works fine given that string objects are immutable.
>>> s = "abcdefghijklmnopqrstuvwxyz" * 10
>>> s2 = "abcdefghijklmnopqrstuvwxyz" * 10
>>> s is s2
True
In our original example, we created the same string a million times, so Python was smart enough to store it only once, saving lots of memory. Python won’t do this for all strings, however, and the rules vary by version of Python.
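If you do want guaranteed deduplication of identical strings, you can request interning explicitly with sys.intern() rather than relying on the version-dependent automatic rules. A sketch:

```python
import sys

# Build equal strings at runtime so the compiler can't fold them
# into a single shared constant.
a = "".join(["abcdefghijklmnopqrstuvwxyz"] * 10)
b = "".join(["abcdefghijklmnopqrstuvwxyz"] * 10)

# Equal contents, but in general two separate objects in memory.
assert a == b

# sys.intern guarantees one shared object per distinct string value.
assert sys.intern(a) is sys.intern(b)
```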
In practice, you shouldn’t rely on this optimization: if you are storing only a fixed number of strings, consider using a categorical dtype to save even more memory.
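A quick comparison of object-dtype strings versus a categorical (a sketch; exact savings depend on how many distinct values you have):

```python
import pandas as pd

# Three distinct values repeated many times: a good fit for categoricals.
labels = pd.Series(["red", "green", "blue"] * 100_000, dtype=object)

# A categorical stores each distinct string once, plus a small
# integer code per row.
as_category = labels.astype("category")

# The categorical version uses far less memory.
assert as_category.memory_usage(deep=True) < labels.memory_usage(deep=True)
print(labels.memory_usage(deep=True), as_category.memory_usage(deep=True))
```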
Learn even more techniques for reducing memory usage—read the rest of the Larger-than-memory datasets guide for Python.