Measuring the memory usage of a Pandas DataFrame
How much memory is your Pandas DataFrame or Series using? Pandas provides an API for measuring this information, but a variety of implementation details mean the results can be confusing or misleading.
Consider the following example:
>>> import pandas as pd
>>> series = pd.Series(["abcdefhjiklmnopqrstuvwxyz" * 10
...                     for i in range(1_000_000)])
>>> series.memory_usage()
8000128
>>> series.memory_usage(deep=True)
307000128
So which is correct: is memory usage 8MB or 300MB? Neither!
In this special case, it’s actually 67MB, at least with the default Python interpreter. This is partially because I cheated, and often 300MB will actually be closer to the truth.
What’s going on? Let’s find out!
The easy case: numbers and other fixed-size objects
Most Pandas columns are stored as NumPy arrays, and for types like integers or floats the values are stored inside the array itself. For example, if you have an array with 1,000,000 64-bit integers, each integer will always use 8 bytes of memory. The array in total will therefore use 8,000,000 bytes of RAM, plus some minor bookkeeping overhead:
>>> import numpy as np
>>> series = pd.Series([123] * 1_000_000, dtype=np.int64)
>>> series.memory_usage()
8000128
>>> series.memory_usage(deep=True)
8000128
We’ll get to the deep option in just a little bit, but notice it makes no difference in this case.
For any type that has a fixed size in memory–integers, floats, categoricals, and so on–both memory_usage() variants should give the same answer, and a pretty accurate one.
The hard case: arbitrarily-sized objects
Different Python strings use different amounts of memory: the string "abc" will use far less memory than a string containing the complete works of William Shakespeare.
More generally, storing arbitrary Python objects requires arbitrary amounts of memory.
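You can see this directly with sys.getsizeof(), which reports the footprint of an individual object (a minimal sketch; exact byte counts vary slightly by Python version and build):

```python
import sys

# A short string: small fixed header plus the characters themselves.
short = "abc"

# A much longer string: same header, many more character bytes.
long_ = "abc" * 10_000

# The footprint grows with the string's length.
assert sys.getsizeof(long_) > sys.getsizeof(short)
print(sys.getsizeof(short), sys.getsizeof(long_))
```

For ASCII text in CPython, the per-string cost is roughly a fixed header plus one byte per character.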
So how can these differently sized strings be represented in NumPy, and by extension in Pandas?
Instead of storing the actual strings, NumPy stores an array of pointers to those objects; each pointer takes 8 bytes on modern computers.
The pointers point to an address in memory where the string is actually stored.
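One consequence of the pointer representation: the shallow measurement depends only on how many strings the column holds, not on how long they are. A sketch, using object dtype explicitly so the behavior doesn't depend on your pandas version's default string storage:

```python
import pandas as pd

# Two columns with the same number of strings,
# but very different string lengths.
short_strings = pd.Series(["a"] * 100_000, dtype=object)
long_strings = pd.Series(["a" * 1_000] * 100_000, dtype=object)

# Shallow measurement counts only the 8-byte pointers in the array,
# so both report the same size.
assert short_strings.memory_usage() == long_strings.memory_usage()

# Deep measurement follows the pointers to the string objects,
# so the longer strings show up.
assert long_strings.memory_usage(deep=True) > short_strings.memory_usage(deep=True)
```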
And that brings us to the deep option.
By default, Pandas returns the memory used just by the NumPy array it’s using to store the data.
For strings, this is just 8 multiplied by the number of strings in the column, since NumPy is just storing 64-bit pointers.
However, that’s not all the memory being used: there’s also the memory being used by the strings themselves.
With deep=False, the memory used by the strings is not counted; with deep=True, it is.
We can use sys.getsizeof() to get the memory usage of an individual object, and we can use this to verify what deep=True is measuring:
>>> import sys
>>> import pandas as pd
>>> series = pd.Series(["abcdefhjiklmnopqrstuvwxyz" * 10
...                     for i in range(1_000_000)])
>>> series.memory_usage()
8000128
>>> series.memory_usage(deep=True)
307000128
>>> sum([sys.getsizeof(s) for s in series]) + series.memory_usage()
317000128
Not exactly the same, but close enough.
In general, you’ll want to use memory_usage(deep=True), since it will give you a more accurate answer.
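The same applies to a whole DataFrame: memory_usage(deep=True) returns a per-column breakdown (plus an entry for the index), which you can sum for a total. A sketch with made-up column names, again pinning object dtype for the string column:

```python
import pandas as pd

# A hypothetical DataFrame mixing a fixed-size and an object column;
# the column names here are just for illustration.
df = pd.DataFrame({
    "id": pd.Series(range(100_000)),                           # int64
    "label": pd.Series(["some-label"] * 100_000, dtype=object),
})

# Per-column byte counts, plus an "Index" entry for the index itself.
print(df.memory_usage(deep=True))

# Sum for a total that includes the strings' contents.
total = int(df.memory_usage(deep=True).sum())

# The shallow total undercounts the object column.
assert total > int(df.memory_usage(deep=False).sum())
```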
Strings are special
We have determined that memory_usage(deep=True) is the correct number, so we’re done, right?
Not quite yet.
Let’s try measuring the memory usage of the above code with the Fil memory profiler, to see what was actually allocated, rather than Pandas’ estimate.
A bunch of memory is used just by importing Pandas and its dependencies, but if we focus on the memory usage of creating the series, on the right, we see it’s using… 67MB?
Shouldn’t it be using 317MB?
Some of the memory is temporary objects that get deallocated, like the initial list that gets converted into a NumPy array, so the actual pd.Series itself is even smaller.
What’s going on? It turns out the default implementation of Python has a memory optimization for strings. If you repeatedly create the same string multiple times, Python will sometimes cache–or “intern”–it in memory and reuse it for later string objects. This works fine given that string objects are immutable.
>>> s = "abcdefghijklmnopqrstuvwxyz" * 10
>>> s2 = "abcdefghijklmnopqrstuvwxyz" * 10
>>> s is s2
True
In our original example, we created the same string a million times, so Python was smart enough to store it only once, saving lots of memory. Python won’t do this for all strings, however, and the rules vary by version of Python.
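If you do want guaranteed deduplication of identical strings, you can request interning explicitly with sys.intern() rather than relying on the version-dependent automatic rules. A sketch:

```python
import sys

# Build equal strings at runtime so the compiler can't fold them
# into a single shared constant.
a = "".join(["abcdefghijklmnopqrstuvwxyz"] * 10)
b = "".join(["abcdefghijklmnopqrstuvwxyz"] * 10)

# Equal contents, but in general two separate objects in memory.
assert a == b

# sys.intern guarantees one shared object per distinct string value.
assert sys.intern(a) is sys.intern(b)
```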
In practice, you shouldn’t rely on this optimization: if you are storing only a fixed number of strings, consider using a categorical dtype to save even more memory.
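A quick comparison of object-dtype strings versus a categorical (a sketch; exact savings depend on how many distinct values you have):

```python
import pandas as pd

# Three distinct values repeated many times: a good fit for categoricals.
labels = pd.Series(["red", "green", "blue"] * 100_000, dtype=object)

# A categorical stores each distinct string once, plus a small
# integer code per row.
as_category = labels.astype("category")

# The categorical version uses far less memory.
assert as_category.memory_usage(deep=True) < labels.memory_usage(deep=True)
print(labels.memory_usage(deep=True), as_category.memory_usage(deep=True))
```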
Learn even more techniques for reducing memory usage—read the rest of the Larger-than-memory datasets guide for Python.