Measuring the memory usage of a Pandas DataFrame
How much memory is your Pandas DataFrame or Series using? Pandas provides an API for measuring this, but a variety of implementation details mean the results can be confusing or misleading.
Consider the following example:
>>> import pandas as pd
>>> series = pd.Series(["abcdefghijklmnopqrstuvwxyz" * 10
...                     for i in range(1_000_000)])
>>> series.memory_usage()
8000128
>>> series.memory_usage(deep=True)
307000128
Which is correct: is memory usage 8MB or 300MB? Neither!
In this special case, it’s actually 67MB, at least with the default Python interpreter. This is partially because I cheated, and often 300MB will actually be closer to the truth.
What’s going on? Let’s find out!
The easy case: numbers and other fixed-size objects
Most Pandas columns are stored as NumPy arrays, and for types like integers or floats the values are stored inside the array itself. For example, if you have an array with 1,000,000 64-bit integers, each integer will always use 8 bytes of memory. The array in total will therefore use 8,000,000 bytes of RAM, plus some minor bookkeeping overhead:
>>> import numpy as np
>>> series = pd.Series([0] * 1_000_000, dtype=np.int64)
>>> series.memory_usage()
8000128
>>> series.memory_usage(deep=True)
8000128
We’ll get to the deep option in just a little bit, but notice it makes no difference in this case.
For any type that has a fixed size in memory–integers, floats, categoricals, and so on–both memory_usage() variants should give the same answer, and a pretty accurate one.
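As an illustrative sketch of how direct this accounting is (exact totals can vary slightly with the pandas version), the per-value cost tracks the dtype's width: halve the width and you halve the memory for the values.

```python
import numpy as np
import pandas as pd

# Fixed-size dtypes store each value inline in the array itself,
# so memory for the values is simply width-in-bytes times length.
int64_series = pd.Series(np.zeros(1_000_000, dtype=np.int64))
int32_series = pd.Series(np.zeros(1_000_000, dtype=np.int32))

print(int64_series.memory_usage(index=False))  # 8000000 (8 bytes/value)
print(int32_series.memory_usage(index=False))  # 4000000 (4 bytes/value)
```

Passing `index=False` excludes the small, fixed overhead of the RangeIndex, leaving just the values' memory.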
The hard case: arbitrarily-sized objects
Different Python strings use different amounts of memory: the string "abc" will use far less memory than a string containing the complete works of William Shakespeare.
More generally, storing arbitrary Python objects requires arbitrary amounts of memory.
So how to represent these differently sized strings in NumPy, and by extension in Pandas?
Instead of storing the actual strings, NumPy stores an array of pointers to those objects; each pointer takes 8 bytes on modern computers.
The pointers point to an address in memory where the string is actually stored.
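You can see this pointer-only accounting directly in NumPy (a small sketch; the 8-byte pointer size assumes a 64-bit build of Python):

```python
import numpy as np

# An object array stores only pointers; the strings themselves
# live elsewhere on the heap.
short_strings = np.array(["ab", "cd"], dtype=object)
long_strings = np.array(["ab" * 1000, "cd" * 1000], dtype=object)

# Both arrays report the same size: two 8-byte pointers each,
# no matter how long the strings they point at are.
print(short_strings.nbytes, long_strings.nbytes)  # 16 16
```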
And that brings us to the deep argument. By default, Pandas returns the memory used just by the NumPy array it’s using to store the data. For strings, this is just 8 multiplied by the number of strings in the column, since NumPy is just storing 64-bit pointers. However, that’s not all the memory being used: there’s also the memory being used by the strings themselves. With deep=False, the memory used by the strings is not counted; with deep=True, it is.
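We can check that shallow arithmetic directly (a sketch; the small extra beyond 8 bytes per string is the RangeIndex, which memory_usage() includes by default):

```python
import pandas as pd

series = pd.Series(["some string"] * 1_000_000)

# Shallow accounting: the index, plus one 8-byte pointer per string.
shallow = series.memory_usage()
print(shallow)
print(series.index.memory_usage() + 8 * len(series))  # same number
```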
We can use sys.getsizeof() to get the memory usage of an individual object, and we can use this to verify what deep=True is measuring:
>>> import sys
>>> import pandas as pd
>>> series = pd.Series(["abcdefghijklmnopqrstuvwxyz" * 10
...                     for i in range(1_000_000)])
>>> series.memory_usage()
8000128
>>> series.memory_usage(deep=True)
307000128
>>> sum([sys.getsizeof(s) for s in series]) + series.memory_usage()
317000128
Not exactly the same, but close enough.
In general, you’ll want to use memory_usage(deep=True), since it will give you a more accurate answer.
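The same applies to DataFrames, where memory_usage(deep=True) returns a per-column breakdown (a sketch with made-up column names and values):

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(100_000),            # fixed-size int64 column
    "name": ["somebody"] * 100_000,  # object (string) column
})

# One entry per column (plus the index); .sum() totals the frame.
print(df.memory_usage(deep=True))
print(df.memory_usage(deep=True).sum())
```

Only the object column's number changes when you pass deep=True; the fixed-size id column reports the same memory either way.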
Strings are special
We have determined that memory_usage(deep=True) is the correct number, so we’re done, right? Not quite yet.
Let’s try measuring the memory usage of the above code with the Fil memory profiler, to see what was actually allocated, rather than Pandas’ estimate.
A bunch of memory is used just by importing Pandas and its dependencies, but if we focus on the memory allocated while creating the series, we see it’s using… 67MB?
Shouldn’t it be using 317MB?
Some of the memory is temporary objects that get deallocated, like the initial list that gets converted into a NumPy array, so the actual pd.Series itself is even smaller.
What’s going on? It turns out the default Python interpreter has a memory optimization for strings. If you create the same string multiple times, Python will sometimes cache–or “intern”–it in memory and reuse it for later string objects. This works because string objects are immutable.
>>> s = "abcdefghijklmnopqrstuvwxyz" * 10
>>> s2 = "abcdefghijklmnopqrstuvwxyz" * 10
>>> s is s2
True
In our original example, we created the same string a million times, so Python was smart enough to store it only once, saving lots of memory. Python won’t do this for all strings, however, and the rules vary by version of Python.
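One illustration of why you can’t count on this: strings assembled at runtime from a variable generally aren’t interned, so each construction produces a fresh object (a sketch; exact interning behavior varies across CPython versions):

```python
prefix = "abcdefghijklmnopqrstuvwxyz"

# Built at runtime from a variable, so no compile-time
# interning applies here.
s = prefix * 10
s2 = prefix * 10

print(s == s2)   # True: equal contents
print(s is s2)   # False: two separate objects in memory
```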
In practice, you shouldn’t rely on this optimization: if you are storing only a fixed number of strings, consider using a categorical dtype to save even more memory.
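For example (a sketch; exact savings depend on the pandas version and your strings), converting a low-cardinality string column to a categorical stores each distinct string once, plus a small integer code per row:

```python
import pandas as pd

values = ["red", "green", "blue"] * 300_000

as_object = pd.Series(values)
as_category = pd.Series(values, dtype="category")

# The categorical stores 3 strings plus one small code per row,
# instead of 900,000 pointer-plus-string entries.
print(as_object.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))
```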
In summary:
- Use memory_usage(deep=True) on a DataFrame or Series to get mostly-accurate memory usage.
- To measure peak memory usage accurately, including temporary objects you might not think of, consider using Fil.
- Python strings use a lot of memory! Consider alternatives like categoricals when you can.