Reducing Pandas memory usage #1: lossless compression
You’re loading a CSV into Pandas, and it’s using too much RAM: your program crashes if you load the whole thing. How do you reduce memory usage without changing any of your processing code?
In this article I’ll show you how to reduce the memory your DataFrame uses at the time it is initially loaded, using four different techniques:
- Dropping columns
- Lower-range numerical dtypes
- Categoricals
- Sparse columns
Technique #1: Don’t load all the columns
Quite often the CSV you’re loading will include columns you don’t actually use when processing the data. If you don’t use them, there’s no point in loading them!
In the following example, I am loading the full CSV, checking how much memory is used by the DataFrame, and then shrinking down to just the columns I’m interested in:
>>> import pandas as pd
>>> df = pd.read_csv("voters.csv")
>>> df.info(verbose=False, memory_usage="deep")
...
memory usage: 71.2 MB
>>> df = df[["First Name ", "Last Name "]]
>>> df.info(verbose=False, memory_usage="deep")
...
memory usage: 8.3 MB
Peak memory usage is 71MB, even though we’re only really using 8MB of data. Note that these usage numbers are somewhat inaccurate; the important thing is the ratio.
But since we know in advance that we only need those two columns, there's no reason to load everything: we can load only the columns we care about, reducing peak memory usage to 8MB:
>>> df = pd.read_csv(
... "voters.csv", usecols=["First Name ", "Last Name "])
>>> df.info(verbose=False, memory_usage="deep")
...
memory usage: 8.3 MB
Technique #2: Shrink numerical columns with smaller dtypes
Another technique can help reduce the memory used by columns that contain only numbers.
Each column in a Pandas DataFrame is a particular data type (dtype).
For example, for integers there is the int64 dtype, int32, int16, and more.
Why does the dtype matter? First, because it affects what values you can store in that column:
- int8 can store integers from -128 to 127.
- int16 can store integers from -32768 to 32767.
- int64 can store integers from -9223372036854775808 to 9223372036854775807.
Second, the larger the range, the more memory is used.
For example, int64 uses 4× as much memory as int16, and 8× as much as int8.
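If you're not sure which dtype is big enough for your data, NumPy can report each integer dtype's range and per-value size; a quick sketch:

>>> import numpy as np
>>> np.iinfo("int8")
iinfo(min=-128, max=127, dtype=int8)
>>> np.dtype("int16").itemsize
2
>>> np.dtype("int64").itemsize
8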
By default when Pandas loads a CSV, it guesses at the dtypes.
If it decides a column's values are all integers, by default it assigns that column the int64 dtype.
As a result, if you know that the numbers in a particular column will never be higher than 32767, you can use an int16 and reduce the memory usage of that column by 75%.
And if the values will never be higher than 127, you can use an int8, using even less memory.
For example, here we see how much memory a column uses (measured with memory_usage()) when loaded with the default int64 vs. when we load it with int8.
We know that int8 won't lose any data because the numbers in the column only range from 1 to 11:
>>> df = pd.read_csv("voters.csv")
>>> df["Ward Number "].memory_usage(index=False, deep=True)
550688
>>> df["Ward Number "].max()
11
>>> df["Ward Number "].min()
1
>>> df = pd.read_csv("voters.csv", dtype={"Ward Number ": "int8"})
>>> df["Ward Number "].memory_usage(index=False, deep=True)
68836
Notice how much less memory the column is using.
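If you'd rather not hard-code the dtype, a related option is to downcast the column after loading with pandas.to_numeric(); a brief sketch:

>>> df = pd.read_csv("voters.csv")
>>> # downcast="integer" picks the smallest integer dtype that fits the values
>>> df["Ward Number "] = pd.to_numeric(df["Ward Number "], downcast="integer")
>>> df["Ward Number "].dtype
dtype('int8')

Unlike passing dtype= to read_csv(), this shrinks the column only after the full-size data has already been loaded, so it won't lower peak memory usage.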
Technique #3: Shrink categorical data using Categorical dtypes
What about non-numerical data? In some cases you can shrink those columns as well.
In the voter registration database I'm using as an example, there is a column for party affiliation, parsed by default as a string, with only a fixed number of possible values. Political parties don't get formed overnight, after all.
>>> set(df["Party Affiliation "])
{'Q ', 'S ', 'L ', 'R ', 'H ', 'BB', 'D ', 'K ', 'O ', 'X ', 'A ', 'Z ', 'EE', 'F ', 'P ', 'G ', 'T ', 'CC', 'J ', 'AA', 'Y ', 'U '}
>>> df["Party Affiliation "].memory_usage(index=False, deep=True)
4061324
Even though the values are limited to a specific set, they are still stored as arbitrary strings, which have overhead in memory.
Imagine a gender column that only says "FEMALE", "MALE", and "NON-BINARY" over and over again; that's a lot of memory being used to store the same three strings.
A more compact representation for data with only a limited number of values is a custom dtype called Categorical, whose memory usage is tied to the number of different values.
When we load the CSV, we can specify that a particular column uses this dtype, and the resulting memory usage is much smaller, 69KB instead of 4MB:
>>> df = pd.read_csv(
... "voters.csv", dtype={"Party Affiliation ": "category"})
>>> df["Party Affiliation "].head()
0 U
1 U
2 U
3 R
4 U
Name: Party Affiliation , dtype: category
Categories (22, object): [A, AA, CC, D, ..., Y, Z, BB, EE]
>>> df["Party Affiliation "].memory_usage(index=False, deep=True)
70774
That’s much smaller.
Technique #4: Sparse series
If you have a column with lots of empty values, usually represented as NaNs, you can save memory by using a sparse column representation.
It won't waste memory storing all those empty values.
For example, about 80% of this column consists of empty values:
>>> df = pd.read_csv("voters.csv")
>>> series = df["Mailing Address - Apartment Number "]
>>> series.memory_usage(index=False, deep=True)
2623975
>>> len(series)
68836
>>> len(series.dropna())
13721
We can convert the series to a sparse series, and then it’ll use less memory:
>>> sparse_series = series.astype("Sparse[str]")
>>> len(sparse_series)
68836
>>> sparse_series.memory_usage(index=False, deep=True)
2237939
In this case the memory savings aren’t huge (2.6MB → 2.2MB), but there is some saving; for integers or other number-like types the savings would be more significant.
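To see why number-like columns benefit more, here's a sketch using a synthetic, mostly-NaN float column of the same length (the data is made up for illustration, and exact byte counts may vary across Pandas versions):

>>> import numpy as np
>>> # 55,000 missing values plus 13,836 real ones, matching the column length above
>>> dense = pd.Series([np.nan] * 55_000 + [1.0] * 13_836)
>>> dense.memory_usage(index=False, deep=True)
550688
>>> dense.astype("Sparse[float64]").memory_usage(index=False, deep=True)
166032

Only the 13,836 actual values (plus their integer positions) are stored, instead of all 68,836 float64 slots.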
Unfortunately it’s not yet possible to use read_csv() to load a column directly into a sparse dtype.
Compression is your friend
As you’ve seen, simply by changing a couple of arguments to pandas.read_csv(), you can significantly shrink the amount of memory your DataFrame uses.
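For example, a single call combining the techniques above might look like this (a sketch using the column names from the example dataset; adjust for your own data):

>>> df = pd.read_csv(
...     "voters.csv",
...     usecols=["Ward Number ", "Party Affiliation "],
...     dtype={"Ward Number ": "int8", "Party Affiliation ": "category"},
... )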
Same data, less RAM: that’s the beauty of compression.
Need even more memory reduction? You can use lossy compression or process your data in chunks.