Don’t bother trying to estimate Pandas memory usage
You have a file with data you want to process with Pandas, and you want to make sure you won’t run out of memory. How do you estimate memory usage given the file size?
At times you may see estimates like these:
- “Have 5 to 10 times as much RAM as the size of your dataset”, or
- “several times the size of your dataset”, or
- 2×-3× the size of the dataset.
All of these estimates can both under- and over-estimate memory usage, depending on the situation. In fact, I will go so far as to say that estimating memory usage is just not worth doing.
In particular, this article will:
- Demonstrate the very broad range of memory usage you will see just from loading the data, before any processing is done.
- Cover alternative approaches to estimation: measurement and streaming.
“Processing” is too broad to be meaningful
What’s a good heuristic for estimating memory usage needed to process a dataset of a given file size? It’s hard to say, since processing data can mean many things. To give two extremes:
- Maybe you’re calculating the maximum value of a column, in which case memory usage is basically just whatever was needed to load the data.
- Alternatively, you might be doing a cross join between two columns of length M and N, in which case memory usage is going to be proportional to M×N; good luck if those are large numbers. (A quick sketch follows this list.)
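For example, here’s a hypothetical sketch of the cross-join extreme; the column names and sizes are made up for illustration, but the point is that the result has one row per pair, so memory grows with M×N:

import pandas as pd

left = pd.DataFrame({"m": range(1_000)})   # M = 1,000 rows
right = pd.DataFrame({"n": range(1_000)})  # N = 1,000 rows

# A cross join produces one row per (left, right) pair: M×N rows in total.
result = left.merge(right, how="cross")
print(len(result))  # 1,000,000 rows; double M and N and this quadruples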
So just to begin with, estimating memory usage is extremely situation-specific. You need to understand what your code is doing and the details of Pandas’ implementation. Generic multiples are unlikely to be helpful.
But there is some baseline information that would be useful: the amount of memory needed for a given DataFrame or Series.
That is a key input in estimating memory usage for both our examples above.
So let’s ask a simpler question. What’s a good heuristic for estimating the memory usage needed to load a dataset of a given file size? The answer: 🤷 it depends.
Memory usage can be much smaller than file size
Sometimes, memory usage will be much smaller than the size of the input file. Let’s generate a million-row CSV with three numeric columns; the first column will range from 0 to 100, the second from 0 to 10,000, and the third from 0 to 1,000,000.
import csv
from random import randint

# One million rows, three integer columns of increasing magnitude.
with open("3columns.csv", "w") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["A", "B", "C"])
    for _ in range(1_000_000):
        writer.writerow(
            [
                randint(0, 100),
                randint(0, 10_000),
                randint(0, 1_000_000),
            ]
        )
The resulting file is 18MB.
Let’s load the data into appropriately sized dtypes, and measure memory usage:
import pandas as pd
import numpy as np

df = pd.read_csv(
    "3columns.csv",
    dtype={"A": np.uint8, "B": np.uint16, "C": np.uint32},
)
print(
    "{:.1f}".format(df.memory_usage(deep=True).sum() / (1024 * 1024))
)
The resulting memory usage when we run the above: 6.7MB. That’s about 0.4× the original file size!
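Note that this ratio depends heavily on choosing small dtypes up front. As a point of comparison, here’s a sketch assuming the same 3columns.csv: letting read_csv() infer dtypes gives you int64 columns, roughly 3 columns × 8 bytes × 1,000,000 rows ≈ 23MB, more than the 18MB file.

import pandas as pd

df_default = pd.read_csv("3columns.csv")  # no dtype= argument
print(df_default.dtypes)  # all three columns load as int64
print(
    "{:.1f}".format(df_default.memory_usage(deep=True).sum() / (1024 * 1024))
)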
Maybe Parquet can help?
When representing numbers as human-readable text, as is the case with CSVs, the data can take more bytes than the in-memory representation. For example, ~90% of our random values between 0 and 1,000,000 will have 6 digits, and therefore require 6 bytes in the CSV. In memory, however, each np.uint32 only requires 4 bytes. Plus, a CSV has all those commas and newlines.
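A quick way to see that per-value gap (a minimal sketch; the example number is arbitrary):

import numpy as np

print(np.dtype(np.uint32).itemsize)  # 4 bytes per value in memory
print(len(str(987_654)))             # 6 bytes as text in the CSV, plus a comma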
So perhaps a different file format would give us a better way to estimate memory usage from file size? A Parquet file, for example, stores data in a way that matches the in-memory representation much more closely.
And in fact, the equivalent Parquet file for our example above is 7MB, essentially the same as the memory usage. So is Parquet file size a good proxy for memory usage?
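To reproduce that comparison, here’s a minimal sketch, assuming the df we loaded above; the exact file size will vary somewhat with Parquet settings such as compression.

import os

df.to_parquet("3columns.parquet")
file_mb = os.path.getsize("3columns.parquet") / (1024 * 1024)
memory_mb = df.memory_usage(deep=True).sum() / (1024 * 1024)
print("{:.1f}MB on disk, {:.1f}MB in memory".format(file_mb, memory_mb))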
Memory usage can be much larger than file size
Parquet file size can be misleading
Let’s load another Parquet file I have lying around, from a previous article. The file is 20MB. The memory usage when loaded is… 300MB.
Part of the reason: Parquet can compress data on disk. Our original example used random data, which means compression doesn’t help much. This particular file’s data is much more structured, and therefore compresses much better. It’s difficult to generalize about how well data will compress, but we can certainly imagine cases where it will compress even more.
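As a hypothetical illustration (the file names and values here are made up), the same number of int64 values can produce wildly different Parquet file sizes depending on how much redundancy the data has:

import os
import numpy as np
import pandas as pd

random_df = pd.DataFrame({"x": np.random.randint(0, 1_000_000_000, 1_000_000)})
repetitive_df = pd.DataFrame({"x": np.repeat(np.arange(100), 10_000)})

random_df.to_parquet("random.parquet")          # little redundancy: compresses poorly
repetitive_df.to_parquet("repetitive.parquet")  # long runs of repeats: compresses very well

for path in ["random.parquet", "repetitive.parquet"]:
    print(path, round(os.path.getsize(path) / (1024 * 1024), 1), "MB")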
The other reason data might be smaller on disk than in memory: string representation.
Strings are fun
Can an uncompressed Parquet file help us estimate the memory usage? For numeric data, we might actually reasonably expect a 1-to-1 ratio. For strings… not necessarily.
Let’s create an uncompressed Parquet file with a string column; the strings look like "A B A D E":
from itertools import product

import pandas as pd

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
data = {
    "strings": [
        " ".join(values) for values in product(
            *([alphabet] * 5)
        )
    ]
}
df = pd.DataFrame(data)
print(
    "{:.1f}".format(df.memory_usage(deep=True).sum() / (1024 * 1024))
)
df.to_parquet("strings.parquet", compression=None)
The file size is 148MB, with no compression. The memory usage is 748MB, 5× larger. The difference is because Pandas and Parquet represent strings differently: Parquet stores them as UTF-8 (as Arrow does), while Pandas by default stores each string as a Python object, with significant per-string overhead.
To make things even more complicated, Python string memory usage also varies depending on whether interning happens (a CPython implementation detail that may change across Python versions), and on the characters involved: CPython stores strings with 1, 2, or 4 bytes per character depending on the widest character present, so a pure-ASCII string, one containing a Chinese character, and one containing an emoji all take different amounts of memory.
>>> import sys
>>> sys.getsizeof("0123456789")
59
>>> sys.getsizeof("012345678惑")
94
>>> sys.getsizeof("012345678😕")
116
Another caveat: inefficient loading
We’ve been measuring DataFrame memory usage, and using it as a proxy for the memory usage of loading the data. In some cases, however, loading the data into the DataFrame can take much more memory than the final DataFrame does.
For example:
- Loading a Parquet file with PyArrow might double memory usage compared to using fastparquet.
- Loading data from a SQL database can be inefficient if you’re not careful.
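One way to see this kind of gap on Linux is to compare the process’s peak memory after loading with the final DataFrame’s size. A rough sketch, using strings.parquet from earlier (note that ru_maxrss is reported in different units on macOS):

import resource

import pandas as pd

df = pd.read_parquet("strings.parquet", engine="pyarrow")

# On Linux, ru_maxrss is the peak resident set size in KiB.
peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
final_mb = df.memory_usage(deep=True).sum() / (1024 * 1024)
print("peak process memory: {:.0f}MB, final DataFrame: {:.0f}MB".format(peak_mb, final_mb))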
File size will not tell you memory usage
Let’s summarize our examples so far:
| File | Size in memory (MB) | Size on disk (MB) | Ratio |
|---|---|---|---|
| 3columns.csv | 7 | 18 | 0.4× |
| 3columns.parquet | 7 | 7 | 1.0× |
| strings.parquet | 748 | 148 | 5.0× |
| MBTA.parquet | 300 | 20 | 15.0× |
Memory usage can be…

- …smaller than the file size.
- …the same.
- …much larger.
There is no general heuristic that will tell you how much Pandas memory usage to expect just for loading a file, let alone how much memory you will need to process the data.
It would not surprise me if uncompressed Parquet files do give you a decent estimate of Polars (not Pandas!) memory usage in eager mode, but I haven’t done the research to validate this guess.
The alternatives: measurement and chunking/streaming
Given that estimation is often impractical, what can you do instead?
Measuring memory usage
Instead of guessing how much memory your data processing will need, you can measure with a profiler.
- Memray or Fil will help you find which part of your code is using the most memory.
- Sciagraph will do both memory and performance profiling, and aims to have low enough overhead that you can also run it in production.
These tools can help you find the actual memory bottlenecks in your code, so you can then reduce them if necessary. If your data doesn’t fit in memory at the moment, you’ll need to do this measurement with a subset of the data, or possibly Fil’s out-of-memory detection might help.
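For example, here’s a minimal sketch of wrapping the loading step with Memray’s Python API, assuming Memray is installed and reusing strings.parquet from earlier; you’d then inspect load_profile.bin with Memray’s reporting tools, such as its flamegraph report.

import memray
import pandas as pd

# Record every allocation made while loading, for later inspection.
with memray.Tracker("load_profile.bin"):
    df = pd.read_parquet("strings.parquet")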
Structuring your code to use chunking or streaming
Alternatively, you can just not use as much memory. By restructuring your code to use streaming (aka “chunking”), you can process files of any size with a fixed amount of memory. You can do this:
- Directly in Pandas.
- With Dask’s wrapper for Pandas.
- By switching to Polars’ lazy mode, combined with the streaming option.
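For example, here’s a minimal sketch of the first option, chunking directly in Pandas: it computes the maximum of column "C" in 3columns.csv while only ever holding one chunk in memory at a time.

import pandas as pd

overall_max = None
# chunksize= makes read_csv() return an iterator of smaller DataFrames.
for chunk in pd.read_csv("3columns.csv", chunksize=100_000):
    chunk_max = chunk["C"].max()
    overall_max = chunk_max if overall_max is None else max(overall_max, chunk_max)
print(overall_max)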