Reducing Pandas memory usage #3: Reading in chunks
Sometimes your data file is so large you can’t load it into memory at all, even with compression. So how do you process it quickly?
By loading and then processing the data in chunks, you can load only part of the file into memory at any given time. And that means you can process files that don’t fit in memory.
Let’s see how you can do this with Pandas’ chunksize option.
Reading the full file
We’ll start with a program that just loads a full CSV into memory. In particular, we’re going to write a little program that loads a voter registration database, and measures how many voters live on every street in the city:
import pandas

voters_street = pandas.read_csv(
    "voters.csv")["Residential Address Street Name "]
print(voters_street.value_counts())
If we run it we get:
$ python voter-by-street-1.py
MASSACHUSETTS AVE 2441
MEMORIAL DR 1948
HARVARD ST 1581
RINDGE AVE 1551
CAMBRIDGE ST 1248
...
NEAR 111 MOUNT AUBURN ST 1
SEDGEWICK RD 1
MAGAZINE BEACH PARK 1
WASHINGTON CT 1
PEARL ST AND MASS AVE 1
Name: Residential Address Street Name , Length: 743, dtype: int64
Where is memory being spent? As you would expect, the bulk of memory usage is allocated by loading the CSV into memory.
In the following graph of peak memory usage, the width of each section indicates what percentage of peak memory it accounts for:
- The section on the left is the CSV read.
- The narrower section on the right is memory used importing all the various Python modules, in particular Pandas; unavoidable overhead, basically.
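If you want a rough measurement of your own (the graph above was produced with a memory profiler), here’s a minimal sketch using Python’s built-in tracemalloc module. Note that tracemalloc only sees allocations routed through Python’s allocator, so it can under-count memory used inside C extensions:

import tracemalloc
import pandas

tracemalloc.start()

voters_street = pandas.read_csv(
    "voters.csv")["Residential Address Street Name "]
print(voters_street.value_counts())

# get_traced_memory() returns (current, peak) in bytes.
current, peak = tracemalloc.get_traced_memory()
print(f"Peak traced memory: {peak / 1_000_000:.1f} MB")
tracemalloc.stop()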
You don’t have to read it all
As an alternative to reading everything into memory, Pandas allows you to read data in chunks. In the case of CSV, we can load only some of the lines into memory at any given time.
In particular, if we use the chunksize argument to pandas.read_csv, we get back an iterator over DataFrames, rather than one single DataFrame. Each DataFrame is the next 1000 lines of the CSV:
import pandas

result = None
for chunk in pandas.read_csv("voters.csv", chunksize=1000):
    voters_street = chunk[
        "Residential Address Street Name "]
    chunk_result = voters_street.value_counts()
    if result is None:
        result = chunk_result
    else:
        result = result.add(chunk_result, fill_value=0)

result.sort_values(ascending=False, inplace=True)
print(result)
When we run this we get basically the same results:
$ python voter-by-street-2.py
MASSACHUSETTS AVE 2441.0
MEMORIAL DR 1948.0
HARVARD ST 1581.0
RINDGE AVE 1551.0
CAMBRIDGE ST 1248.0
...
If we look at memory usage, we’ve reduced it so much that it’s now dominated by importing Pandas; the actual processing code barely uses anything:
The MapReduce idiom
Taking a step back, what we have here is a highly simplified instance of the MapReduce programming model. While typically used in distributed systems, where chunks are processed in parallel and therefore handed out to worker processes or even worker machines, you can still see it at work in this example.
In the simple form we’re using, MapReduce chunk-based processing has just two steps:
- For each chunk you load, you map or apply a processing function.
- Then, as you accumulate results, you “reduce” them by combining partial results into the final result.
We can re-structure our code to make this simplified MapReduce model more explicit:
import pandas
from functools import reduce

def get_counts(chunk):
    voters_street = chunk[
        "Residential Address Street Name "]
    return voters_street.value_counts()

def add(previous_result, new_result):
    return previous_result.add(new_result, fill_value=0)

# MapReduce structure:
chunks = pandas.read_csv("voters.csv", chunksize=1000)
processed_chunks = map(get_counts, chunks)
result = reduce(add, processed_chunks)
result.sort_values(ascending=False, inplace=True)
print(result)
Both reading chunks and map() are lazy, only doing work when they’re iterated over. As a result, chunks are only loaded into memory on demand when reduce() starts iterating over processed_chunks.
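If you want to see this laziness for yourself, here’s a minimal sketch; the get_counts_verbose() wrapper is hypothetical, added only for illustration. Nothing is printed until reduce() starts pulling chunks through the pipeline:

import pandas
from functools import reduce

def get_counts(chunk):
    return chunk["Residential Address Street Name "].value_counts()

def add(previous_result, new_result):
    return previous_result.add(new_result, fill_value=0)

def get_counts_verbose(chunk):
    # Hypothetical wrapper: report when a chunk is actually processed.
    print(f"processing a chunk of {len(chunk)} rows")
    return get_counts(chunk)

chunks = pandas.read_csv("voters.csv", chunksize=1000)
processed_chunks = map(get_counts_verbose, chunks)
# Nothing has been printed yet: no chunk has been read or processed.
result = reduce(add, processed_chunks)
# Only while reduce() iterates do the "processing a chunk ..." lines
# appear, one chunk at a time.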
Note: Whether or not any particular tool or technique will help depends on where the actual memory bottlenecks are in your software.
Need to identify the memory and performance bottlenecks in your own Python data processing code? Try the Sciagraph profiler, which supports profiling both in development and in production, on macOS and Linux, and has built-in Jupyter support.
From full reads to chunked reads
You’ll notice in the code above that get_counts() could just as easily have been used in the original version, which read the whole CSV into memory:
def get_counts(chunk):
    voters_street = chunk[
        "Residential Address Street Name "]
    return voters_street.value_counts()

result = get_counts(pandas.read_csv("voters.csv"))
That’s because reading everything at once is a simplified version of reading in chunks: you only have one chunk, and therefore don’t need a reducer function.
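To put it another way, if you feed the MapReduce-structured code a single chunk, the reducer never even runs: reduce() with a one-element sequence just returns that element. A minimal sketch, reusing the get_counts() and add() definitions from above:

import pandas
from functools import reduce

def get_counts(chunk):
    return chunk["Residential Address Street Name "].value_counts()

def add(previous_result, new_result):
    return previous_result.add(new_result, fill_value=0)

# The whole CSV as one "chunk": with a single item, reduce() never calls
# add() -- it just returns that item unchanged.
whole_file = [pandas.read_csv("voters.csv")]
result = reduce(add, map(get_counts, whole_file))
print(result)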
So here’s how you can go from code that reads everything at once to code that reads in chunks (a generic sketch of the pattern follows these steps):
- Separate the code that reads the data from the code that processes the data.
- Use the new processing function, by mapping it across the results of reading the file chunk-by-chunk.
- Figure out a reducer function that can combine the processed chunks into a final result.
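Putting the three steps together, here’s a minimal sketch of that pattern as a generic helper. The process_in_chunks() function is hypothetical, not part of Pandas; it just wires the read / map / reduce steps together:

import pandas
from functools import reduce

def process_in_chunks(path, process, combine, chunksize=1000):
    # Map `process` over each chunk of the CSV, then reduce with `combine`.
    chunks = pandas.read_csv(path, chunksize=chunksize)
    return reduce(combine, map(process, chunks))

# Step 1: the processing function, unchanged from the full-read version.
def get_counts(chunk):
    return chunk["Residential Address Street Name "].value_counts()

# Step 3: the reducer that combines partial results.
def add(previous_result, new_result):
    return previous_result.add(new_result, fill_value=0)

# Step 2: map the processing function across the chunks, then reduce.
result = process_in_chunks("voters.csv", get_counts, add)
print(result.sort_values(ascending=False))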