Reducing Pandas memory usage #3: Reading in chunks
Sometimes your data file is so large you can’t load it into memory at all, even with compression. So how do you process it quickly?
By loading and then processing the data in chunks, you can load only part of the file into memory at any given time. And that means you can process files that don’t fit in memory.
Let’s see how you can do this with Pandas’
Reading the full file
We’ll start with a program that just loads a full CSV into memory. In particular, we’re going to write a little program that loads a voter registration database, and measures how many voters live on every street in the city:
import pandas voters_street = pandas.read_csv( "voters.csv")["Residential Address Street Name "] print(voters_street.value_counts())
If we run it we get:
$ python voter-by-street-1.py MASSACHUSETTS AVE 2441 MEMORIAL DR 1948 HARVARD ST 1581 RINDGE AVE 1551 CAMBRIDGE ST 1248 ... NEAR 111 MOUNT AUBURN ST 1 SEDGEWICK RD 1 MAGAZINE BEACH PARK 1 WASHINGTON CT 1 PEARL ST AND MASS AVE 1 Name: Residential Address Street Name , Length: 743, dtype: int64
Where is memory being spent? As you would expect, the bulk of memory usage is allocated by loading the CSV into memory.
In the following graph of peak memory usage, the width of the bar indicates what percentage of the memory is used:
- The section on the left is the CSV read.
- The narrower section on the right is memory used importing all the various Python modules, in particular Pandas; unavoidable overhead, basically.
You don’t have to read it all
As an alternative to reading everything into memory, Pandas allows you to read data in chunks. In the case of CSV, we can load only some of the lines into memory at any given time.
In particular, if we use the
chunksize argument to
pandas.read_csv, we get back an iterator over
DataFrames, rather than one single
DataFrame is the next 1000 lines of the CSV:
import pandas result = None for chunk in pandas.read_csv("voters.csv", chunksize=1000): voters_street = chunk[ "Residential Address Street Name "] chunk_result = voters_street.value_counts() if result is None: result = chunk_result else: result = result.add(chunk_result, fill_value=0) result.sort_values(ascending=False, inplace=True) print(result)
When we run this we get basically the same results:
$ python voter-by-street-2.py MASSACHUSETTS AVE 2441.0 MEMORIAL DR 1948.0 HARVARD ST 1581.0 RINDGE AVE 1551.0 CAMBRIDGE ST 1248.0 ...
If we look at the memory usage, we’ve reduced memory usage so much that the memory usage is now dominated by importing Pandas; the actual code barely uses anything:
The MapReduce idiom
Taking a step back, what we have here is an highly simplified instance of the MapReduce programming model. While typically used in distributed systems, where chunks are processed in parallel and therefore handed out to worker processes or even worker machines, you can still see it at work in this example.
In the simple form we’re using, MapReduce chunk-based processing has just two steps:
- For each chunk you load, you map or apply a processing function.
- Then, as you accumulate results, you “reduce” them by combining partial results into the final result.
We can re-structure our code to make this simplified MapReduce model more explicit:
import pandas from functools import reduce def get_counts(chunk): voters_street = chunk[ "Residential Address Street Name "] return voters_street.value_counts() def add(previous_result, new_result): return previous_result.add(new_result, fill_value=0) # MapReduce structure: chunks = pandas.read_csv("voters.csv", chunksize=1000) processed_chunks = map(get_counts, chunks) result = reduce(add, processed_chunks) result.sort_values(ascending=False, inplace=True) print(result)
Both reading chunks and
map() are lazy, only doing work when they’re iterated over.
As a result, chunks are only loaded in to memory on-demand when
reduce() starts iterating over
Note: Whether or not any particular tool or technique will help depends on where the actual memory bottlenecks are in your software.
Need to identify the memory and performance bottlenecks in your own Python data processing code? Try the Sciagraph profiler, with support for profiling both in development and production macOS and Linux, and with built-in Jupyter support.
From full reads to chunked reads
You’ll notice in the code above that
get_counts() could just as easily have been used in the original version, which read the whole CSV into memory:
def get_counts(chunk): voters_street = chunk[ "Residential Address Street Name "] return voters_street.value_counts() result = get_counts(pandas.read_csv("voters.csv"))
That’s because reading everything at once is a simplified version of reading in chunks: you only have one chunk, and therefore don’t need a reducer function.
So here’s how you can go from code that reads everything at once to code that reads in chunks:
- Separate the code that reads the data from the code that processes the data.
- Use the new processing function, by mapping it across the results of reading the file chunk-by-chunk.
- Figure out a reducer function that can combine the processed chunks into a final result.
Learn even more techniques for reducing memory usage—read the rest of the Larger-than-memory datasets guide for Python.
Find performance and memory bottlenecks in your data processing code with the Sciagraph profiler
Slow-running jobs waste your time during development, impede your users, and increase your compute costs. Speed up your code and you’ll iterate faster, have happier users, and stick to your budget—but first you need to identify the cause of the problem.
Find performance bottlenecks and memory hogs in your data science Python jobs with the Sciagraph profiler. Profile in development and production, with multiprocessing support, on macOS and Linux, with built-in support for Jupyter notebooks.
Learn practical Python software engineering skills you can use at your job
Sign up for my newsletter, and join over 6900 Python developers and data scientists learning practical tools and techniques, from Python performance to Docker packaging, with a free new article in your inbox every week.