Reducing Pandas memory usage #3: Reading in chunks
Sometimes your data file is so large you can’t load it into memory at all, even with compression. So how do you process it quickly?
By loading and then processing the data in chunks, you can load only part of the file into memory at any given time. And that means you can process files that don’t fit in memory.
Let’s see how you can do this with Pandas.
Reading the full file
We’ll start with a program that just loads a full CSV into memory. In particular, we’re going to write a little program that loads a voter registration database, and measures how many voters live on every street in the city:
import pandas voters_street = pandas.read_csv( "voters.csv")["Residential Address Street Name "] print(voters_street.value_counts())
If we run it we get:
$ python voter-by-street-1.py MASSACHUSETTS AVE 2441 MEMORIAL DR 1948 HARVARD ST 1581 RINDGE AVE 1551 CAMBRIDGE ST 1248 ... NEAR 111 MOUNT AUBURN ST 1 SEDGEWICK RD 1 MAGAZINE BEACH PARK 1 WASHINGTON CT 1 PEARL ST AND MASS AVE 1 Name: Residential Address Street Name , Length: 743, dtype: int64
Where is memory being spent? As you would expect, the bulk of memory usage is allocated by loading the CSV into memory.
In the following graph of peak memory usage, the width of the bar indicates what percentage of the memory is used:
- The section on the left is the CSV read.
- The narrower section on the right is memory used importing all the various Python modules, in particular Pandas; unavoidable overhead, basically.
You don’t have to read it all
As an alternative to reading everything into memory, Pandas allows you to read data in chunks. In the case of CSV, we can load only some of the lines into memory at any given time.
In particular, if we use the
chunksize argument to
pandas.read_csv, we get back an iterator over
DataFrames, rather than one single
DataFrame is the next 1000 lines of the CSV:
import pandas result = None for chunk in pandas.read_csv("voters.csv", chunksize=1000): voters_street = chunk[ "Residential Address Street Name "] chunk_result = voters_street.value_counts() if result is None: result = chunk_result else: result = result.add(chunk_result, fill_value=0) result.sort_values(ascending=False, inplace=True) print(result)
When we run this we get basically the same results:
$ python voter-by-street-2.py MASSACHUSETTS AVE 2441.0 MEMORIAL DR 1948.0 HARVARD ST 1581.0 RINDGE AVE 1551.0 CAMBRIDGE ST 1248.0 ...
If we look at the memory usage, we’ve reduced memory usage so much that the memory usage is now dominated by importing Pandas; the actual code barely uses anything:
The MapReduce idiom
Taking a step back, what we have here is an highly simplified instance of the MapReduce programming model. While typically used in distributed systems, where chunks are processed in parallel and therefore handed out to worker processes or even worker machines, you can still see it at work in this example.
In the simple form we’re using, MapReduce chunk-based processing has just two steps:
- For each chunk you load, you map or apply a processing function.
- Then, as you accumulate results, you “reduce” them by combining partial results into the final result.
We can re-structure our code to make this simplified MapReduce model more explicit:
import pandas from functools import reduce def get_counts(chunk): voters_street = chunk[ "Residential Address Street Name "] return voters_street.value_counts() def add(previous_result, new_result): return previous_result.add(new_result, fill_value=0) # MapReduce structure: chunks = pandas.read_csv("voters.csv", chunksize=1000) processed_chunks = map(get_counts, chunks) result = reduce(add, processed_chunks) result.sort_values(ascending=False, inplace=True) print(result)
Both reading chunks and
map() are lazy, only doing work when they’re iterated over.
As a result, chunks are only loaded in to memory on-demand when
reduce() starts iterating over
From full reads to chunked reads
You’ll notice in the code above that
get_counts() could just as easily have been used in the original version, which read the whole CSV into memory:
def get_counts(chunk): voters_street = chunk[ "Residential Address Street Name "] return voters_street.value_counts() result = get_counts(pandas.read_csv("voters.csv"))
That’s because reading everything at once is a simplified version of reading in chunks: you only have one chunk, and therefore don’t need a reducer function.
So here’s how you can go from code that reads everything at once to code that reads in chunks:
- Separate the code that reads the data from the code that processes the data.
- Use the new processing function, by mapping it across the results of reading the file chunk-by-chunk.
- Figure out a reducer function that can combine the processed chunks into a final result.
Learn even more techniques for reducing memory usage—read the rest of the Small Big Data guide for Python.
Wasting compute money on processes that use too much memory?
Your Python batch process is using too much memory, and you have no idea which part of your code is responsible.
You need a tool that will tell you exactly where to focus your optimization efforts, a tool designed for data scientists and scientists. Learn how the Fil memory profiler can help you.
Level up your job skills and become a better data scientist
You're not a software engineer, but you still have to deal with everything from bugs to slow code to mysterious errors. Writing software that's maintainable, fast, and easy-to-understand would make you a better data scientists (not to mention more employable).
Subscribe to my newsletter, and every week you’ll get new articles showing you how to improve you software engineering skills, from testing to packaging to performance: