Process large datasets without running out of memory
Table of Contents
- Code structure
- Data management techniques
- Measuring memory usage
Lacking CPU, your program runs slower; lacking memory, your program crashes. But you can process larger-than-RAM datasets in Python, as you’ll learn in the following series of articles.
Code structure
Copying data is wasteful, mutating data is dangerous
Copying data wastes memory, and modifying/mutating data can lead to bugs. Learn how to implement a compromise between the two in Python: hidden mutability.
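The compromise can be sketched as a function that makes one defensive copy on entry, then mutates only that private copy (the function and data here are illustrative):

```python
def normalize(scores):
    """Return scores scaled to sum to 1, without mutating the caller's list."""
    result = list(scores)  # one copy on entry...
    total = sum(result)
    for i, value in enumerate(result):
        result[i] = value / total  # ...then mutate only the private copy
    return result

original = [1, 2, 1]
normalized = normalize(original)  # original is left untouched
```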
Clinging to memory: how Python function calls can increase memory use
Python will automatically free objects that aren’t being used. Sometimes function calls can unexpectedly keep objects in memory; learn why, and how to fix it.
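A minimal illustration of the problem: a local variable keeps a large object reachable for the rest of the function call, even after it's no longer needed. One fix is to drop the reference explicitly (sizes here are arbitrary):

```python
def summarize():
    big = list(range(1_000_000))  # a large intermediate object
    total = sum(big)
    # Without this, `big` stays reachable (and in memory) until the
    # function returns, even though nothing below uses it:
    del big
    # ...further memory-hungry work could safely run here...
    return total

result = summarize()
```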
Massive memory overhead: Numbers in Python and how NumPy helps
Storing integers or floats in Python has a huge overhead in memory. Learn why, and how NumPy makes things better.
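A rough comparison of the overhead, assuming CPython: a list of a million Python floats stores a full object per number plus an 8-byte pointer to each, while NumPy stores one object with 8 bytes per float64:

```python
import sys
import numpy as np

# A million Python floats: each is a full object (~24 bytes on CPython),
# plus the list's array of 8-byte pointers to them.
python_floats = [float(i) for i in range(1_000_000)]
list_bytes = sys.getsizeof(python_floats) + sum(
    sys.getsizeof(f) for f in python_floats
)

# The same values in NumPy: a single object, 8 bytes per float64.
numpy_floats = np.arange(1_000_000, dtype=np.float64)
array_bytes = numpy_floats.nbytes
```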
Too many objects: Reducing memory overhead from Python instances
Objects in Python have large memory overhead. Learn why, and what to do about it: avoiding dicts, fewer objects, and more.
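One of the standard techniques, sketched briefly: `__slots__` removes the per-instance `__dict__`, which adds up when you have millions of instances (class names here are illustrative):

```python
import sys

class PointDict:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointSlots:
    __slots__ = ("x", "y")  # no per-instance __dict__
    def __init__(self, x, y):
        self.x = x
        self.y = y

plain = PointDict(1.0, 2.0)
slotted = PointSlots(1.0, 2.0)

# The plain instance's attributes live in a separate dict object:
plain_bytes = sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
slotted_bytes = sys.getsizeof(slotted)
```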
Data management techniques
Estimating and modeling memory requirements for data processing
Learn how to measure and model memory usage for Python data processing batch jobs based on input size.
When your data doesn’t fit in memory: the basic techniques
You can process data that doesn’t fit in memory by using four basic techniques: spending money, compression, chunking, and indexing.
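Chunking, the most broadly applicable of the four, can be sketched like this: process a bounded number of records at a time, so peak memory depends on the chunk size rather than the file size (the file and numbers are a stand-in for real data):

```python
import os
import tempfile

# Create a sample file, one number per line, standing in for a large input.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    for i in range(100_000):
        f.write(f"{i}\n")
    path = f.name

def chunked_sum(path, chunk_size=10_000):
    """Sum the file's numbers while holding at most chunk_size of them."""
    total = 0
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(int(line))
            if len(chunk) == chunk_size:
                total += sum(chunk)
                chunk.clear()
    total += sum(chunk)  # leftover partial chunk
    return total

total = chunked_sum(path)
os.remove(path)
```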
Processing large JSON files in Python without running out of memory
Loading complete JSON files into Python can use too much memory, leading to slowness or crashes. The solution: process JSON data one chunk at a time.
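One common variant of this, assuming your data is newline-delimited JSON (one record per line); a single giant JSON document would instead need a streaming parser such as ijson:

```python
import json
import os
import tempfile

# Write a sample newline-delimited JSON file (one object per line).
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for i in range(1000):
        f.write(json.dumps({"id": i, "value": i * 2}) + "\n")
    path = f.name

# Parse one record at a time: only a single object is in memory at once.
total = 0
with open(path) as f:
    for line in f:
        record = json.loads(line)
        total += record["value"]
os.remove(path)
```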
Measuring the memory usage of a Pandas DataFrame
Learn how to accurately measure memory usage of your Pandas DataFrame or Series.
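The key pitfall in brief: `memory_usage()` only counts the 8-byte pointers in object columns, not the Python objects they point to; `deep=True` follows the pointers:

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(1000),
    "name": [f"user-{i}" for i in range(1000)],  # object (string) column
})

shallow = df.memory_usage().sum()          # undercounts the strings
deep = df.memory_usage(deep=True).sum()    # includes the string objects
```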
Reducing Pandas memory usage #1: lossless compression
Load a large CSV or other data into Pandas using less memory with techniques like dropping columns, smaller numeric dtypes, categoricals, and sparse columns.
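Two of those techniques in miniature, with made-up data: a smaller numeric dtype when the values fit, and a categorical for a string column with few distinct values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": np.random.randint(0, 100, size=10_000),               # fits in int8
    "city": np.random.choice(["Oslo", "Lima", "Pune"], 10_000),  # 3 distinct values
})

before = df.memory_usage(deep=True).sum()
df["age"] = df["age"].astype("int8")        # 1 byte per value instead of 8
df["city"] = df["city"].astype("category")  # each string stored only once
after = df.memory_usage(deep=True).sum()
```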
Reducing Pandas memory usage #2: lossy compression
Reduce Pandas memory usage by dropping details or data that aren’t as important.
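One example of trading precision for memory, with illustrative data: downcasting float64 to float32 halves the memory, at the cost of keeping only about 7 significant digits:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"measurement": np.random.rand(10_000) * 1000})

before = df.memory_usage(deep=True).sum()
# Lossy: half the memory, reduced precision.
df["measurement"] = df["measurement"].astype("float32")
after = df.memory_usage(deep=True).sum()
```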
Reducing Pandas memory usage #3: Reading in chunks
Reduce Pandas memory usage by loading and then processing a file in chunks rather than all at once.
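In outline: passing `chunksize` makes `read_csv()` return an iterator of DataFrames, so only one chunk is in memory at a time while you accumulate a small result (the CSV here is a generated stand-in):

```python
import os
import tempfile
import pandas as pd

# A sample CSV standing in for a file too big to load at once.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("value\n")
    for i in range(10_000):
        f.write(f"{i}\n")
    path = f.name

total = 0
for chunk in pd.read_csv(path, chunksize=1_000):
    total += chunk["value"].sum()  # keep only the running total
os.remove(path)
```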
Fast subsets of large datasets with Pandas and SQLite
You have a large amount of data, and you want to load only part of it into memory as a Pandas dataframe. One easy way to do it: indexing via a SQLite database.
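The idea in miniature (using an in-memory database and made-up readings; a real use would point at a database file on disk): an index lets SQLite find matching rows without scanning the whole table, and only the subset is loaded into Pandas:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a" if i % 2 else "b", float(i)) for i in range(10_000)],
)
# The index makes the WHERE clause a lookup rather than a full scan.
conn.execute("CREATE INDEX idx_sensor ON readings (sensor)")
conn.commit()

# Load only the rows you need into a DataFrame.
subset = pd.read_sql_query(
    "SELECT * FROM readings WHERE sensor = ?", conn, params=("a",)
)
conn.close()
```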
Loading SQL data into Pandas without running out of memory
Pandas can load data from a SQL query, but the result may use too much memory. Learn how to process data in batches, and reduce memory usage even further.
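Batching in brief, with a toy SQLite table standing in for a real database: `chunksize` turns `read_sql_query()` into an iterator of DataFrames, so only one batch of rows is materialized at a time:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (value INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(10_000)])
conn.commit()

total = 0
for batch in pd.read_sql_query("SELECT value FROM events", conn, chunksize=1_000):
    total += batch["value"].sum()  # one batch in memory at a time
conn.close()
```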
Saving memory with Pandas 1.3’s new string dtype
Storing strings in Pandas can use a lot of memory, but with Pandas 1.3 you have access to a newer, more efficient option.
From chunking to parallelism: faster Pandas with Dask
Learn how Dask can both speed up your Pandas data processing with parallelization, and reduce memory usage with transparent chunking.
Reducing NumPy memory usage with lossless compression
Reduce NumPy memory usage by choosing smaller dtypes, and using sparse arrays.
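The dtype half of that, in two lines: memory scales directly with element size, so a smaller dtype cuts usage proportionally whenever your values fit in it:

```python
import numpy as np

big = np.zeros((1000, 1000))                       # float64: 8 bytes/element
small = np.zeros((1000, 1000), dtype=np.float32)   # 4 bytes/element
tiny = np.zeros((1000, 1000), dtype=np.uint8)      # 1 byte/element
```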
NumPy views: saving memory, leaking memory, and subtle bugs
NumPy uses memory views transparently, as a way to save memory. But you need to understand how they work, so you don’t leak memory, or modify data by mistake.
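Both failure modes in a few lines: a basic slice is a view, so writing through it modifies the original, and a small view can keep a huge parent array alive; `.copy()` breaks the link:

```python
import numpy as np

data = np.arange(10)
view = data[2:5]       # a basic slice is a view: no data is copied
view[0] = 99           # ...so this modifies `data` as well

# A view also keeps its (possibly huge) parent array alive via .base.
independent = data[2:5].copy()  # a real copy, no link to `data`
independent[0] = -1             # does not touch `data`
```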
Loading NumPy arrays from disk: mmap() vs. Zarr/HDF5
If your NumPy array is larger than memory, you can load it transparently from disk using either mmap() or the very similar Zarr and HDF5 file formats.
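The `mmap()` variant in miniature, using a small generated `.npy` file: `mmap_mode="r"` maps the file instead of reading it, so pages are pulled from disk lazily, only when the corresponding elements are accessed:

```python
import os
import tempfile
import numpy as np

# Save a sample array to disk in .npy format.
path = os.path.join(tempfile.mkdtemp(), "data.npy")
np.save(path, np.arange(1_000_000, dtype=np.float64))

# Memory-map the file: nothing is read until elements are accessed.
arr = np.load(path, mmap_mode="r")
first, last = arr[0], arr[-1]   # loads only the touched pages
```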
The mmap() copy-on-write trick: reducing memory usage of array copies
Copying a NumPy array and modifying it doubles the memory usage. But by utilizing the operating system’s mmap() call, you pay only for what you modify.
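A small sketch of the trick via NumPy's interface to it: `mmap_mode="c"` is copy-on-write, so reads come straight from the file and only the pages you actually write to get private in-memory copies; the file on disk is never changed:

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "data.npy")
np.save(path, np.zeros(1_000_000, dtype=np.float64))

# Copy-on-write mapping: modifying one element copies one page, not 8 MB.
arr = np.load(path, mmap_mode="c")
arr[0] = 42.0

reloaded = np.load(path)  # the file on disk is untouched
```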
Measuring memory usage
Measuring memory usage in Python: it’s tricky!
Measuring your Python program’s memory usage is not as straightforward as you might think. Learn two techniques, and the tradeoffs between them.
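One of the techniques, shown with the standard library's `tracemalloc`: it records allocations made through Python's own allocator (allocations made outside it are invisible to it, which is part of why measuring is tricky):

```python
import tracemalloc

tracemalloc.start()
data = [list(range(1000)) for _ in range(100)]  # allocate some objects
current, peak = tracemalloc.get_traced_memory()  # bytes currently held, and peak
tracemalloc.stop()
```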
Fil: a new Python memory profiler for data scientists and scientists
Fil is a Python memory profiler designed specifically for the needs of data scientists and scientists running data processing pipelines.
Debugging Python out-of-memory crashes with the Fil profiler
Debugging Python out-of-memory crashes can be tricky. Learn how the Fil memory profiler can help you find where your memory use is happening.
Dying, fast and slow: out-of-memory crashes in Python
There are many ways Python out-of-memory problems can manifest: slowness due to swapping, crashes, MemoryError, segfaults, kill -9.
Debugging Python server memory leaks with the Fil profiler
When your Python server is leaking memory, the Fil memory profiler can help you spot the buggy code.
How do you process large datasets with limited memory?
Get a free cheatsheet summarizing how to process large amounts of data with limited memory using Python, NumPy, and Pandas.
Plus, every week or so you’ll get new articles showing you how to process large data, and more generally improve your software engineering skills, from testing to packaging to performance:
Fil: A memory profiler for Python (open source)
Paying too much for your compute resources because your Python batch process uses too much memory? The free, open source Fil memory profiler will tell you exactly what you need to know: where your peak memory usage is coming from, so you can optimize your code and lower your costs.
Sciagraph™: Always-on performance and memory profiling for production batch jobs ($)
If your production data processing batch jobs are running too slowly, using too much memory, or costing too much, you need to understand why. Sciagraph is an always-on, production-grade profiler you can use to get immediate insights into your code’s bottlenecks.