Process large datasets without running out of memory

Table of Contents

Lacking CPU, your program runs slower; lacking memory, your program crashes. But you can process larger-than-RAM datasets in Python, as you’ll learn in the following series of articles.

Code structure

  1. Copying data is wasteful, mutating data is dangerous
    Copying data wastes memory, and modifying/mutating data can lead to bugs. Learn how to implement a compromise between the two in Python: hidden mutability.

  2. Clinging to memory: how Python function calls can increase memory use
    Python will automatically free objects that aren’t being used. Sometimes function calls can unexpectedly keep objects in memory; learn why, and how to fix it.

  3. Massive memory overhead: Numbers in Python and how NumPy helps
    Storing integers or floats in Python has a huge overhead in memory. Learn why, and how NumPy makes things better.

  4. Too many objects: Reducing memory overhead from Python instances
    Objects in Python have large memory overhead. Learn why, and what to do about it: avoiding dicts, fewer objects, and more (a small illustrative sketch follows this list).
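
To make that per-instance overhead concrete, here is a minimal sketch (my own illustration, not code from the articles) comparing a regular class to one using __slots__, which avoids allocating a per-instance __dict__:

    import sys

    class PointDict:
        def __init__(self, x, y):
            self.x = x
            self.y = y

    class PointSlots:
        __slots__ = ("x", "y")

        def __init__(self, x, y):
            self.x = x
            self.y = y

    d = PointDict(1.0, 2.0)
    s = PointSlots(1.0, 2.0)

    # The dict-based instance pays for the object *and* its __dict__;
    # the __slots__ instance stores its attributes inline.
    print(sys.getsizeof(d) + sys.getsizeof(d.__dict__))
    print(sys.getsizeof(s))

Exact numbers vary by Python version and platform, but the gap adds up quickly once you have millions of instances.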

Data management techniques

  1. Estimating and modeling memory requirements for data processing
    Learn how to measure and model memory usage for Python data processing batch jobs based on input size.

  2. When your data doesn’t fit in memory: the basic techniques
    You can process data that doesn’t fit in memory by using four basic techniques: spending money, compression, chunking, and indexing (a small chunking sketch follows this list).
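
As a taste of the chunking technique, here is a minimal sketch (the file name and column are placeholders) that processes a large CSV one piece at a time with Pandas instead of loading it all at once:

    import pandas as pd

    # Process a large CSV in 100,000-row chunks; only one chunk is in memory
    # at a time. "measurements.csv" and the "value" column are hypothetical.
    total = 0
    for chunk in pd.read_csv("measurements.csv", chunksize=100_000):
        total += chunk["value"].sum()

    print(total)

Each chunk is an ordinary DataFrame, so any computation you can express per chunk (sums, counts, filtered writes) fits this pattern.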

Pandas

  1. Measuring the memory usage of a Pandas DataFrame
    Learn how to accurately measure memory usage of your Pandas DataFrame or Series.

  2. Reducing Pandas memory usage #1: lossless compression
    Load a large CSV or other data into Pandas using less memory with techniques like dropping columns, smaller numeric dtypes, categoricals, and sparse columns (see the sketch after this list).

  3. Reducing Pandas memory usage #2: lossy compression
    Reduce Pandas memory usage by dropping details or data that aren’t as important.

  4. Reducing Pandas memory usage #3: Reading in chunks
    Reduce Pandas memory usage by loading and then processing a file in chunks rather than all at once.

  5. Fast subsets of large datasets with Pandas and SQLite
    You have a large amount of data, and you want to load only part of it into memory as a Pandas DataFrame. One easy way to do it: indexing via a SQLite database.

  6. Loading SQL data into Pandas without running out of memory
    Pandas can load data from a SQL query, but the result may use too much memory. Learn how to process data in batches, and reduce memory usage even further.

  7. Saving memory with Pandas 1.3’s new string dtype
    Storing strings in Pandas can use a lot of memory, but with Pandas 1.3 you have access to a newer, more efficient option.

  8. From chunking to parallelism: faster Pandas with Dask
    Learn how Dask can both speed up your Pandas data processing with parallelization, and reduce memory usage with transparent chunking.
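
For a feel of what the lossless-compression techniques above look like in practice, here is a minimal sketch (the file and column names are hypothetical) that loads only the needed columns, with smaller numeric dtypes and a categorical for the repetitive string column:

    import pandas as pd

    # "sales.csv" and its columns are placeholders. Load only the columns we
    # need, shrink the numeric dtypes, and store the low-cardinality string
    # column as a categorical.
    df = pd.read_csv(
        "sales.csv",
        usecols=["region", "units", "price"],
        dtype={"region": "category", "units": "int32", "price": "float32"},
    )

    # deep=True includes the actual string/categorical data, not just pointers.
    print(df.memory_usage(deep=True))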

NumPy

  1. Reducing NumPy memory usage with lossless compression
    Reduce NumPy memory usage by choosing smaller dtypes, and using sparse arrays.

  2. NumPy views: saving memory, leaking memory, and subtle bugs
    NumPy uses memory views transparently, as a way to save memory. But you need to understand how they work, so you don’t leak memory, or modify data by mistake.

  3. Loading NumPy arrays from disk: mmap() vs. Zarr/HDF5
    If your NumPy array is larger than memory, you can load it transparently from disk using either mmap() or the very similar Zarr and HDF5 file formats.

  4. The mmap() copy-on-write trick: reducing memory usage of array copies
    Copying a NumPy array and modifying it doubles the memory usage. But by utilizing the operating system’s mmap() call, you pay only for what you modify (a small sketch follows this list).
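
Here is a minimal sketch of that copy-on-write idea using NumPy’s built-in memory mapping (the file name and array size are just for illustration):

    import numpy as np

    # Write a large-ish array to disk (placeholder name and size).
    np.save("big_array.npy", np.zeros((10_000, 1_000)))

    # mmap_mode="c" opens it copy-on-write: reads come straight from the file,
    # and only the pages you actually modify get private in-memory copies.
    arr = np.load("big_array.npy", mmap_mode="c")
    arr[:100] += 1  # only these touched pages cost extra RAM
    print(arr[:2, :2])

Because mode "c" keeps modifications private to the process, the file on disk is never changed.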

Measuring memory usage

  1. Measuring memory usage in Python: it’s tricky!
    Measuring your Python program’s memory usage is not as straightforward as you might think. Learn two techniques, and the tradeoffs between them (a small standard-library measurement sketch follows this list).

  2. Fil: a new Python memory profiler for data scientists and scientists
    Fil is a Python memory profiler designed specifically for the needs of data scientists and scientists running data processing pipelines.

  3. Debugging Python out-of-memory crashes with the Fil profiler
    Debugging Python out-of-memory crashes can be tricky. Learn how the Fil memory profiler can help you find where your memory is being used.

  4. Dying, fast and slow: out-of-memory crashes in Python
    There are many ways Python out-of-memory problems can manifest: slowness due to swapping, crashes, MemoryError, segfaults, kill -9.

  5. Debugging Python server memory leaks with the Fil profiler
    When your Python server is leaking memory, the Fil memory profiler can help you spot the buggy code.
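
As one simple way to start tracking allocations from the standard library (not necessarily the exact technique each article uses), here is a minimal tracemalloc sketch:

    import tracemalloc

    tracemalloc.start()

    # Allocate some data while tracing is active (placeholder workload).
    data = [b"x" * 1000 for _ in range(10_000)]

    current, peak = tracemalloc.get_traced_memory()
    print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
    tracemalloc.stop()

Note that tracemalloc only counts allocations made through Python’s allocator, so its numbers can differ from what the operating system reports for your process.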



How do you process large datasets with limited memory?

Get a free cheatsheet summarizing how to process large amounts of data with limited memory using Python, NumPy, and Pandas.

Plus, every week or so you’ll get new articles showing you how to process large data, and more generally improve your software engineering skills, from testing to packaging to performance.


Products

Fil: A memory profiler for Python (open source)

Paying too much for your compute resources because your Python batch process uses too much memory? The free, open source Fil memory profiler will tell you exactly what you need to know: where your peak memory usage is coming from, so you can optimize your code and lower your costs.

Reduce memory usage with Fil

Fil4prod: Always-on performance and memory profiling for production batch jobs ($)

If your production data processing batch jobs are running too slowly, using too much memory, or costing too much, you need to understand why. Fil4prod is an always-on, production-grade profiler you can use to get immediate insights into your code’s bottlenecks.

Speed up your production batch jobs with Fil4prod