Scaling data processing—without a cluster!

Table of Contents

As described in Alex Voss, Ilia Lvov, and Jon Lewis’s Small Big Data manifesto, you don’t need a Big Data cluster to process large amounts of data; a single computer is often sufficient.

Process large datasets without running out of memory

Lacking CPU, your program runs slower; lacking memory, your program crashes. But you can process larger-than-RAM datasets in Python, as you’ll learn in the following series of articles.

Code structure

  1. Copying data is wasteful, mutating data is dangerous
    Copying data wastes memory, and mutating data in place can lead to bugs. A compromise between the two is “hidden mutability” (sketched after this list).

  2. Clinging to memory: how Python function calls can increase your memory usage
    Python will automatically release memory for objects that aren’t being used. But sometimes function calls can unexpectedly keep objects in memory. Learn about Python memory management, how it interacts with function calls, and what you can do about it.

  3. Massive memory overhead: Numbers in Python and how NumPy helps
    Storing integers or floats in Python has a huge overhead in memory. Learn why, and how NumPy makes things better.

  4. Too many objects: Reducing memory overhead from Python instances
    Objects in Python have large memory overhead; create too many objects, and you’ll use far more memory than you expect. Learn why, and what to do about it.
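
As a taste of the “hidden mutability” compromise from item 1: the function below makes one private copy of its input and then mutates only that copy, so callers never observe mutation, but repeated copying is avoided. The normalize() example is mine, not taken from the article.

    def normalize(values):
        # One defensive copy; everything after this mutates only the copy.
        result = list(values)
        total = sum(result)
        for i, value in enumerate(result):
            result[i] = value / total   # in-place update of the private copy
        return result

    original = [2, 3, 5]
    print(normalize(original))  # [0.2, 0.3, 0.5]
    print(original)             # [2, 3, 5] -- the caller's list is untouched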

Data management techniques

  1. Estimating and modeling memory requirements for data processing
    How much memory does your process actually need? How much will it need for different inputs? Learn how to measure and model memory usage for data processing batch jobs.

  2. When your data doesn’t fit in memory: the basic techniques
    You can still process data that doesn’t fit in memory by using four basic techniques: spending money, compression, chunking, and indexing.
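
As a minimal illustration of the chunking technique from item 2 (the file name and column layout here are hypothetical): read a large CSV a fixed number of rows at a time, reduce each chunk to a small running total, and then discard it.

    def total_of_first_column(path, chunk_size=100_000):
        total = 0.0
        chunk = []
        with open(path) as f:
            next(f)  # skip the header row
            for line in f:
                chunk.append(float(line.split(",")[0]))
                if len(chunk) >= chunk_size:
                    total += sum(chunk)  # reduce the chunk...
                    chunk.clear()        # ...then free its memory
            total += sum(chunk)          # any leftover rows
        return total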

Pandas

  1. Reducing Pandas memory usage #1: lossless compression
    How do you load a large CSV into Pandas without using as much memory? Learn the basic techniques: dropping columns, lower-range numeric dtypes, categoricals, and sparse columns.

  2. Reducing Pandas memory usage #2: lossy compression
    In this article you’ll learn techniques that lose some details in return for reducing memory usage.

  3. Reducing Pandas memory usage #3: Reading in chunks
    By loading and processing a file in chunks, you only need to hold part of it in memory at any given time (see the combined sketch after this list).

  4. Fast subsets of large datasets with Pandas and SQLite
    You have a large amount of data, and you want to load only part into memory as a Pandas dataframe. CSVs won’t cut it: you need a database, and the easiest way to do that is with SQLite.

  5. From chunking to parallelism: faster Pandas with Dask
    Processing your data in chunks lets you reduce memory usage, but it can also speed up your code. Because each chunk can be processed independently, you can process them in parallel, utilizing multiple CPUs. For Pandas (and NumPy), Dask is a great way to do this.
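
Here’s a rough sketch combining the lossless-compression and chunking articles above; the file and column names are made up, but the pandas parameters (usecols, dtype, chunksize) are the standard ones:

    import pandas as pd

    total = 0.0
    for chunk in pd.read_csv(
        "transactions.csv",                 # hypothetical input file
        usecols=["store", "amount"],        # drop columns you don't need
        dtype={"store": "category",         # categorical instead of strings
               "amount": "float32"},        # lower-range numeric dtype
        chunksize=100_000,                  # only one chunk in memory at a time
    ):
        total += chunk["amount"].sum()
    print(total)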

NumPy

  1. Reducing NumPy memory usage with lossless compression
    By changing how you represent your NumPy arrays, you can significantly reduce memory usage: by choosing smaller dtypes and using sparse arrays (see the sketch after this list). You’ll also learn about cases where this won’t help.

  2. Loading NumPy arrays from disk: mmap() vs. Zarr/HDF5
    If your NumPy array doesn’t fit in memory, you can load it transparently from disk using either mmap() or the very similar Zarr and HDF5 file formats. Here’s what they do, and why you’d choose one over the other.

  3. The mmap() copy-on-write trick: reducing memory usage of array copies
    Usually, copying an array and modifying it doubles the memory usage. But by utilizing the operating system’s mmap() call, you only pay the cost for the parts of the copy that you changed.
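
A small sketch combining items 1 and 3, smaller dtypes plus a copy-on-write memory map; the file name is just for the demo:

    import numpy as np

    arr64 = np.ones((1_000, 1_000))       # float64: 8 bytes per element
    arr32 = arr64.astype(np.float32)      # half the memory, if the precision suffices

    np.save("demo.npy", arr32)
    # mmap_mode="c" is copy-on-write: data is read lazily from disk, and only
    # the pages you actually modify get private in-memory copies.
    view = np.load("demo.npy", mmap_mode="c")
    view[:10] *= 2                        # only these touched pages cost extra RAM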

Tools

  1. Fil: a new Python memory profiler for data scientists and scientists
    Fil is a new memory profiler that shows you peak memory usage and where that memory was allocated. It’s designed specifically for the needs of data scientists and scientists running data processing pipelines (a small usage sketch follows this list).

  2. Debugging out-of-memory crashes in Python
    Debugging out-of-memory crashes can be tricky. Learn how the Fil memory profiler can help you pinpoint where the memory is being used.

  3. Debugging Python server memory leaks with the Fil profiler
    When your server is leaking memory, the Fil memory profiler can help you spot the buggy code.
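
For a sense of how you’d use Fil in practice, here is a tiny allocation-heavy script you might profile; as far as I recall, the profiler is invoked from the command line roughly as fil-profile run example.py, but check the Fil documentation for the exact invocation.

    # Save as example.py and run it under Fil to see the peak allocation
    # inside build_features() and the line it came from.
    import numpy as np

    def build_features():
        raw = np.random.rand(2_000, 2_000)   # temporary ~32 MB peak allocation
        return raw.mean(axis=0)

    if __name__ == "__main__":
        build_features()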



How do you process large datasets with limited memory?

Get a free cheatsheet summarizing how to process large amounts of data with limited memory using Python, NumPy, and Pandas.

Plus, every week or so you’ll get new articles showing you how to process large data and, more generally, improve your software engineering skills, from testing to packaging to performance:



Speed up your code

What does speed mean?

  1. Optimizing your code is not the same as parallelizing your code
    To make your Python code faster, you should usually start by optimizing the single-threaded version, then consider multiprocessing, and only then think about a cluster (see the toy example after this list).

  2. Speed is situational: two websites, two orders of magnitude
    How do you make your application fast? It depends very much on your particular case, as you’ll see in this case study.
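
As a toy illustration of the point in item 1 (the example is mine): an algorithmic improvement to the single-threaded code can easily beat the at-most-4× speedup you’d get from throwing four cores at the slow version.

    import time

    N = 10_000_000

    def slow_sum_of_squares():
        total = 0
        for i in range(N):
            total += i * i
        return total

    def fast_sum_of_squares():
        # Closed-form formula for 0^2 + 1^2 + ... + (N-1)^2: no loop at all.
        return (N - 1) * N * (2 * N - 1) // 6

    for fn in (slow_sum_of_squares, fast_sum_of_squares):
        start = time.perf_counter()
        result = fn()
        print(fn.__name__, result, f"{time.perf_counter() - start:.3f}s")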

Measuring performance

  1. Where’s your bottleneck? CPU time vs wallclock time
    A slow runtime isn’t necessarily the CPU’s fault: other causes include I/O, locks, and more. Learn a quick heuristic to help you identify which it is (sketched after this list).

  2. Beyond cProfile: Choosing the right tool for performance optimization
    There are different profilers you can use to measure Python performance. Learn about cProfile, sampling profilers, and logging, and when to use each.
    (Originally a PyGotham 2019 talk—you can also watch a video)

  3. Not just CPU: writing custom profilers for Python
    Sometimes existing profilers aren’t enough: you need to measure something unusual. With Python’s built-in profiling library, you can write your own.

  4. Logging for scientific computing: debugging, performance, trust
    Logging is a great way to understand your code, make it faster, and help you convince yourself and others that you can trust the results. Learn why, with examples built on the Eliot logging library.
    (Originally a PyCon 2019 talk—you can also watch a video)
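
A minimal version of the CPU-time-vs-wallclock heuristic from item 1 (the measure() helper is mine): if CPU time roughly equals wallclock time, the code is CPU-bound; if CPU time is much smaller, the time is going to I/O, sleeping, or waiting on locks.

    import time

    def measure(label, fn):
        wall_start, cpu_start = time.perf_counter(), time.process_time()
        fn()
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        print(f"{label}: wallclock={wall:.2f}s cpu={cpu:.2f}s")

    measure("cpu-bound", lambda: sum(i * i for i in range(5_000_000)))
    measure("waiting", lambda: time.sleep(1))   # wallclock ~1s, CPU ~0s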

Multiprocessing

  1. Why your multiprocessing Pool is stuck (it’s full of sharks!)
    Python’s multiprocessing library is an easy way to use multiple CPUs, but its default configuration can lead to deadlocks and brokenness. Learn why, and how to fix it (a minimal fix is sketched after this list).

  2. The Parallelism Blues: when faster code is slower
    By default NumPy uses multiple CPUs for certain operations. But sometimes that can actually slow down your code.
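
One way to avoid the deadlocks described in the first article is to use the “spawn” start method, which launches fresh worker processes instead of fork()ing the parent; a minimal sketch:

    from multiprocessing import get_context

    def square(x):
        return x * x

    if __name__ == "__main__":
        # "spawn" starts clean worker processes rather than fork()ing, so
        # workers don't inherit locks (e.g. logging locks) in a held state.
        with get_context("spawn").Pool(processes=4) as pool:
            print(pool.map(square, range(10)))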

Libraries

  1. Choosing a faster JSON library for Python
    There are multiple JSON encoding/decoding libraries available for Python. Learn how you can choose the fastest for your particular use case.
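
For example, a quick (and deliberately simplistic) benchmark of the stdlib json module against a third-party library such as orjson, on a document shaped like your own data:

    import json
    import timeit

    import orjson  # third-party: pip install orjson

    doc = {"ids": list(range(1_000)), "name": "example", "scores": [1.5] * 100}

    print("json:  ", timeit.timeit(lambda: json.dumps(doc), number=2_000))
    print("orjson:", timeit.timeit(lambda: orjson.dumps(doc), number=2_000))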

