Software Engineering for Data Scientists

Small Big Data: large data on a single computer

As described in Alex Voss, Ilia Lvov, and Jon Lewis’s Small Big Data manifesto, you don’t need a Big Data cluster to process large amounts of data; a single computer is often sufficient. In this planned series of articles you’ll learn the relevant principles and techniques, and how to apply them to tools like NumPy and Pandas.


  1. When your data doesn’t fit in memory: the basic techniques
    You can still process data that doesn’t fit in memory by using four basic techniques: spending money, compression, chunking, and indexing.

  2. Copying data is wasteful, mutating data is dangerous
    Copying data wastes memory, and modifying or mutating data in-place can lead to bugs. A compromise between the two is “hidden mutability”.
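
One way to picture the "hidden mutability" compromise mentioned above is a function that makes a single private copy and then mutates only that copy. The following is a minimal sketch, not the article's own code: the function name `normalize` and the sample values are hypothetical, and it assumes the input is non-constant so the final division is well defined.

```python
import numpy as np


def normalize(values: np.ndarray) -> np.ndarray:
    """Return a copy of `values` scaled to the range [0, 1].

    Hypothetical helper for illustration; assumes `values` is not constant.
    """
    # One explicit copy up front; astype() copies by default.
    result = values.astype(np.float64)
    # From here on, mutate only the private copy in place. No further
    # temporary arrays are allocated, and the caller's array is untouched.
    result -= result.min()
    result /= result.max()
    return result


data = np.array([2.0, 4.0, 6.0])
scaled = normalize(data)

print(data)    # the original array is unchanged
print(scaled)  # the scaled copy
```

From the caller's point of view the function behaves as if it never mutates anything, yet internally it allocates one array instead of one per arithmetic step.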


  1. Reducing Pandas memory usage #1: lossless compression
    How do you load a large CSV into Pandas while using less memory? Learn the basic techniques: dropping columns, lower-range numeric dtypes, categoricals, and sparse columns.

  2. Reducing Pandas memory usage #2: lossy compression
    In this article you’ll learn techniques that lose some details in return for reducing memory usage.

  3. Reducing Pandas memory usage #3: Reading in chunks
    By loading a file into Pandas in chunks and processing each chunk in turn, you keep only part of the file in memory at any given time.
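
Two of the Pandas techniques listed above, lossless shrinking and chunked reading, can be sketched as follows. This is an illustrative sketch under stated assumptions: the CSV contents, the column names, and the helper names `load_compact` and `mean_temperature_in_chunks` are all hypothetical, and the `int8` dtype is only lossless here because the sample temperatures fit in the range -128 to 127.

```python
import io

import pandas as pd

# Hypothetical CSV contents, kept in memory so the sketch is self-contained.
CSV_TEXT = """city,temperature,comment
Boston,21,ok
Boston,23,ok
Paris,19,ok
Paris,25,ok
Boston,22,ok
"""


def load_compact(source) -> pd.DataFrame:
    """Load only the needed columns, using smaller dtypes."""
    return pd.read_csv(
        source,
        # Drop the unused "comment" column by never loading it.
        usecols=["city", "temperature"],
        # Repeated strings become a categorical; small integers fit in int8.
        dtype={"city": "category", "temperature": "int8"},
    )


def mean_temperature_in_chunks(source, chunksize: int = 2) -> float:
    """Compute a mean while holding only `chunksize` rows in memory at once."""
    total = 0
    count = 0
    for chunk in pd.read_csv(source, usecols=["temperature"], chunksize=chunksize):
        total += int(chunk["temperature"].sum())
        count += len(chunk)
    return total / count


compact = load_compact(io.StringIO(CSV_TEXT))
mean_temp = mean_temperature_in_chunks(io.StringIO(CSV_TEXT))

print(compact.dtypes)
print(mean_temp)
```

With a real file you would pass a path instead of the `io.StringIO` object and pick a much larger `chunksize`. Chunking works here because a mean can be built from running totals; computations that need all rows at once require a different approach.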


  1. Reducing NumPy memory usage with lossless compression
    By changing how you represent your NumPy arrays, you can significantly reduce memory usage, for example by choosing smaller dtypes and using sparse arrays. You’ll also learn about cases where this won’t help.
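
The smaller-dtype idea can be sketched in a few lines. This is a minimal illustration with made-up data, assuming every value fits in the range of the smaller type; sparse arrays are not shown.

```python
import numpy as np

# One million small integers, stored as 64-bit values (8 bytes each).
values = np.arange(1_000_000, dtype=np.int64) % 100

# Check the range first: the conversion is lossless only if every value fits.
assert values.min() >= 0 and values.max() <= np.iinfo(np.uint8).max

# Same numbers, stored as unsigned 8-bit values (1 byte each).
compact = values.astype(np.uint8)

print(values.nbytes)   # 8000000
print(compact.nbytes)  # 1000000
```

The range check matters: converting values that do not fit in the target dtype silently wraps around rather than raising an error, which is one of the cases where a smaller dtype will not help.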

Performance optimization
