Software Engineering for Data Scientists
Small Big Data: large data on a single computer
As described in Alex Voss, Ilia Lvov, and Jon Lewis’s Small Big Data manifesto, you don’t need a Big Data cluster to process large amounts of data; a single computer is often sufficient. In this planned series of articles you’ll learn the relevant principles and techniques, and how to apply them to tools like NumPy and Pandas.
When your data doesn’t fit in memory: the basic techniques
You can still process data that doesn’t fit in memory by using four basic techniques: spending money, compression, chunking, and indexing.
Copying data is wasteful, mutating data is dangerous
Copying data wastes memory, and modifying or mutating data in-place can lead to bugs. A compromise between the two is “hidden mutability”.
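One way to read “hidden mutability” is sketched below: the function makes a single copy, then mutates only that copy in place, so callers never see their data change. The function name `normalize` and the sample values are illustrative assumptions, not taken from the article.

```python
import numpy as np


def normalize(values: np.ndarray) -> np.ndarray:
    """Return a normalized array without modifying the caller's data."""
    # The subtraction allocates one new array; this is the only copy made.
    result = values - values.mean()
    # In-place division: the mutation is hidden inside the function,
    # because nothing outside has a reference to `result` yet.
    result /= result.std()
    return result


original = np.array([1.0, 2.0, 3.0, 4.0])
normalized = normalize(original)
print(original)    # unchanged input
print(normalized)  # mean 0, standard deviation 1
```

From the outside the function behaves as if it were purely functional, yet it allocates one array instead of two.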
Reducing Pandas memory usage #1: lossless compression
How do you load a large CSV into Pandas without using so much memory? Learn the basic techniques: dropping columns, lower-range numeric dtypes, categoricals, and sparse columns.
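As a rough sketch of three of these techniques (dropping columns, smaller numeric dtypes, and categoricals), the example below loads the same small CSV twice. The column names and values are made up for illustration, and the dtypes chosen assume the values fit in their ranges.

```python
import io

import pandas as pd

# Hypothetical data; a real file would be far larger.
CSV_TEXT = """id,city,temperature,notes
1,Boston,21,ok
2,Boston,23,ok
3,Chicago,19,check sensor
4,Boston,22,ok
"""

# Default load: every column, 64-bit integers, plain strings.
baseline = pd.read_csv(io.StringIO(CSV_TEXT))

# Compact load: skip the unused "notes" column, use narrower integer
# dtypes, and store the repetitive "city" strings as a categorical.
compact = pd.read_csv(
    io.StringIO(CSV_TEXT),
    usecols=["id", "city", "temperature"],
    dtype={"id": "uint16", "city": "category", "temperature": "int8"},
)

print(baseline.memory_usage(deep=True))
print(compact.memory_usage(deep=True))
```

On a four-row file the savings are tiny, and a categorical can even cost more than the strings it replaces; the benefit shows up when there are many rows and few distinct values.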
Reducing Pandas memory usage #2: lossy compression
In this article you’ll learn techniques that lose some details in return for reducing memory usage.
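Two common lossy approaches are reducing floating-point precision and sampling rows; whether the article covers exactly these is an assumption. The sketch below uses a made-up `reading` column to show both, along with the size of the precision loss.

```python
import numpy as np
import pandas as pd

# Hypothetical column of 1,000 float64 measurements between 0 and 1.
df = pd.DataFrame({"reading": np.linspace(0.0, 1.0, 1000)})

# Lossy option 1: halve the memory by storing 32-bit floats.
df32 = df.astype({"reading": "float32"})
max_error = float(
    (df["reading"] - df32["reading"].astype("float64")).abs().max()
)

# Lossy option 2: keep a random 10% sample of the rows.
sample = df.sample(frac=0.1, random_state=0)

print(df["reading"].memory_usage(index=False))    # bytes as float64
print(df32["reading"].memory_usage(index=False))  # bytes as float32
print(max_error, len(sample))
```

Whether either loss is acceptable depends on the analysis: rounding error matters little for a histogram, but sampling can hide rare events.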
Reducing Pandas memory usage #3: Reading in chunks
By loading a file into Pandas and processing it in chunks, you only need to hold part of the file in memory at any given time.
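A minimal sketch of chunked processing with the `chunksize` parameter of `pandas.read_csv` might look like this. The data, the column names, and the tiny chunk size are illustrative assumptions; an in-memory buffer stands in for a large file on disk.

```python
import io
from collections import Counter

import pandas as pd

# Build a small CSV in memory; a real use case would pass a file path.
rows = ["city,sales"] + [
    f"{'Boston' if i % 2 else 'Chicago'},{i}" for i in range(10)
]
csv_file = io.StringIO("\n".join(rows))

totals = Counter()
n_rows = 0
# With chunksize set, read_csv yields DataFrames of at most 4 rows each,
# so only one chunk needs to be in memory at a time.
for chunk in pd.read_csv(csv_file, chunksize=4):
    n_rows += len(chunk)
    for city, subtotal in chunk.groupby("city")["sales"].sum().items():
        totals[city] += int(subtotal)

print(n_rows, dict(totals))
```

The pattern works whenever per-chunk results can be combined afterwards, as with sums and counts here.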
Fast subsets of large datasets with Pandas and SQLite
You have a large amount of data, and you want to load only part into memory as a Pandas dataframe. CSVs won’t cut it: you need a database, and the easiest way to do that is with SQLite.
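The idea can be sketched with the standard library's `sqlite3` module plus `DataFrame.to_sql` and `pandas.read_sql_query`. The table name, column names, and rows below are hypothetical, and an in-memory database stands in for a file on disk.

```python
import sqlite3

import pandas as pd

# Hypothetical data; in practice this would be loaded from a large
# CSV, possibly in chunks, into an on-disk database file.
df = pd.DataFrame(
    {
        "street": ["MAIN ST", "ELM ST", "MAIN ST", "OAK AVE"],
        "name": ["Ana", "Bo", "Chen", "Dee"],
    }
)

conn = sqlite3.connect(":memory:")
df.to_sql("voters", conn, index=False)

# The index lets SQLite find matching rows without scanning the table.
conn.execute("CREATE INDEX idx_voters_street ON voters(street)")

# Load only the matching subset into a DataFrame.
subset = pd.read_sql_query(
    "SELECT * FROM voters WHERE street = ? ORDER BY name",
    conn,
    params=("MAIN ST",),
)
conn.close()

print(subset)
```

Passing the value through `params` rather than formatting it into the SQL string avoids quoting problems and SQL injection.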
From chunking to parallelism: faster Pandas with Dask
Processing your data in chunks lets you reduce memory usage, but it can also speed up your code. Because each chunk can be processed independently, you can process them in parallel, utilizing multiple CPUs. For Pandas (and NumPy), Dask is a great way to do this.
Reducing NumPy memory usage with lossless compression
By changing how you represent your NumPy arrays, you can significantly reduce memory usage: by choosing smaller dtypes, and using sparse arrays. You’ll also learn about cases where this won’t help.
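For the smaller-dtype part, a minimal sketch follows; the array contents are made up, and the range check is one reasonable safeguard rather than a prescribed step. Sparse arrays are left out because they require an additional library.

```python
import numpy as np

# 1,000 values that happen to fit comfortably in 16 bits.
arr64 = np.arange(1000, dtype=np.int64)

# Check the range before shrinking, so the conversion stays lossless.
info = np.iinfo(np.uint16)
assert info.min <= arr64.min() and arr64.max() <= info.max

arr16 = arr64.astype(np.uint16)

print(arr64.nbytes)  # bytes used by the int64 array
print(arr16.nbytes)  # bytes used by the uint16 array
print(np.array_equal(arr64, arr16))
```

Without the range check, `astype` would silently wrap values that do not fit, which is one of the cases where a smaller dtype is not an option.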
Loading NumPy arrays from disk: mmap() vs. Zarr/HDF5
If your NumPy array doesn’t fit in memory, you can load it transparently from disk using either mmap() or the very similar Zarr and HDF5 file formats. Here’s what they do, and why you’d choose one over the other.
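The mmap() side can be sketched with NumPy's `mmap_mode` argument to `np.load`; the file name and array contents are illustrative assumptions, and Zarr and HDF5 are not shown because they need extra libraries.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "big.npy")
    np.save(path, np.arange(1_000_000, dtype=np.float64))

    # mmap_mode="r" maps the file read-only instead of reading it all;
    # the operating system pages data in as it is accessed.
    arr = np.load(path, mmap_mode="r")
    is_memmap = isinstance(arr, np.memmap)

    # Only the pages backing this small slice need to be read from disk.
    window_sum = float(arr[10:20].sum())

    # Release the mapping before the temporary directory is removed.
    del arr

print(is_memmap, window_sum)
```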
Fil: a new Python memory profiler for data scientists and scientists
Fil is a new memory profiler which shows you peak memory usage, and where that memory was allocated. It’s designed specifically for the needs of data scientists and scientists running data processing pipelines.
Python batch process using too much memory? Tired of fighting bad tooling that doesn’t tell you what you want to know? The Fil memory profiler will tell you exactly what you need to know: where your peak memory usage is coming from.
Level up your job skills and become a better data scientist
You’re not a software engineer, but you still have to deal with everything from bugs to slow code to mysterious errors. Writing software that’s maintainable, fast, and easy to understand would make you a better data scientist (not to mention more employable).
Subscribe to my newsletter, and every week you’ll get new articles showing you how to improve your software engineering skills, from testing to packaging to performance.