Debugging out-of-memory crashes in Python

You run your program, and it crashes—it’s out of memory:

  • If you’re lucky, you get a MemoryError exception.
  • If you’re less lucky, you get a core dump.
  • If you’re having a bad day, your computer locks up and you need to restart it.

How do you figure out what is using up all your Python program’s memory?

One way to do that is with the Fil memory profiler, which specifically—albeit experimentally—supports debugging out-of-memory crashes. But before we see how that works, let’s consider why out-of-memory situations are so painful.

The unpleasant experience of running out of memory

If your program is maxing out your computer’s CPU, your program will just run slower—it will eventually finish. Run out of memory, however, and you’re not going to be so lucky. When you’re out of memory, the program first becomes extremely slow, and then it’s going to exit—often badly.

Your operating system will often be configured to write out unused parts of memory to disk—“swapping” to disk—so other programs can get that memory. This is useful when you’re running multiple applications, like a browser and a text editor. If you’re not currently using the browser, the OS can write out its memory to disk, and only load it when you switch back.

Eventually even swapping isn’t enough, and there is no memory left whatsoever: memory allocation fails.

And then—

  1. If the failure reaches Python, the interpreter will try to raise a MemoryError exception. However, raising an exception, printing it out, and handling that error may all require allocating memory too! So this may well not work, leading to an even less informative crash (see the sketch after this list).
  2. If it’s C code that doesn’t handle failed allocations very well, you might end up dereferencing an invalid address and segfaulting.
  3. In some cases the operating system will kill the process; on Linux, this is the job of the OOM killer.
  4. Whatever the lead-up to the crash, your operating system might end up trying to write out a core dump of the crashed process. Writing out a core dump also takes memory. At this point swapping might get so bad that your computer is essentially dead, and you’ll need to manually reboot it.
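
To make point 1 concrete, here is a minimal sketch (not from any real program) of what catching that exception looks like; the allocation size is arbitrary, and the caveat above applies: the except block itself needs a little memory to run.

import numpy as np

try:
    # Deliberately request far more memory than the machine has.
    huge = np.ones((10 ** 13,))
except MemoryError:
    # Even this handler allocates a little memory; if memory is truly
    # exhausted, it may never get the chance to run.
    print("Allocation failed, falling back to a smaller workload")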

In the worst-case scenario, figuring out why your process used so much memory is extremely difficult. And even if you do get a nice MemoryError traceback, it only tells you which straw broke the camel’s back; it won’t tell you where the rest of the load on the camel came from.

When your memory runs out, it’s quite difficult to figure out where your program allocated all the memory up to that point. And that memory is the likely culprit.

Using Fil to debug out-of-memory crashes

To help you debug these situations, the Fil memory profiler includes (experimental) support for dumping out current memory allocations at the time of a crash. Let’s see how this works.

Consider the following Python program:

import numpy as np

ALLOCATIONS = []

def add1(x):
    ALLOCATIONS.append(np.ones((1024 * 1024 * x)))

def add2():
    add1(5)
    add1(2)

def main():
    while True:
        add2()
        add1(3)
        x = np.ones((1024 * 1024,))

main()
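
Each call to add1(x) appends an array of 1024 * 1024 * x float64 values (8x MiB) to ALLOCATIONS and never removes it, so memory grows on every pass through the loop. A quick back-of-the-envelope calculation (a separate sketch, not part of oom.py):

# Memory kept alive by one pass through the loop in oom.py:
# add2() keeps 5 + 2 mebi-elements, add1(3) keeps 3 more; float64 is 8 bytes.
kept = (5 + 2 + 3) * 1024 * 1024 * 8
temporary = 1 * 1024 * 1024 * 8  # the x array, rebound on each iteration
print(kept / 2**20, "MiB kept per iteration")        # 80.0 MiB
print(temporary / 2**20, "MiB of short-lived data")  # 8.0 MiB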

When I run this program, the process is eventually killed, likely by the Linux out-of-memory killer, and no traceback is printed:

$ python oom.py
Killed

Now, in this case the program is simple enough that you can figure out the source of the memory growth just by reading the code, but real programs won’t be so easy. So what you want is a tool to help you debug the situation: a tool like the Fil memory profiler.

Let’s see how you use Fil to debug this.

First, install Fil (Linux and macOS only at the moment), either with pip inside a virtualenv:

$ pip install --upgrade pip
$ pip install filprofiler

Or with Conda:

$ conda install -c conda-forge filprofiler

Next, we want to make memory allocation fail a little bit earlier, before the process is terminated by the Linux OOM killer. We can use the ulimit tool to limit how much memory can be allocated to the process.

We can run free to figure out how much memory is available—in this case about 6.3GB—and then set a corresponding limit on virtual memory with ulimit, which takes the limit in kilobytes:

$ free -h
       total   used   free  shared  buff/cache  available
Mem:   7.7Gi  1.1Gi  6.3Gi    50Mi       334Mi      6.3Gi
Swap:  3.9Gi  3.0Gi  871Mi
$ ulimit -Sv 6300000
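
If you’d rather set the limit from inside Python than from the shell, the standard library’s resource module can impose the same cap; here’s a minimal sketch, reusing the 6.3GB figure from the free output above:

import resource

# Lower the soft limit on this process's virtual address space to ~6.3GB,
# matching the "available" figure reported by free; setrlimit takes bytes.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (6_300_000 * 1024, hard))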

If I run the program directly I now get a MemoryError, but that’s still not enough to know where all the memory allocations came from:

$ python oom.py 
Traceback (most recent call last):
  File "oom.py", line 18, in <module>
    main()
  File "oom.py", line 14, in main
    add2()
  File "oom.py", line 10, in add2
    add1(2)
  File "oom.py", line 6, in add1
    ALLOCATIONS.append(np.ones((1024 * 1024 * x)))
  File "/home/itamarst/Devel/sandbox/oom/venv/lib64/python3.7/site-packages/numpy/core/numeric
.py", line 207, in ones
    a = empty(shape, dtype, order)
MemoryError: Unable to allocate 16.0 MiB for an array with shape (2097152,) and data type float64

That traceback only shows the final, failing 16MiB allocation; it says nothing about the memory already accumulated in ALLOCATIONS. So now I run the program under Fil:

$ fil-profile run oom.py 
...
=fil-profile= Wrote memory usage flamegraph to fil-result/2020-06-15T12:37:13.033/out-of-memory.svg
=fil-profile= Wrote memory usage flamegraph to fil-result/2020-06-15T12:37:13.033/out-of-memory-reversed.svg

Here’s what out-of-memory.svg looks like:

As you can see, this shows exactly where all the memory came from at the time the process ran out of memory. Which means you now have a starting point for reducing that memory usage.

Memory use too high? Try Fil

Fil can help you figure out where your crashing program is allocating its memory. But it can also help you with non-crashing programs, by measuring the peak memory usage of your data processing jobs.

Once you’ve measured memory use and know where it’s coming from, you can start applying a variety of techniques to reduce memory usage.
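
For example, two generic techniques, sketched below with made-up numbers and unrelated to Fil itself, are switching to a smaller dtype and processing data in fixed-size chunks instead of holding everything in memory at once:

import numpy as np

# Technique 1: use a smaller dtype when float64 precision isn't needed;
# this array takes a quarter of the memory of its float64 equivalent.
smaller = np.ones((1024 * 1024,), dtype=np.float32)

# Technique 2: process data in fixed-size chunks instead of all at once.
def chunked_sum(n, chunk=1024 * 1024):
    total = 0.0
    for start in range(0, n, chunk):
        total += np.ones(min(chunk, n - start)).sum()
    return total

print(chunked_sum(10 * 1024 * 1024))  # peak memory is one chunk, not the whole array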


Learn even more techniques for reducing memory usage—read the rest of the Small Big Data guide for Python.