Dying, fast and slow: out-of-memory crashes in Python
A segfaulting program might be the symptom of a bug in C code, or it might be that your process is running out of memory. Crashing is just one symptom of running out of memory. Your process might instead just run very slowly, your computer or VM might freeze, or your process might get silently killed. Sometimes, if you’re lucky, you might even get a nice traceback, but then again, you might not.
So how do you identify out-of-memory problems?
With some understanding of how memory works in your operating system and in Python, you can learn to identify all the different ways out-of-memory problems can manifest:
- A slow death.
- An obvious death.
- A disguised death.
- Death by assassination.
A slow death: Swapping
When your computer’s RAM fills up, the operating system will start moving chunks of memory out of RAM and on to your disk, aka “swapping”. Specifically, it will try to move chunks of memory that aren’t being used. When some code tries to read or write to these chunks they will get loaded back into RAM.
Now, your disk is much slower than RAM, so this can lead to slowness if you’re doing a lot of swapping. At the extreme, your computer will technically still be running but for practical purposes will be completely locked up, as the amount of data the operating system is trying to read and write to disk exceeds the disk’s bandwidth. This is more common on personal computers, where you’re running many different programs that might use and touch a lot of memory: a browser, an IDE, the program you’re testing, and so on.
Even with swapping, if you allocate enough memory you’ll eventually run out of the combined RAM and swap space. At this point allocating more memory is impossible.
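One way to check whether swapping is in play is to compare the SwapTotal and SwapFree fields the kernel reports. The following is a minimal sketch, assuming a Linux system where /proc/meminfo exists; the helper names parse_meminfo, swap_used_kib and read_swap_used_kib are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer.

    The memory fields (MemTotal, SwapTotal, SwapFree, ...) are reported in KiB.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kib(fields):
    """Swap currently in use: total swap minus free swap, in KiB."""
    return fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)


def read_swap_used_kib(path="/proc/meminfo"):
    """Assumption: Linux only. The file does not exist on macOS or Windows."""
    with open(path) as f:
        return swap_used_kib(parse_meminfo(f.read()))
```

If the number returned by read_swap_used_kib() keeps growing while your program runs and everything feels sluggish, swapping is a likely suspect.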
An obvious death: MemoryError tracebacks and other error messages
What happens when you can’t allocate any more memory?
When using Python, this will often result in the interpreter’s memory allocation APIs failing to allocate. At this point, Python will try to raise a MemoryError exception.
>>> import numpy
>>> numpy.ones((1_000_000_000_000,))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python3.9/site-packages/numpy/core/numeric.py", line 192, in ones
a = empty(shape, dtype, order)
MemoryError: Unable to allocate 7.28 TiB for an array with shape (1000000000000,) and data type float64
This will get handled by whatever mechanisms your program uses for unexpected exceptions; with any luck, it’ll end up in the logs or terminal for you to read.
Of course, handling and printing that traceback also uses memory. So this sort of useful, clear traceback is much more likely when you did a large allocation that didn’t fit in memory. The big allocation fails, which means available memory doesn’t change, and hopefully there’s enough available to handle the exception.
If there’s not enough memory to handle the error reporting, my expectation is that you will eventually get some sort of crash, e.g. due to a stack overflow caused by an infinite recursion in which creating each new MemoryError fails due to lack of memory.
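The "one big allocation fails cleanly" case can be sketched in a few lines. This is an illustration only, assuming a 64-bit CPython build; try_allocate is a hypothetical helper, not something from the standard library.

```python
def try_allocate(n_bytes):
    """Return True if a single n_bytes allocation succeeds, False on MemoryError."""
    try:
        buffer = bytearray(n_bytes)  # one large, zero-filled allocation
    except MemoryError:
        # The big allocation failed up front, so memory in use has not grown
        # and there is still room to run this handler.
        return False
    del buffer  # release the memory again straight away
    return True
```

For example, try_allocate(1024) should succeed, while a request such as try_allocate(2**60) asks for more than any current machine can address and should come back False. Avoid calling this with a size that is large but plausible: the allocation really happens and really gets zero-filled, which can push the machine into swapping.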
A disguised death: Segfaults in C code
Under the hood, Python eventually delegates memory allocation to the standard C library APIs malloc(), free(), and related functions.
When you call malloc(), you get back the address of a newly allocated chunk of memory.
And if allocation fails you’ll get back NULL, the address 0.
For example, here we see the first allocation returns the address of the newly allocated memory:
>>> import ctypes
>>> libc = ctypes.CDLL("libc.so.6")
>>> libc.malloc(100)
439108304
The second allocation fails because I asked for far too much memory:
>>> libc.malloc(1_000_000_000_000)
0
Now, well-behaved code will check for a NULL returned from malloc(), and handle the error as best it can.
It might exit with an error, or, as we saw, the Python interpreter raises a MemoryError exception.
Buggy code will assume malloc() always returns successfully, and treat the 0 returned by malloc() as a valid memory address.
Trying to write to address 0 will then result in a crash.
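To make the "well-behaved" path concrete, here is a minimal sketch of a checked malloc() call from Python via ctypes. It assumes a 64-bit Linux or macOS system; checked_malloc is a hypothetical wrapper. Note that ctypes treats return values as C int unless told otherwise, so the sketch declares the argument and return types explicitly to get full-width pointers.

```python
import ctypes
import ctypes.util

# Load the C library. find_library() may return None on minimal systems, in
# which case CDLL(None) exposes the symbols already loaded into the process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the real signatures so sizes and pointers are not truncated to int.
libc.malloc.argtypes = [ctypes.c_size_t]
libc.malloc.restype = ctypes.c_void_p
libc.free.argtypes = [ctypes.c_void_p]
libc.free.restype = None


def checked_malloc(size):
    """Call malloc() and turn a NULL result into a MemoryError."""
    address = libc.malloc(size)
    if address is None:  # with a c_void_p return type, ctypes maps NULL to None
        raise MemoryError(f"malloc({size}) failed")
    return address
```

A caller would pair each successful checked_malloc(n) with libc.free(address). The point is only that the NULL check happens before anyone writes through the pointer.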
Consider the following C program:
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    char *p;
    size_t allocation_size = strtoll(argv[1], &p, 10);
    char *data = malloc(allocation_size);
    /* Uh oh, didn't check for NULL return result. */
    data[0] = 'O';
    data[1] = 'K';
    data[2] = '\n';
    data[3] = 0;
    printf("%s", data);
}
It allocates some memory based on the first argument, and then writes to it—but it doesn’t check for a failed allocation.
If I run it with a small, successful allocation, everything works fine:
$ ./naive-malloc 1000
OK
But if the allocation is too big, the program crashes because it doesn’t have any error handling code.
$ ./naive-malloc 10000000000000000
Segmentation fault (core dumped)
Of course, segfaults happen for other reasons as well, so to figure out the cause you’ll need to inspect the core file with a debugger like gdb, or run the program under the Fil memory profiler.
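When the segfault happens inside a Python process, the standard library's faulthandler module can at least show which Python line was running at the time. It is not mentioned above and is no replacement for gdb or Fil; the following is only a sketch, assuming a POSIX system, that deliberately crashes a child interpreter by writing to address 0. The names CRASHING_CODE, run_crashing_child and died_from_segfault are invented for this example.

```python
import signal
import subprocess
import sys

# Child program: write one byte to address 0 through ctypes, which crashes
# the interpreter in the same way as the C program above.
CRASHING_CODE = "import ctypes; ctypes.memset(0, 0, 1)"


def run_crashing_child():
    """Run the crashing code in a child Python with faulthandler enabled."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", CRASHING_CODE],
        capture_output=True,
        text=True,
    )


def died_from_segfault(result):
    """On POSIX, a negative return code is the number of the fatal signal."""
    return result.returncode == -signal.SIGSEGV
```

The child's stderr then contains a "Fatal Python error: Segmentation fault" message followed by the Python stack, which narrows down where to look before reaching for a debugger.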
Death by assassination: The out-of-memory killer
On Linux and macOS, there is another way your process might die: the operating system can decide your process is using too much memory, and kill it preemptively.
The symptom will be your program getting killed with SIGKILL (kill -9), with a corresponding exit code.
- On Linux, you can see OOM killer logs in dmesg, and in either /var/log/messages or /var/log/kern.log, depending on your distribution. Notifications are available via cgroups v1 or v2 (which one applies depends on your distribution and its version).
- I am not sure where to find logs on macOS, but macOS has its own out-of-memory handling and notifications.
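If the process is launched from Python, the SIGKILL symptom can be recognized from the return code. This is a minimal sketch, assuming a POSIX system; was_sigkilled and shell_status_is_sigkill are hypothetical helper names. A SIGKILL does not prove the OOM killer was responsible, so treat a match as a prompt to go and check the system logs.

```python
import signal


def was_sigkilled(returncode):
    """True if a subprocess return code means the child was ended by SIGKILL.

    The subprocess module reports death-by-signal as the negative signal number.
    """
    return returncode == -signal.SIGKILL


def shell_status_is_sigkill(status):
    """Shells report the same event as 128 + signal number, i.e. 137."""
    return status == 128 + signal.SIGKILL
```

For example, after result = subprocess.run(command), a true was_sigkilled(result.returncode) is a reasonable trigger for logging "possibly killed for using too much memory" alongside the command that was run.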
Debugging and preventing out-of-memory issues
Out-of-memory conditions can result in a variety of failure modes, from slowness to crashes, and the relevant information might end up in stderr, the application-level logging, system-level logging, or implicit in a core dump file.
This makes debugging the cause of the problem rather tricky.
There are some ways to improve the situation, however.
- For production server workloads, setting memory limits and limiting swap will at least keep memory leaks from freezing things up—eventually the leaky process will get killed and restarted.
- For production batch data processing, you can model expected memory usage based on input size and then ensure you’re running with enough memory upfront.
- If you’re running processes manually, you can use the open source Fil memory profiler to automatically catch and debug Python out-of-memory problems.
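Memory limits for servers are usually set by whatever runs the service, for example a container runtime or a service manager. As a small process-level stand-in for experimenting, the sketch below caps a child process's address space with resource.setrlimit. It assumes Linux, and run_with_memory_limit is a hypothetical helper. Note that RLIMIT_AS limits virtual address space rather than resident memory, so it is stricter than, and not equivalent to, a cgroup memory limit.

```python
import resource
import subprocess
import sys


def run_with_memory_limit(args, max_bytes):
    """Run a command whose address space is capped at max_bytes (Linux)."""

    def set_limit():
        # Runs in the child just before exec, so the parent is unaffected.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    return subprocess.run(
        args, preexec_fn=set_limit, capture_output=True, text=True
    )


if __name__ == "__main__":
    # A child limited to 1 GiB tries to allocate 2 GiB in one go.
    result = run_with_memory_limit(
        [sys.executable, "-c", "bytearray(2 * 1024**3)"], 1024**3
    )
    print(result.returncode)
    print(result.stderr)
```

With the limit in place, the oversized allocation fails quickly inside the child and is reported as an ordinary MemoryError traceback, instead of dragging the whole machine into swap.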