The mmap() copy-on-write trick: reducing memory usage of array copies
Let’s say you have an array, and you need to make some copies and modify those copies. Usually, memory usage scales with the number of copies: if your original array was 1GB of RAM, each copy will take 1GB of RAM. And that can add up.
But often, you’re just changing a small part of the array. Ideally, the memory cost would only be the parts of the copies that you changed.
As it turns out, there is an operating system facility that enables this: mmap()’s copy-on-write functionality.
In this article you will learn:
- How normal memory copies work.
- How to use mmap() copy-on-write with NumPy.
- How the underlying mmap() copy-on-write mechanism works, and why it can be more efficient.
The problem with copying
If you want to modify a copy of an array, the normal approach is to allocate more memory and copy the contents of the original array into the new chunk of memory. For example:
    >>> import numpy, psutil
    >>> def memory_usage():
    ...     current_process = psutil.Process()
    ...     memory = current_process.memory_info().rss
    ...     print(int(memory / (1024 * 1024)), "MB")
    ...
    >>> array1 = numpy.ones((1024, 1024, 50))
    >>> memory_usage()
    428 MB
    >>> array2 = array1.copy()
    >>> memory_usage()
    827 MB
In visual form, the allocated memory looks like this:
Pages are the unit of memory management for the operating system. Here each page is a 4KB chunk, which is a common size, though some platforms use larger pages.
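To make the page arithmetic concrete, here is a small sketch that asks the operating system for its page size and counts how many pages the example array spans. The helper name pages_spanned is made up for illustration; mmap.PAGESIZE comes from the standard library. The printed numbers depend on your machine, since the page size is not 4KB everywhere.

```python
import mmap


def pages_spanned(nbytes, page_size=mmap.PAGESIZE):
    """Hypothetical helper: pages needed to hold nbytes, rounding up."""
    return -(-nbytes // page_size)


# The example array: 1024 x 1024 x 50 float64 values, 8 bytes each.
array_bytes = 1024 * 1024 * 50 * 8

print("page size:", mmap.PAGESIZE, "bytes")
print("array size:", array_bytes // (1024 * 1024), "MiB")
print("pages spanned:", pages_spanned(array_bytes))
```

With 4KB pages that is 102,400 pages per array, and a regular copy duplicates every one of them.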
Saving memory with copy-on-write
In an ideal world, that second array would only store the differences from the first array: insofar as differences are few, the additional memory usage would be small.
And that’s where mmap()’s copy-on-write functionality comes in (or the equivalent API on Windows; NumPy wraps them both).
If you’re not familiar with mmap(), see my overview comparing mmap() with HDF5 and Zarr.
To use mmap() in this mode, we need a backing file.
While there is a file involved, so long as there’s enough memory available the file is almost an implementation detail; it needs to be there but it won’t impact performance much.
Note: On Linux you can go one step further and create an in-memory file using the memfd_create API, which can be used in Python 3.8 and later by doing os.fdopen(os.memfd_create("mymemfile"), "rb+") and then “truncating” the file to be the appropriate size.
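As a rough sketch of that note, the snippet below creates an in-memory file, grows it with truncate(), and maps it twice. It uses numpy.memmap rather than open_memmap because numpy.memmap accepts an already-open file object. The array is kept tiny on purpose, and the fallback to an ordinary temporary file on systems without memfd_create is my own assumption, not part of the technique.

```python
import os
import tempfile

import numpy

SHAPE = (4, 1024)  # small on purpose; the article's array is (1024, 1024, 50)
NBYTES = int(numpy.prod(SHAPE)) * numpy.dtype(numpy.float64).itemsize

if hasattr(os, "memfd_create"):
    # Linux, Python 3.8+: an anonymous file that lives only in RAM.
    backing = os.fdopen(os.memfd_create("mymemfile"), "rb+")
else:
    # Assumption for other platforms: fall back to a regular temporary file.
    backing = tempfile.TemporaryFile()

# "Truncating" here grows the empty file to the size the array needs.
backing.truncate(NBYTES)

# Shared, writable mapping: this plays the role of the original array.
original = numpy.memmap(backing, dtype=numpy.float64, mode="r+", shape=SHAPE)
original[:] = 1
original.flush()

# Copy-on-write mapping of the same file.
cow_copy = numpy.memmap(backing, dtype=numpy.float64, mode="c", shape=SHAPE)
cow_copy[0, 0] = 30

print(original[0, 0], cow_copy[0, 0])
```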
The numpy.lib.format.open_memmap() function will create a file of the appropriate size; we’ll start by creating our initial array:
    >>> del array1, array2
    >>> memory_usage()
    20 MB
    >>> open_memmap = numpy.lib.format.open_memmap
    >>> mmap_array1 = open_memmap("/tmp/myarray", mode="w+", shape=(1024, 1024, 50))
    >>> memory_usage()
    22 MB
    >>> mmap_array1[:] = 1
    >>> mmap_array1[0, 0, 0] = 10
    >>> memory_usage()
    422 MB
Initially the array is just zeroes (at least on Linux and macOS; Windows may differ), so the operating system is clever enough not to allocate any new memory. Once we set some values, memory usage goes up accordingly.
Next, let’s create a copy: we’ll mmap() the same file with mode="c", which means copy-on-write.
On Unix systems like Linux or macOS, this translates to the MAP_PRIVATE flag to mmap().
    >>> mmap_array2 = open_memmap("/tmp/myarray", mode="c", shape=(1024, 1024, 50))
    >>> mmap_array2[0, 0, 0]
    10.0
    >>> mmap_array2[10, 0, 1]
    1.0
    >>> memory_usage()
    422 MB
We now have another copy of the array, with the same contents… but memory usage hasn’t changed!
Now let’s modify that second array, and we’ll see how memory usage goes up, but the original array is unchanged.
    >>> mmap_array2[1:100] = 30
    >>> memory_usage()
    461 MB
    >>> mmap_array1[1, 0, 0]
    1.0
We have successfully made a copy of an array that:
- Doesn’t change the original array when mutated.
- Only stores those parts of the copy that have changed from the original, allowing us to save memory.
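If you need several such copies, the same idea extends naturally. The sketch below is one possible way to do it, assuming the original array has been saved as a .npy file: numpy.load() with mmap_mode="c" opens a copy-on-write view of that file. The helper name make_cow_copies and the file name base.npy are hypothetical, and the array is deliberately tiny.

```python
import os
import tempfile

import numpy


def make_cow_copies(path, count):
    """Hypothetical helper: open `count` independent copy-on-write views."""
    return [numpy.load(path, mmap_mode="c") for _ in range(count)]


# Save a small stand-in for the original array.
path = os.path.join(tempfile.mkdtemp(), "base.npy")
numpy.save(path, numpy.ones((8, 1024)))

copies = make_cow_copies(path, 3)
for i, cow in enumerate(copies):
    cow[i, :] = 100 + i  # each copy changes a different row

# Re-read the file to confirm none of the writes reached it.
on_disk = numpy.load(path)
print([float(c.sum()) for c in copies])
print(float(on_disk.sum()))
```

Each copy only pays for the pages holding the row it changed; everything else is still shared with the file.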
How copy-on-write works
When you mmap() a file with the MAP_PRIVATE flag, here’s what happens per the manpage:
    MAP_PRIVATE
        Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
Notice that changes made to the file after the mapping is created may or may not be visible in the copy; that behavior is unspecified. As a result, it’s best not to modify the original array.
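You can see the same behavior without NumPy. This minimal sketch uses Python’s standard mmap module, where access=mmap.ACCESS_COPY requests a private copy-on-write mapping (MAP_PRIVATE on Unix). The one-page temporary file is just an illustration. Note that it only writes to the private mapping, never to the file afterwards, to stay clear of the unspecified case.

```python
import mmap
import tempfile

# A small backing file: one page of zero bytes.
backing = tempfile.TemporaryFile()
backing.write(b"\0" * mmap.PAGESIZE)
backing.flush()

# ACCESS_WRITE gives a shared mapping: writes would go through to the file.
shared = mmap.mmap(backing.fileno(), mmap.PAGESIZE, access=mmap.ACCESS_WRITE)
# ACCESS_COPY gives a private copy-on-write mapping.
private = mmap.mmap(backing.fileno(), mmap.PAGESIZE, access=mmap.ACCESS_COPY)

# Writing to the private mapping changes only this mapping's copy of the page.
private[0:5] = b"hello"

backing.seek(0)
file_bytes = backing.read(5)

print(bytes(private[0:5]))  # the private mapping sees the write
print(bytes(shared[0:5]))   # the shared mapping still sees the original bytes
print(file_bytes)           # and so does the file itself
```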
Returning to our goal, we are saving memory by using copy-on-write. That means pages in the second array point to the first array until some change is made to them. Only when you write to the page does a copy get made and the writes applied.
We created the second array with MAP_PRIVATE (by using mode="c"), and memory looked like this:
That is, we had another array, but no extra memory was used.
Then, we made some changes to part of the second array. Those pages that were modified get copied and then modified; the rest still point to the original array. For example, if we modified some data in the first 4096 bytes of the array’s in-memory representation, a new page would be allocated that starts as a copy of the corresponding page in the first array.
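As a back-of-the-envelope check, we can estimate how many pages the earlier mmap_array2[1:100] = 30 assignment had to copy. This is only a sketch under stated assumptions: float64 values (8 bytes each), 4096-byte pages, and ignoring the small header at the start of the .npy file. The function name pages_touched is made up for illustration.

```python
def pages_touched(start_byte, stop_byte, page_size=4096):
    """Count pages overlapping the byte range [start_byte, stop_byte)."""
    if stop_byte <= start_byte:
        return 0
    first_page = start_byte // page_size
    last_page = (stop_byte - 1) // page_size
    return last_page - first_page + 1


# Assumptions: float64 data, 4096-byte pages, .npy header offset ignored.
bytes_per_outer_index = 1024 * 50 * 8  # one slice along the first axis

# mmap_array2[1:100] = 30 writes outer indices 1 through 99.
start = 1 * bytes_per_outer_index
stop = 100 * bytes_per_outer_index
copied_pages = pages_touched(start, stop)
copied_mib = copied_pages * 4096 / (1024 * 1024)

print(copied_pages, "pages,", round(copied_mib, 1), "MiB")
```

That works out to 9900 pages, or about 38.7 MiB, which lines up with the jump from 422 MB to 461 MB we measured.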
Note: Whether or not any particular tool or technique will help depends on where the actual memory bottlenecks are in your software.
Paying only for what you change
The mmap() copy-on-write trick is useful when:
- You have a very large array.
- You are making copies and only partially modifying those copies.
In this situation, copy-on-write saves memory by only allocating memory for data that has actually changed. Just make sure not to modify the original array: whether the copies see such changes is unspecified and depends on your operating system.
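One way to guard against accidentally modifying the original, offered here as a suggestion rather than a requirement of the technique, is to open the original as a read-only mapping and keep mode "c" for the copies you intend to change. The file name base.npy is hypothetical and the array is tiny for illustration.

```python
import os
import tempfile

import numpy

# Save a small stand-in for the original array.
path = os.path.join(tempfile.mkdtemp(), "base.npy")
numpy.save(path, numpy.ones((8, 1024)))

# Read-only mapping for the original: accidental writes fail loudly.
base = numpy.load(path, mmap_mode="r")
# Copy-on-write mapping for the version we intend to modify.
working = numpy.load(path, mmap_mode="c")

try:
    base[0, 0] = 5
    write_blocked = False
except ValueError:
    # NumPy refuses to assign into a read-only array.
    write_blocked = True

working[0, 0] = 5
print(write_blocked, base[0, 0], working[0, 0])
```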
For other data structures, like dictionaries or lists, you can use immutable data structures to reduce memory usage of mostly-similar copies; in Python the pyrsistent library is one implementation.