The mmap() copy-on-write trick: reducing memory usage of array copies

Let’s say you have an array, and you need to make some copies and modify those copies. Usually, memory usage scales with the number of copies: if your original array was 1GB of RAM, each copy will take 1GB of RAM. And that can add up.

But often, you’re just changing a small part of the array. Ideally, the memory cost would only be the parts of the copies that you changed.

As it turns out, there is an operating system facility that enables this: mmap()’s copy-on-write functionality.

In this article you will learn:

  1. How normal memory copies work.
  2. How to use mmap() copy-on-write with NumPy.
  3. How the underlying mmap() copy-on-write mechanism works, and why it can be more efficient.

The problem with copying

If you want to modify a copy of an array, the normal approach is to allocate more memory and copy the contents of the original array into the new chunk of memory. For example:

>>> import numpy, psutil
>>> def memory_usage():
...     current_process = psutil.Process()
...     memory = current_process.memory_info().rss
...     print(int(memory / (1024 * 1024)), "MB")
>>> array1 = numpy.ones((1024, 1024, 50))
>>> memory_usage()
428 MB
>>> array2 = array1.copy()
>>> memory_usage()
827 MB

In visual form, the allocated memory looks like this:

[Diagram: Array 1 and Array 2 each have their own separate set of pages (Page 1, Page 2, Page 3, ...).]

Pages are the chunks of memory that the operating system manages as a unit. They are typically 4KB, though some systems use larger pages (16KB on Apple Silicon Macs, for example).
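To put numbers on this, here is a small sketch that works out how many bytes and pages the example array occupies. It assumes NumPy's default float64 dtype (8 bytes per element), and it reads the page size from Python's standard mmap module rather than hard-coding it.

```python
import mmap

# Shape of the example array; numpy.ones() defaults to float64 (8 bytes each).
shape = (1024, 1024, 50)
bytes_per_element = 8

array_bytes = bytes_per_element
for dimension in shape:
    array_bytes *= dimension

array_mib = array_bytes // (1024 * 1024)

# mmap.PAGESIZE is the page size of the machine running this code.
page_size = mmap.PAGESIZE
pages_per_array = -(-array_bytes // page_size)  # ceiling division

print(array_mib, "MB per array")
print(pages_per_array, "pages of", page_size, "bytes")
```

The 400MB per array is consistent with the jump between the two memory_usage() calls above: a plain copy duplicates every one of those pages, whether or not you later change it.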

Saving memory with copy-on-write

In an ideal world, that second array would only store the differences from the first array: insofar as differences are few, the additional memory usage would be small. And that’s where mmap()’s copy-on-write functionality comes in (or the equivalent API on Windows; NumPy wraps them both).

If you’re not familiar with mmap(), see my overview comparing mmap() with HDF5 and Zarr.

To use mmap() in this mode, we need a backing file. While there is a file involved, so long as there’s enough memory available the file is almost an implementation detail; it needs to be there but it won’t impact performance much.

Note: On Linux you can go one step further and create an in-memory file using the memfd_create API, which can be used in Python 3.8 and later by doing os.fdopen(os.memfd_create("mymemfile"), "rb+") and then “truncating” the file to be the appropriate size.
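As a rough sketch of that note, the helper below creates a backing file of a given size. The make_backing_file() name and the fallback to an ordinary temporary file on platforms without memfd_create are illustrative assumptions, not something NumPy provides. The standard library's mmap module is used here only to check that the file can be mapped.

```python
import mmap
import os
import tempfile


def make_backing_file(size_in_bytes):
    """Return an open read/write binary file of the requested size."""
    if hasattr(os, "memfd_create"):
        # Linux with Python 3.8+: an anonymous file that lives only in memory.
        backing_file = os.fdopen(os.memfd_create("mymemfile"), "rb+")
    else:
        # Assumed fallback for other platforms: an ordinary temporary file.
        backing_file = tempfile.TemporaryFile()
    # "Truncating" to a larger size extends the file with zero bytes.
    backing_file.truncate(size_in_bytes)
    return backing_file


size = 16 * mmap.PAGESIZE
backing_file = make_backing_file(size)
file_size = os.fstat(backing_file.fileno()).st_size

# Map the whole file (length 0) as an ordinary shared, writable mapping.
mapping = mmap.mmap(backing_file.fileno(), 0)
first_byte = mapping[0]
mapping.close()
backing_file.close()

print(file_size, first_byte)
```

numpy.memmap() is documented as accepting a file-like object in place of a filename, so a file created this way should be usable there; numpy.lib.format.open_memmap() expects a path instead.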

The numpy.lib.format.open_memmap() function will open a file of the appropriate size; we’ll start by creating our initial array:

>>> del array1, array2
>>> memory_usage()
20 MB
>>> open_memmap = numpy.lib.format.open_memmap
>>> mmap_array1 = open_memmap("/tmp/myarray", mode="w+", shape=(1024, 1024, 50))
>>> memory_usage()
22 MB
>>> mmap_array1[:] = 1
>>> mmap_array1[0] = 10
>>> memory_usage()
422 MB

Initially the array is just zeroes (at least on Linux and macOS; Windows may differ), so the operating system is clever enough not to allocate any new memory. Once we set some values, memory usage goes up accordingly.

Next, let’s create a copy: we’ll mmap() the same file with mode="c", which means copy-on-write. On Unix systems like Linux or macOS, this translates to the MAP_PRIVATE flag to the mmap() API.

>>> mmap_array2 = open_memmap("/tmp/myarray", mode="c", shape=(1024, 1024, 50))
>>> mmap_array2[0, 0, 0]
10.0
>>> mmap_array2[10, 0, 1]
1.0
>>> memory_usage()
422 MB

We now have another copy of the array, with the same contents… but memory usage hasn’t changed!

Now let’s modify that second array, and we’ll see how memory usage goes up, but the original array is unchanged.

>>> mmap_array2[1:100] = 30
>>> memory_usage()
461 MB
>>> mmap_array1[1, 0, 0]
1.0

We have successfully made a copy of an array that:

  1. Doesn’t change the original array when mutated.
  2. Only stores those parts of the copy that have changed from the original, allowing us to save memory.
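If you want to check both properties yourself without using 400MB of RAM, here is a scaled-down sketch of the same steps as a standalone script. The temporary directory, the file name, and the (64, 1024) shape are arbitrary choices for illustration.

```python
import os
import tempfile

import numpy
from numpy.lib.format import open_memmap

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "myarray.npy")

    # Create the original array, backed by a file.
    original = open_memmap(path, mode="w+", dtype=numpy.float64, shape=(64, 1024))
    original[:] = 1
    original.flush()

    # mode="c" gives a copy-on-write view of the same file.
    cow_copy = open_memmap(path, mode="c")
    cow_copy[0] = 30

    original_first = float(original[0, 0])
    copy_first = float(cow_copy[0, 0])
    copy_untouched = float(cow_copy[1, 0])

    # Reading the file back shows the write never reached it.
    file_first = float(numpy.load(path)[0, 0])

    # Release the mappings before the temporary directory is removed.
    del original, cow_copy

print(original_first, copy_first, copy_untouched, file_first)
```

Only the copy sees the value 30; the original array and the file on disk still hold 1.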

How copy-on-write works

When we mmap() a file with the MAP_PRIVATE flag, here’s what happens per the manpage:

Create a private copy-on-write mapping.  Updates to the
mapping are not visible to other processes mapping the same
file, and are not carried through to the underlying file. It
is unspecified whether changes made to the file after the
mmap() call are visible in the mapped region.

Notice that changes made to the file after the mapping is created may or may not be visible in the mapping; that behavior is unspecified. As a result, it’s best not to modify the original array.

Returning to our goal, we are saving memory by using copy-on-write. That means pages in the second array point to the first array until some change is made to them. Only when you write to the page does a copy get made and the writes applied.
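The same mechanism can be seen without NumPy. The sketch below uses Python's standard mmap module, where access=mmap.ACCESS_COPY requests a copy-on-write mapping (on Unix this corresponds to MAP_PRIVATE). The four-page temporary file is an arbitrary choice for illustration.

```python
import mmap
import tempfile

page_size = mmap.PAGESIZE

with tempfile.TemporaryFile() as backing_file:
    # Four pages, each filled with the byte b"a".
    backing_file.write(b"a" * (4 * page_size))
    backing_file.flush()

    # ACCESS_COPY asks for a private copy-on-write mapping of the whole file.
    private = mmap.mmap(backing_file.fileno(), 0, access=mmap.ACCESS_COPY)

    # This write lands in the mapping's private memory, not in the file.
    private[0:1] = b"z"

    mapped_start = bytes(private[:2])
    backing_file.seek(0)
    file_start = backing_file.read(2)

    private.close()

print(mapped_start, file_start)
```

The mapping now starts with b"za" while the file still starts with b"aa": the write went to a private copy and was never carried through to the underlying file.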

Initially we mmap()ed /tmp/myarray with MAP_PRIVATE (by using mode="c"), and memory looked like this:

[Diagram: every page of Array 2 (Page 1, Page 2, Page 3, ...) points to the corresponding page of Array 1.]

That is, we had another array, but no extra memory was used.

Then, we made some changes to part of the second array. The pages that were modified were first copied and then changed; the rest still point to the original array. For example, if we modified some data in the first 4096 bytes of the array’s in-memory representation, a new page would be allocated that is a copy of the one in the first array:

[Diagram: Array 2’s Page 1 is now its own separate copy; Page 2, Page 3, and the rest still point to the corresponding pages of Array 1.]
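Because copying happens a whole page at a time, you can estimate the cost of a write by counting the pages it overlaps. The sketch below does that arithmetic; the pages_touched() helper is hypothetical, and the 4096-byte page size and 128-byte .npy header are assumptions that may differ on your system.

```python
def pages_touched(start_byte, length, page_size=4096):
    """Count the pages overlapped by `length` bytes starting at `start_byte`."""
    if length <= 0:
        return 0
    first_page = start_byte // page_size
    last_page = (start_byte + length - 1) // page_size
    return last_page - first_page + 1


# One float64 written at the very start of the mapping dirties a single page.
single_write = pages_touched(0, 8)

# mmap_array2[1:100] = 30 overwrites 99 slices of 1024 * 50 float64 values.
slice_bytes = 1024 * 50 * 8
header_bytes = 128  # assumed size of the .npy header before the array data
bulk_write = pages_touched(header_bytes + 1 * slice_bytes, 99 * slice_bytes)
bulk_mib = bulk_write * 4096 / (1024 * 1024)

print(single_write, bulk_write, round(bulk_mib, 1))
```

Under those assumptions the slice assignment touches 9901 pages, about 38.7MB, which is in line with the increase from 422 MB to 461 MB measured earlier.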

Note: Whether or not any particular tool or technique will help depends on where the actual memory bottlenecks are in your software.

Paying only for what you change

The mmap() copy-on-write trick is useful when:

  1. You have a very large array.
  2. You are making copies and only partially modifying those copies.

In this situation, copy-on-write saves memory by only allocating memory for data that has actually changed. Just make sure not to modify the original array: whether the copies see those changes is unspecified, so the results can vary by operating system.

For other data structures, like dictionaries or lists, you can use immutable data structures to reduce the memory usage of mostly-similar copies; in Python, the pyrsistent library is one implementation.

Learn even more techniques for reducing memory usage—read the rest of the Larger-than-memory datasets guide for Python.