The mmap() copy-on-write trick: reducing memory usage of array copies

by Itamar Turner-Trauring
Last updated 12 Jan 2023, originally created 17 Sep 2020

Let’s say you have an array, and you need to make some copies and modify those copies. Usually, memory usage scales with the number of copies: if your original array was 1GB of RAM, each copy will take 1GB of RAM. And that can add up.

But often, you’re just changing a small part of the array. Ideally, the memory cost would only be the parts of the copies that you changed.

As it turns out, there is an operating system facility that enables this: mmap()’s copy-on-write functionality.

In this article you will learn:

How normal memory copies work.
How to use mmap() copy-on-write with NumPy.
How the underlying mmap() copy-on-write mechanism works, and why it can be more efficient.

The problem with copying

If you want to modify a copy of an array, the normal approach is to allocate more memory and copy the contents of the original array into the new chunk of memory. For example:

>>> import numpy, psutil
>>> def memory_usage():
...     current_process = psutil.Process()
...     memory = current_process.memory_info().rss
...     print(int(memory / (1024 * 1024)), "MB")
...
>>> array1 = numpy.ones((1024, 1024, 50))
>>> memory_usage()
428 MB
>>> array2 = array1.copy()
>>> memory_usage()
827 MB

Each array now occupies its own set of memory pages, and the copy shares nothing with the original.

Pages are chunks of memory, typically 4KB, that are the unit of memory management for the operating system.

Saving memory with copy-on-write

In an ideal world, that second array would only store the differences from the first array: insofar as differences are few, the additional memory usage would be small. And that’s where mmap()’s copy-on-write functionality comes in (or the equivalent API on Windows; NumPy wraps them both).

If you’re not familiar with mmap(), see my overview comparing mmap() with HDF5 and Zarr.

To use mmap() in this mode, we need a backing file. While there is a file involved, so long as there’s enough memory available the file is almost an implementation detail; it needs to be there but it won’t impact performance much.

Note: On Linux you can go one step further and create an in-memory file using the memfd_create API, which can be used in Python 3.8 and later by doing os.fdopen(os.memfd_create("mymemfile"), "rb+") and then “truncating” the file to be the appropriate size.
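As a rough illustration of the in-memory file idea, here is a minimal sketch that uses only the standard library rather than NumPy. It assumes Linux and Python 3.8 or later; the function name memfd_cow_demo and the 1 MiB size are made up for this example, and the sketch works with the raw file descriptor instead of wrapping it with os.fdopen().

```python
import mmap
import os


def memfd_cow_demo(size=1024 * 1024):
    """Map an in-memory file twice: once shared, once copy-on-write.

    Returns (original_bytes, copy_bytes), or None where os.memfd_create
    is unavailable (it is Linux-only).
    """
    if not hasattr(os, "memfd_create"):
        return None
    fd = os.memfd_create("mymemfile")
    try:
        # "Truncate" the empty file up to the size we need.
        os.ftruncate(fd, size)
        # ACCESS_WRITE is a shared mapping: writes go through to the file.
        original = mmap.mmap(fd, size, access=mmap.ACCESS_WRITE)
        original[:5] = b"hello"
        # ACCESS_COPY is a private, copy-on-write mapping (MAP_PRIVATE on Unix).
        copy = mmap.mmap(fd, size, access=mmap.ACCESS_COPY)
        copy[:5] = b"HELLO"  # lands in a private page, not in the file
        result = (original[:5], copy[:5])
        copy.close()
        original.close()
        return result
    finally:
        os.close(fd)


print(memfd_cow_demo())
```

On a Linux machine this should show the original bytes untouched by the write to the copy; on other platforms the function simply returns None.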

The numpy.lib.format.open_memmap() function will open a file of the appropriate size; we’ll start by creating our initial array:

>>> del array1, array2
>>> memory_usage()
20 MB
>>> open_memmap = numpy.lib.format.open_memmap
>>> mmap_array1 = open_memmap("/tmp/myarray", mode="w+", shape=(1024, 1024, 50))
>>> memory_usage()
22 MB
>>> mmap_array1[:] = 1
>>> mmap_array1[0] = 10
>>> memory_usage()
422 MB

Initially the array is just zeroes (at least on Linux and macOS; Windows may differ), so the operating system is clever enough not to allocate any new memory. Once we set some values, memory usage goes up accordingly.

Next, let’s create a copy: we’ll mmap() the same file with mode="c", which means copy-on-write. On Unix systems like Linux or macOS, this translates to the MAP_PRIVATE flag to the mmap() API.

>>> mmap_array2 = open_memmap("/tmp/myarray", mode="c", shape=(1024, 1024, 50))
>>> mmap_array2[0, 0, 0]
10.0
>>> mmap_array2[10, 0, 1]
1.0
>>> memory_usage()
422 MB

We now have another copy of the array, with the same contents… but memory usage hasn’t changed!

Now let’s modify that second array, and we’ll see how memory usage goes up, but the original array is unchanged.

>>> mmap_array2[1:100] = 30
>>> memory_usage()
461 MB
>>> mmap_array1[1, 0, 0]
1.0

We have successfully made a copy of an array that:

Doesn’t change the original array when mutated.
Only stores those parts of the copy that have changed from the original, allowing us to save memory.
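The roughly 39 MB increase seen above can be sanity-checked with some arithmetic. This is a back-of-the-envelope sketch that assumes float64 values (NumPy's default, 8 bytes each) and 4KB pages; the measured number will differ slightly because resident memory also includes unrelated allocations.

```python
ITEM_SIZE = 8            # bytes per float64 value (assumed dtype)
SLAB_SHAPE = (1024, 50)  # shape of one mmap_array2[i] slab
SLABS_MODIFIED = 99      # the slice 1:100 covers indices 1 through 99
PAGE_SIZE = 4096         # assumed page size in bytes

modified_bytes = SLABS_MODIFIED * SLAB_SHAPE[0] * SLAB_SHAPE[1] * ITEM_SIZE
modified_mib = modified_bytes / (1024 * 1024)
pages_copied = modified_bytes // PAGE_SIZE

print(modified_bytes)          # 40550400
print(round(modified_mib, 1))  # 38.7
print(pages_copied)            # 9900
```

So only about 9,900 pages needed private copies, compared to the roughly 100,000 pages a full copy of the 400 MB array would require.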

How copy-on-write works

When we mmap() a file with the MAP_PRIVATE flag, here’s what happens per the manpage:

MAP_PRIVATE
Create a private copy-on-write mapping.  Updates to the
mapping are not visible to other processes mapping the same
file, and are not carried through to the underlying file. It
is unspecified whether changes made to the file after the
mmap() call are visible in the mapped region.

Notice that changes made to the file after the mmap() call may or may not be visible in the mapping; that behavior is unspecified. As a result, it’s best not to modify the original array.

Returning to our goal, we are saving memory by using copy-on-write. That means pages in the second array point to the first array until some change is made to them. Only when you write to the page does a copy get made and the writes applied.
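The same behavior can be observed without NumPy. The following is a minimal sketch using Python's standard mmap module, where access=mmap.ACCESS_COPY requests a private copy-on-write mapping; the function name cow_demo and the four-page file are choices made for this illustration only.

```python
import mmap
import os
import tempfile


def cow_demo():
    """Write to a copy-on-write mapping and check the file is unchanged."""
    page = mmap.PAGESIZE
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "backing_file")
        # Create a backing file four pages long, filled with the byte 1.
        with open(path, "wb") as f:
            f.write(b"\x01" * (page * 4))

        with open(path, "rb") as f:
            private = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
            private[0] = 30  # this write forces a private copy of page 0
            first_byte_in_mapping = private[0]
            # Page 1 was never written, so it still reflects the file.
            untouched_byte_in_mapping = private[page]
            private.close()

        # The write above was never carried through to the file.
        with open(path, "rb") as f:
            first_byte_in_file = f.read(1)[0]

    return first_byte_in_mapping, untouched_byte_in_mapping, first_byte_in_file


print(cow_demo())  # (30, 1, 1)
```

Only the first page gets a private copy here; the other three pages continue to be backed by the file's contents.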

Initially we mmap()ed /tmp/myarray with MAP_PRIVATE (by using mode="c"), and every page of the second array pointed at the same memory as the corresponding page of the first array.

That is, we had another array, but no extra memory was used.

Then, we made some changes to part of the second array. Those pages that were modified get copied and then modified; the rest still point to the original array. For example, if we modified some data in the first 4096 bytes in the array’s in-memory representation, a new page would be allocated that is a copy of the one in the first array, while all the other pages would remain shared.

Note: Whether or not any particular tool or technique will help depends on where the actual memory bottlenecks are in your software.

Paying only for what you change

The mmap() copy-on-write trick is useful when:

You have a very large array.
You are making copies and only partially modifying those copies.

In this situation, copy-on-write saves memory by only allocating memory for data that has actually changed. Just make sure not to modify the original array: whether those changes show up in the copies is unspecified, and may differ depending on your operating system.
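Putting this together, here is one possible way to hand out several independent copy-on-write copies of the same saved array. It is a sketch rather than the only approach: cow_copy is a hypothetical helper name, and it relies on numpy.load() accepting mmap_mode="c" for .npy files. A small 64×64 array keeps the example quick to run.

```python
import os
import tempfile

import numpy as np


def cow_copy(path):
    """Hypothetical helper: open a .npy file as a copy-on-write array."""
    return np.load(path, mmap_mode="c")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "original.npy")
    np.save(path, np.ones((64, 64)))

    # Two copies backed by the same file, each modified independently.
    copy_a = cow_copy(path)
    copy_b = cow_copy(path)
    copy_a[0, 0] = 10
    copy_b[0, 0] = 20

    # Reading the file again shows neither write reached it.
    on_disk = np.load(path)
    values = (float(copy_a[0, 0]), float(copy_b[0, 0]), float(on_disk[0, 0]))
    print(values)  # (10.0, 20.0, 1.0)

    # Release the mappings before the temporary directory is removed.
    del copy_a, copy_b
```

Each copy only pays for the pages it writes to, and the file on disk is never modified through them.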

For other data structures, like dictionaries or lists, you can use immutable data structures to reduce memory usage of mostly-similar copies; in Python the pyrsistent library is one implementation.

Learn even more techniques for reducing memory usage—read the rest of the Larger-than-memory datasets guide for Python.
