The surprising way to save memory with BytesIO
If you need a file-like object that stores bytes in memory in Python, chances are you’re using Python’s built-in io.BytesIO().
And since you’re already using an in-memory object, if your data is big enough you probably should try to save memory when reading that data back out.
After all, it’s better not to have two copies of all the data in memory when only one will suffice.
In this article we’ll cover:
- A quick intro to BytesIO.
- The memory usage impacts of BytesIO.read().
- The two alternatives for accessing BytesIO data efficiently, and the tradeoffs between them.
So what’s a BytesIO?
Python’s io.BytesIO allows you to create a file-like object that stores bytes in memory:
from io import BytesIO
f = BytesIO()
f.write(b"hello ")
f.write(b"world")
f.seek(0)
assert f.read() == b"hello world"
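Because it behaves like a file, a BytesIO can be handed to code that expects a binary file object. As a small illustration (the gzip module is only my choice of example here; nothing else in this article depends on it), this sketch compresses data into memory rather than onto disk:

```python
import gzip
from io import BytesIO

buffer = BytesIO()

# GzipFile writes its compressed output to any binary file-like object.
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
    gz.write(b"hello world")

# The compressed bytes now live in memory; read them back and decompress.
buffer.seek(0)
roundtrip = gzip.decompress(buffer.read())
```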
The problem with BytesIO.read()
At some point you might want to access the data in the BytesIO directly, as a bytes or memoryview object.
We’ll talk about what a memoryview is and why you might want it a bit later.
As we saw above, since BytesIO is a file-like object we can just use its read() method to extract bytes.
Unfortunately, this comes at the cost of doubling the amount of memory used.
To demonstrate the problem, we’ll write a utility function that measures how much extra memory we’ve allocated:
import gc
import tracemalloc
from contextlib import contextmanager

tracemalloc.start()

@contextmanager
def report_allocated(action: str):
    """Print how much traced memory grew while the with-block ran."""
    print(action + ":")
    gc.collect()
    # get_traced_memory() returns a (current, peak) tuple.
    current, _ = tracemalloc.get_traced_memory()
    try:
        yield
    finally:
        gc.collect()
        new_current, _ = tracemalloc.get_traced_memory()
        print(
            " Allocated additional",
            round((new_current - current) / (1024 * 1024)),
            "MiB\n",
        )
We can then measure memory usage from creating a BytesIO and then calling read():
with report_allocated("Creating BytesIO"):
    f = BytesIO()
    chunk = b"X" * (1024 * 1024)
    for _ in range(50):
        f.write(chunk)
    f.seek(0)

with report_allocated("BytesIO.read()"):
    data: bytes = f.read()
Here’s the output:
Creating BytesIO:
Allocated additional 57 MiB
BytesIO.read():
Allocated additional 50 MiB
As you can see, read() creates a whole new copy of the data, using a lot more memory.
Can we do better?
BytesIO.getbuffer(): getting a memoryview of the data
One useful method BytesIO has that regular files don’t is BytesIO.getbuffer(): it returns a memoryview of the underlying data.
Unlike bytes objects, a memoryview is a view into existing memory, so creating one doesn’t copy the data or allocate any significant new memory:
with report_allocated("Creating BytesIO"):
    # ... same as above ...

with report_allocated("BytesIO.getbuffer()"):
    data: memoryview = f.getbuffer()
When we run this, we get:
Creating BytesIO:
Allocated additional 57 MiB
BytesIO.getbuffer():
Allocated additional 0 MiB
Problem solved!
In some cases, anyway; you can write a memoryview to a regular file, for example.
But sometimes a memoryview isn’t what you want.
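To make the "view, not copy" behavior concrete, here is a small sketch of getbuffer() in use. It writes the view to a temporary file (the name dump.bin is just a placeholder), and also shows one documented consequence of holding a view: while it exists, the BytesIO can’t be resized.

```python
import os
import tempfile
from io import BytesIO

f = BytesIO(b"hello world")
view = f.getbuffer()

# A memoryview can be written to a regular binary file as-is.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dump.bin")  # placeholder file name
    with open(path, "wb") as out:
        out.write(view)
    with open(path, "rb") as written:
        file_contents = written.read()

# The view is writable and shares memory with the BytesIO.
view[0:5] = b"HELLO"
f.seek(0)
contents_after_edit = f.read()

# While the view exists, growing the BytesIO raises BufferError.
f.seek(0, 2)  # go to the end of the file
try:
    f.write(b"!")
    resize_blocked = False
except BufferError:
    resize_blocked = True

# Release the view once you're done so the BytesIO can grow again.
view.release()
f.write(b"!")
```

If you keep a view around for a long time, remember to release it (or use it in a with statement) before writing more data.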
Some limitations of memoryview
One problem with memoryview is that it lacks many of the methods that bytes has.
For example, we can’t do memoryview.find():
>>> data = b"abcd"
>>> data.find(b"c")
2
>>> data_view = memoryview(data)
>>> data_view.find(b"c")
Traceback (most recent call last):
File "<python-input-3>", line 1, in <module>
data_view.find(b"c")
^^^^^^^^^^^^^^
AttributeError: 'memoryview' object has no attribute 'find'
>>>
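If all you need is a search, one possible workaround (a sketch, not the only option) is the standard library’s re module, which accepts bytes-like objects such as memoryview when the pattern is bytes. Slicing a memoryview is also cheap, since a slice is just another view, so you can convert only the small piece you need into bytes:

```python
import re

data = b"abcd" * 1000
data_view = memoryview(data)

# re can search bytes-like objects, so no full copy is needed.
match = re.search(re.escape(b"cd"), data_view)
position = match.start() if match else -1

# Slicing a memoryview gives another view; only this 2-byte slice is
# copied when we convert it to bytes.
small_piece = bytes(data_view[position:position + 2])
```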
A more obscure but still real problem is accessing memoryview objects from compiled extensions.
Access has to happen through the Python buffer protocol, and the buffer protocol’s C API is part of the stable ABI only starting in CPython 3.11.
At the time of writing, most open source projects still support Python 3.10 and 3.9 as well.
If you’re working on such a project, and it compiles a single extension against the 3.9 or 3.10 version of the stable ABI (i.e. abi3 wheels), you can’t use the buffer protocol yet.
Which means you can’t access memoryview objects.
This problem should go away in October 2026, when 3.10 reaches end-of-life and open source projects drop support for anything before 3.11.
From memoryview to bytes
You can deal with both these limitations by creating a new bytes object out of a memoryview, e.g.:
>>> new_data = bytes(data_view)
But that copies the data, allocating memory and undoing all the memory-saving benefits of using BytesIO.getbuffer().
In other words, bytes(my_bytesio.getbuffer()) uses the same amount of memory as my_bytesio.read().
So that’s not helpful.
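To check that claim, here is a rough measurement that uses tracemalloc directly. The 10 MiB size is arbitrary, and the exact byte counts will vary a little between Python versions:

```python
import tracemalloc
from io import BytesIO

SIZE = 10 * 1024 * 1024  # 10 MiB, an arbitrary size for the demo

f = BytesIO()
f.write(b"X" * SIZE)

tracemalloc.start()

# get_traced_memory() returns (current, peak); we only need current.
before, _ = tracemalloc.get_traced_memory()
view = f.getbuffer()   # a view: the data itself is not copied
after_view, _ = tracemalloc.get_traced_memory()
copied = bytes(view)   # a new bytes object: a full copy of the data
after_copy, _ = tracemalloc.get_traced_memory()

tracemalloc.stop()

view_cost = after_view - before
copy_cost = after_copy - after_view
print(f"getbuffer(): {view_cost} bytes, bytes(view): {copy_cost} bytes")
```

The first number should be tiny (just the view object itself), while the second should be at least the full size of the data.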
BytesIO.getvalue(): surprisingly efficient
Another option is BytesIO.getvalue(), which returns the contents of the BytesIO as a bytes object.
My assumption has always been that this creates a copy of the underlying data.
I was wrong! The CPython developers are actually much smarter than that.
with report_allocated("Creating BytesIO"):
    # ... same as above ...

with report_allocated("BytesIO.getvalue()"):
    data: bytes = f.getvalue()
When run, we get:
Creating BytesIO:
Allocated additional 57 MiB
BytesIO.getvalue():
Allocated additional 0 MiB
This is magic.
We’re getting a new bytes object, without allocating any memory.
How does this work?
BytesIO is using copy-on-write.
Internally, it keeps a reference to the bytes object returned from getvalue().
So long as you don’t write to the BytesIO, any reads can happen off the same memory.
But when you write to the BytesIO, it knows it can’t modify the current memory anymore (since bytes are supposed to be read-only), and only at this point will it allocate new memory.
with report_allocated("Creating BytesIO"):
    # ... same as above ...

with report_allocated("BytesIO.getvalue()"):
    data: bytes = f.getvalue()

with report_allocated("write to BytesIO"):
    f.seek(0, 2)  # go to the end of the file
    f.write(b"hello")
When we run this we get:
Creating BytesIO:
Allocated additional 57 MiB
BytesIO.getvalue():
Allocated additional 0 MiB
write to BytesIO:
Allocated additional 50 MiB
This allows BytesIO.getvalue() to allocate no memory in the common case where you only read from the BytesIO after you’re done writing.
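You can observe the sharing directly. The following sketch relies on a CPython implementation detail (it checks object identity with is), so treat it as an illustration of current behavior rather than a guarantee of the BytesIO API:

```python
from io import BytesIO

f = BytesIO()
f.write(b"hello ")
f.write(b"world")

first = f.getvalue()
second = f.getvalue()

# On CPython, both calls hand back the very same bytes object.
same_object = first is second

# Writing makes the BytesIO switch to its own private copy of the data...
f.write(b"!")

# ...so the bytes object we got earlier is left untouched.
unchanged = first == b"hello world"
updated = f.getvalue() == b"hello world!"
```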
To see some real-world impacts of switching from read() to getvalue(), see this pull request I opened against Polars; the work was sponsored by G-Research’s open source program office.
getvalue() or getbuffer()?
To summarize, if you want to minimize memory usage when extracting data from BytesIO:
- Avoid BytesIO.read().
- If you need the contents as bytes, use BytesIO.getvalue().
- If you can use memoryview, use BytesIO.getbuffer().