Saving memory with Pandas 1.3’s new string dtype
When you’re loading many strings into Pandas, you’re going to use a lot of memory. If you have only a limited number of strings, you can save memory with categoricals, but that’s only helpful in a limited number of situations.
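As a quick refresher, here’s a minimal sketch of the categorical approach: when the same few values repeat many times, dtype="category" stores each unique string once, plus a small integer code per row. (The example values here are made up for illustration.)

```python
import pandas as pd

# A million rows, but only 3 distinct string values:
values = ["apple", "banana", "cherry"] * 333_334

as_object = pd.Series(values)
as_category = pd.Series(values, dtype="category")

print("object  ", as_object.memory_usage(deep=True))
print("category", as_category.memory_usage(deep=True))
```

The categorical version uses a tiny fraction of the memory—but only because there are just three distinct values. With mostly-unique strings, like the random ones below, categoricals don’t help.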
With Pandas 1.3, there’s a new option that can save memory on a large number of strings as well, simply by changing to a new column type. Let’s see how.
Pandas’ different string dtypes
Every pandas.Series, and every column in a pandas.DataFrame, has a dtype: the type of the objects stored inside it.
By default, Pandas stores strings using the object dtype, meaning it stores strings as a NumPy array of pointers to normal Python string objects.
In Pandas 1.0, a new "string" dtype was added, but as we’ll see it didn’t have any impact on memory usage.
And in Pandas 1.3, a new Arrow-based dtype was added, "string[pyarrow]" (see the Pandas release notes for complete details).
Arrow is a data format for storing columnar data, the exact kind of data Pandas represents. Among other column types, Arrow supports storing a column of strings, and it does so in a more efficient way than Python does.
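To make that concrete, here’s a simplified model—pure Python, not the real Arrow implementation—of how Arrow lays out a string column: all the strings’ UTF-8 bytes in one contiguous buffer, plus an array of 32-bit offsets marking where each string starts and ends.

```python
def arrow_style_layout(strings):
    # Simplified sketch of Arrow's string column layout:
    # one shared byte buffer, plus one int32 offset per string.
    data = bytearray()
    offsets = [0]
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return bytes(data), offsets

data, offsets = arrow_style_layout(["hello", "world", "!"])
print(data)     # b'helloworld!'
print(offsets)  # [0, 5, 10, 11]
# The i-th string is data[offsets[i]:offsets[i+1]]:
print(data[offsets[1]:offsets[2]].decode("utf-8"))  # world
```

There are no per-string Python objects and no per-string pointers: just the string bytes themselves, plus a few bytes of offset per string.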
Using the new Arrow string dtype
Let’s compare the memory usage of all three dtypes, starting by storing a series of random strings that look like the output of str(random()), e.g. "0.8031466080667982".
Here’s our test script; we’re using the memory_usage(deep=True) API, which is explained in a separate article on measuring Pandas memory usage.
from random import random
import sys

import pandas as pd

prefix = sys.argv[1]

# A Python list of strings generated from random numbers:
random_strings = [
    prefix + str(random()) for i in range(1_000_000)
]

# The default dtype, object:
object_dtype = pd.Series(random_strings)
print("object", object_dtype.memory_usage(deep=True))

# A normal Pandas string dtype:
standard_dtype = pd.Series(random_strings, dtype="string")
print("string", standard_dtype.memory_usage(deep=True))

# The new Arrow string dtype from Pandas 1.3:
arrow_dtype = pd.Series(
    random_strings, dtype="string[pyarrow]"
)
print("arrow ", arrow_dtype.memory_usage(deep=True))
We can pass in a prefix that gets added to all strings. For now, we’ll leave it empty, and just store the random strings; they will typically be 18 characters long:
$ python string_dtype.py ""
object 75270787
string 75270787
arrow  22270791
As you can see, the normal object dtype and the string dtype use the same amount of memory, but the Arrow dtype uses far less.
Note: Whether or not any particular tool or technique will help depends on where the actual memory bottlenecks are in your software.
Need to identify the memory and performance bottlenecks in your own Python data processing code? Try the Sciagraph profiler, with support for profiling both in development and production macOS and Linux, and with built-in Jupyter support.
Why the difference?
When Python stores a string, it stores quite a bit of metadata: the overhead of the Python object itself, to begin with, and then a bunch of metadata about the string, and finally the string itself. We can measure object size like so:
>>> import sys
>>> sys.getsizeof("")
49
>>> sys.getsizeof("0.8031466080667982")
67
For a million strings, we expect memory usage to be the size of the strings themselves (67 × 1,000,000) plus the NumPy array with pointers (8 × 1,000,000):
>>> (67 + 8) * 1_000_000 75000000
That’s pretty close to the memory we measured above, 75270787.
Notice that for each 18-character string, which can be represented in 18 bytes in ASCII or UTF-8 encodings, there’s 49 bytes of overhead. In contrast, the Arrow representation stores strings with far less overhead:
>>> 22270791 / 1_000_000 22.270791
That’s about 22 bytes per string—just 4 bytes of overhead beyond the 18 bytes of string data—compared to 49 bytes of overhead for normal string columns. For small strings, that’s a huge difference. As the strings get larger, the overhead from Python’s representation matters less. We can use a very large prefix to store much larger strings:
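That 4-byte figure is consistent with the layout model above: Arrow’s string type uses one 32-bit offset per string, so a back-of-the-envelope estimate for a million 18-character strings is:

```python
n = 1_000_000
utf8_bytes = 18   # each random string is ~18 ASCII characters
offset_bytes = 4  # one 32-bit offset per string
estimate = n * (utf8_bytes + offset_bytes)
print(estimate)   # 22000000, close to the 22270791 we measured
```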
$ python string_dtype.py "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx"
object 442269865
string 442269865
arrow  389269869
Arrow is still more efficient, but it makes much less of a difference.
Giving it a try
At the time of writing, this new column dtype is just a month old (it was released in 1.3.0, in early July 2021), and it’s marked as “experimental”. But the massive reduction in memory overhead is very useful, in particular when you have predominantly small strings.
If strings are a memory bottleneck in your program, do give it a try, and if you find problems, file a bug with the Pandas maintainers, so that this dtype can move from experimental to stable status.
Learn even more techniques for reducing memory usage—read the rest of the Larger-than-memory datasets guide for Python.