Saving memory with Pandas 1.3’s new string dtype
When you’re loading many strings into Pandas, you’re going to use a lot of memory. If your strings have only a limited number of distinct values, you can save memory with categoricals, but that’s only helpful in a limited number of situations.
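For example, a column with millions of values but only a handful of distinct strings shrinks dramatically as a categorical; here’s a minimal sketch of that comparison:

import pandas as pd

# Two distinct values repeated many times, a good fit for categoricals:
repetitive = pd.Series(["yes", "no"] * 500_000)
as_category = repetitive.astype("category")

print("object  ", repetitive.memory_usage(deep=True))
print("category", as_category.memory_usage(deep=True))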
With Pandas 1.3, there’s a new option that can save memory on large numbers of strings as well, simply by changing to a new column type. Let’s see how.
Pandas’ different string dtypes
Every pandas.Series, and every column in a pandas.DataFrame, has a dtype: the type of object stored inside it.
By default, Pandas stores strings using the object dtype, meaning it stores them as a NumPy array of pointers to normal Python objects.
In Pandas 1.0, a new "string" dtype was added, but as we’ll see it didn’t have any impact on memory usage.
And in Pandas 1.3, a new Arrow-based dtype was added, "string[pyarrow]" (see the Pandas release notes for complete details).
Arrow is a data format for storing columnar data, the exact kind of data Pandas represents. Among other column types, Arrow supports storing a column of strings, and it does so in a more efficient way than Python does.
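You can see this outside of Pandas too; here’s a minimal sketch using the pyarrow library directly, assuming you have it installed:

import pyarrow as pa

# Arrow packs all the string bytes into one contiguous buffer,
# with an offset per entry, rather than one Python object per string:
arr = pa.array(["hello", "world", "!"])
print(arr.type)    # string
print(arr.nbytes)  # total size of the array's buffers, in bytes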
Using the new Arrow string dtype
Let’s compare the memory usage of all three dtypes, starting by storing a series of random strings that look like 0.4703931350378988 or 0.8031466080667982.
Here’s our test script; we’re using the memory_usage(deep=True) API, which is explained in a separate article on measuring Pandas memory usage.
from random import random
import sys

import pandas as pd

prefix = sys.argv[1]

# A Python list of strings generated from random numbers:
random_strings = [
    prefix + str(random()) for _ in range(1_000_000)
]

# The default dtype, object:
object_dtype = pd.Series(random_strings)
print("object", object_dtype.memory_usage(deep=True))

# A normal Pandas string dtype:
standard_dtype = pd.Series(random_strings, dtype="string")
print("string", standard_dtype.memory_usage(deep=True))

# The new Arrow string dtype from Pandas 1.3:
arrow_dtype = pd.Series(
    random_strings, dtype="string[pyarrow]"
)
print("arrow ", arrow_dtype.memory_usage(deep=True))
We can pass in a prefix that gets added to all strings. For now, we’ll leave it empty, and just store the random strings; they will typically be 18 characters long:
$ python string_dtype.py ""
object 75270787
string 75270787
arrow 22270791
As you can see, the normal object dtype and the string dtype use the same amount of memory, but the Arrow dtype uses far less.
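If you already have a DataFrame with an object column, you should be able to switch it over with astype(); a minimal sketch (pyarrow must be installed for this dtype to work):

import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob", "carol"]})

# Convert the default object column to the Arrow-backed string dtype:
df["name"] = df["name"].astype("string[pyarrow]")
print(df["name"].dtype)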
Why the difference?
When Python stores a string, it stores quite a bit more than the characters: the overhead of the Python object itself, to begin with, then a bunch of metadata about the string, and finally the string’s contents. We can measure an object’s size like so:
>>> import sys
>>> sys.getsizeof("")
49
>>> sys.getsizeof("0.8031466080667982")
67
For a million strings, we expect memory usage to be the size of the string objects themselves (67 × 1,000,000) plus the NumPy array of pointers to them (8 × 1,000,000):
>>> (67 + 8) * 1_000_000
75000000
That’s pretty close to the memory we measured above, 75270787.
Notice that for each 18-character string, which can be represented in 18 bytes in the ASCII or UTF-8 encodings, there are 49 bytes of Python object overhead, plus the 8-byte pointer in the NumPy array. In contrast, the Arrow representation stores strings with far less overhead:
>>> 22270791 / 1_000_000
22.270791
That’s just over 4 bytes of overhead per string when using Arrow (22.27 bytes per string, minus the 18 bytes of string data), compared to 49 for normal string columns; the Arrow overhead is mostly the 32-bit offset it stores for each entry, since the string bytes themselves live in one shared buffer. For small strings, that’s a huge difference. As the strings get larger, the overhead from Python’s representation matters less. We can use a very long prefix to store much larger strings:
$ python string_dtype.py "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx"
object 442269865
string 442269865
arrow 389269869
Arrow is still more efficient, saving about 53 bytes per string (49 bytes of object overhead plus the 8-byte pointer, minus Arrow’s ~4 bytes), but with strings this long that’s a much smaller fraction of the total.
Giving it a try
At the time of writing, this new column dtype is just a month old (it was released in 1.3.0, in early July 2021), and it’s marked as “experimental”. But the massive reduction in memory overhead is very useful, in particular when you have predominantly small strings.
If strings are a memory bottleneck in your program, do give it a try. And if you find problems, file a bug with the Pandas maintainers, so that this dtype can move from experimental to stable status.
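The 1.3 release notes also describe a mode.string_storage option that makes plain "string" columns use the Arrow backend by default; here’s a minimal sketch of how that should look, with the option name taken from those release notes, so double-check it against your Pandas version:

import pandas as pd

# Make the plain "string" dtype use Arrow-backed storage by default:
pd.set_option("mode.string_storage", "pyarrow")

s = pd.Series(["hello", "world"], dtype="string")
print(s.dtype)  # should report string[pyarrow]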