Saving memory with Pandas 1.3’s new string dtype

When you’re loading many strings into Pandas, you’re going to use a lot of memory. If those strings take only a limited number of distinct values, you can save memory with categoricals, but that only helps in a limited number of situations.

With Pandas 1.3, there’s a new option that can save memory on a large number of strings as well, simply by changing to a new column type. Let’s see how.

Pandas’ different string dtypes

Every pandas.Series, and every column in a pandas.DataFrame, has a dtype: the type of object stored inside it. By default, Pandas will store strings using the object dtype, meaning it stores strings as a NumPy array of pointers to normal Python objects.
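To make the pointer-based storage concrete, here is a minimal sketch. It passes dtype=object explicitly so the result doesn’t depend on what a given Pandas version infers by default, and the two sample values are just the example strings used later in this article.

```python
import pandas as pd

# Explicitly request the object dtype so the example behaves the same
# regardless of which dtype a given Pandas version infers for strings.
s = pd.Series(["0.4703931350378988", "0.8031466080667982"], dtype=object)

print(s.dtype)           # the column's dtype
print(type(s.iloc[0]))   # each element is an ordinary Python str

# Shallow usage counts only the array of pointers; deep usage also
# counts the Python string objects those pointers refer to.
shallow = s.memory_usage(index=False, deep=False)
deep = s.memory_usage(index=False, deep=True)
print(shallow, deep)
```

The gap between the shallow and deep numbers is the memory taken by the Python string objects themselves.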

In Pandas 1.0, a new "string" dtype was added, but as we’ll see it didn’t have any impact on memory usage. And in Pandas 1.3, a new Arrow-based dtype was added, "string[pyarrow]" (see the Pandas release notes for complete details).

Arrow is a data format for storing columnar data, the exact kind of data Pandas represents. Among other column types, Arrow supports storing a column of strings, and it does so in a more efficient way than Python does.
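As a rough illustration of why a columnar string layout can be compact, here is a pure-Python sketch of the general idea: one contiguous buffer of UTF-8 bytes plus an array of offsets marking where each string ends. This is a simplification and not Arrow’s actual implementation (it omits, for example, the bitmap that tracks missing values), and the helper names pack_strings and get_string are hypothetical.

```python
from array import array


def pack_strings(strings):
    """Pack strings into one UTF-8 byte buffer plus an offsets array."""
    data = bytearray()
    # "i" is a signed C int, which is 4 bytes on common platforms.
    offsets = array("i", [0])
    for s in strings:
        data.extend(s.encode("utf-8"))
        offsets.append(len(data))  # end position of this string
    return bytes(data), offsets


def get_string(data, offsets, i):
    """Recover the i-th string by slicing between two adjacent offsets."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


data, offsets = pack_strings(["0.47", "0.8031", ""])
print(get_string(data, offsets, 1))

# Total storage: the raw bytes plus one small integer per string
# (and one extra offset at the start), with no per-string object header.
print(len(data) + offsets.itemsize * len(offsets))
```

In this layout the per-string cost beyond the text itself is a single offset, rather than a full Python object plus a pointer to it.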

Using the new Arrow string dtype

Let’s compare the memory usage of all three dtypes, starting by storing a series of random strings that look like 0.4703931350378988 or 0.8031466080667982. Here’s our test script; we’re using the memory_usage(deep=True) API, which is explained in a separate article on measuring Pandas memory usage.

from random import random
import sys
import pandas as pd

prefix = sys.argv[1]

# A Python list of strings generated from random numbers:
random_strings = [
    prefix + str(random()) for i in range(1_000_000)
]

# The default dtype, object:
object_dtype = pd.Series(random_strings)
print("object", object_dtype.memory_usage(deep=True))

# A normal Pandas string dtype:
standard_dtype = pd.Series(random_strings, dtype="string")
print("string", standard_dtype.memory_usage(deep=True))

# The new Arrow string dtype from Pandas 1.3:
arrow_dtype = pd.Series(
    random_strings, dtype="string[pyarrow]"
)
print("arrow ", arrow_dtype.memory_usage(deep=True))

We can pass in a prefix that gets added to all strings. For now, we’ll leave it empty, and just store the random strings; they will typically be 18 characters long:

$ python string_dtype.py ""
object 75270787
string 75270787
arrow  22270791

As you can see, the normal object dtype and the string dtype use the same amount of memory, but the Arrow dtype uses far less.

Why the difference?

When Python stores a string, it stores quite a bit of metadata: the overhead of the Python object itself, to begin with, and then a bunch of metadata about the string, and finally the string itself. We can measure object size like so:

>>> import sys
>>> sys.getsizeof("")
49
>>> sys.getsizeof("0.8031466080667982")
67

For a million strings, we expect memory usage to be the size of the strings themselves (67 × 1,000,000) plus the NumPy array with pointers (8 × 1,000,000):

>>> (67 + 8) * 1_000_000
75000000

That’s pretty close to the memory we measured above, 75270787.

Notice that for each 18-character string, which can be represented in 18 bytes in ASCII or UTF-8 encodings, there are 49 bytes of overhead. In contrast, the Arrow representation stores strings with far less overhead:

>>> 22270791 / 1_000_000
22.270791

That’s roughly 22 bytes stored for every 18 bytes of text, or just 4 bytes of overhead per string when using Arrow, compared to 49 for normal string columns. For small strings, that’s a huge difference. As the strings get larger, the overhead from Python’s representation matters less. We can use a very large prefix to store much larger strings:

$ python string_dtype.py "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx"
object 442269865
string 442269865
arrow  389269869

Arrow is still more efficient, but it makes a lot less of a difference.
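The two measurements can be approximated with a back-of-the-envelope model. The sketch below uses the per-string figures from this article (49 bytes of Python string overhead, an 8-byte pointer, and about 4 bytes of Arrow overhead); these are measurements from one setup and may differ across Python versions. The length of 385 characters for the prefixed strings is an estimate inferred from the measured totals, not a value stated above.

```python
# Per-string costs taken from the measurements in this article; they are
# assumptions about one particular setup, not universal constants.
PYTHON_STR_OVERHEAD = 49  # sys.getsizeof("") as measured above
POINTER_SIZE = 8          # one pointer per string in the NumPy array
ARROW_OVERHEAD = 4        # approximate per-string overhead seen with Arrow


def estimate_bytes_per_string(length):
    """Estimate bytes per string for object and Arrow storage (ASCII text)."""
    object_bytes = PYTHON_STR_OVERHEAD + POINTER_SIZE + length
    arrow_bytes = ARROW_OVERHEAD + length
    return object_bytes, arrow_bytes


# 18 is the typical length of the short random strings; 385 is an
# estimated length for the prefixed strings (inferred, see lead-in).
for length in (18, 385):
    object_bytes, arrow_bytes = estimate_bytes_per_string(length)
    print(length, object_bytes, arrow_bytes, round(object_bytes / arrow_bytes, 2))
```

Because the overhead is a fixed number of bytes per string, its share of the total shrinks as the strings themselves grow, which matches the smaller gap in the second run.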

Giving it a try

At the time of writing, this new column dtype is just a month old (it was released in 1.3.0, in early July 2021), and it’s marked as “experimental”. But the massive reduction in memory overhead is very useful, in particular when you have predominantly small strings.

If strings are a memory bottleneck in your program, do give it a try. And if you find problems, file a bug with the Pandas maintainers, so that this can move from experimental to stable status.

