Saving memory with Pandas 1.3’s new string dtype
When you’re loading many strings into Pandas, you’re going to use a lot of memory. If you have only a limited number of strings, you can save memory with categoricals, but that’s only helpful in a limited number of situations.
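As a rough illustration of the categorical case, here is a minimal sketch. The data is hypothetical (300,000 values drawn from only three distinct strings), and the exact numbers will vary with your Pandas and Python versions.

```python
import pandas as pd

# Hypothetical data: many rows, but only three distinct strings.
colors = ["red", "green", "blue"] * 100_000

# Force the object dtype so each row points at a Python string.
as_object = pd.Series(colors, dtype=object)
# A categorical stores each distinct string once, plus a small integer code per row.
as_category = as_object.astype("category")

print("object  ", as_object.memory_usage(deep=True))
print("category", as_category.memory_usage(deep=True))
```

This only pays off when the number of distinct values is small relative to the number of rows; with mostly unique strings, the categorical still has to store every string.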
With Pandas 1.3, there’s a new option that can save memory on large numbers of strings as well, simply by changing to a new column type. Let’s see how.
Pandas’ different string dtypes
Every pandas.Series, and every column in a pandas.DataFrame, has a dtype: the type of object stored inside it.
By default, Pandas will store strings using the object dtype, meaning it stores strings as a NumPy array of pointers to normal Python objects.
In Pandas 1.0, a new "string" dtype was added, but as we’ll see it didn’t have any impact on memory usage.
And in Pandas 1.3, a new Arrow-based dtype was added, "string[pyarrow]" (see the Pandas release notes for complete details).
Arrow is a data format for storing columnar data, the exact kind of data Pandas represents. Among other column types, Arrow supports storing a column of strings, and it does so in a more efficient way than Python does.
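To get a feel for why a columnar layout can be more compact, here is a toy sketch using only the standard library. It is not Arrow's actual implementation; it only mimics the general idea of keeping all the string bytes in one contiguous buffer plus an array of integer offsets. The names pack_strings and get_string are made up for this illustration.

```python
from array import array

def pack_strings(strings):
    """Pack strings into one UTF-8 byte buffer plus an array of end offsets."""
    data = bytearray()
    offsets = array("i", [0])  # C ints, typically 4 bytes each
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))  # where this string ends in the buffer
    return bytes(data), offsets

def get_string(data, offsets, i):
    """Recover the i-th string by slicing between two adjacent offsets."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

words = ["pandas", "arrow", "", "naïve"]
data, offsets = pack_strings(words)

# Total cost: the raw bytes plus one small integer per string (and one extra).
bytes_used = len(data) + offsets.itemsize * len(offsets)
print(bytes_used)
print([get_string(data, offsets, i) for i in range(len(words))])
```

In this layout, the per-string cost is a single offset rather than a full Python object and a pointer to it.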
Using the new Arrow string dtype
Let’s compare the memory usage of all three dtypes, starting by storing a series of strings generated from random numbers.
Here’s our test script; we’re using the memory_usage(deep=True) API, which is explained in a separate article on measuring Pandas memory usage.
    from random import random
    import sys
    import pandas as pd

    # The prefix is the first command-line argument:
    prefix = sys.argv[1]

    # A Python list of strings generated from random numbers:
    random_strings = [
        prefix + str(random()) for i in range(1_000_000)
    ]

    # The default dtype, object:
    object_dtype = pd.Series(random_strings)
    print("object", object_dtype.memory_usage(deep=True))

    # A normal Pandas string dtype:
    standard_dtype = pd.Series(random_strings, dtype="string")
    print("string", standard_dtype.memory_usage(deep=True))

    # The new Arrow string dtype from Pandas 1.3:
    arrow_dtype = pd.Series(
        random_strings, dtype="string[pyarrow]"
    )
    print("arrow ", arrow_dtype.memory_usage(deep=True))
We can pass in a prefix that gets added to all strings. For now, we’ll leave it empty, and just store the random strings; they will typically be 18 characters long:
    $ python string_dtype.py ""
    object 75270787
    string 75270787
    arrow  22270791
As you can see, the normal object dtype and the string dtype use the same amount of memory, but the Arrow dtype uses far less.
Why the difference?
When Python stores a string, it stores quite a bit of metadata: the overhead of the Python object itself, to begin with, and then a bunch of metadata about the string, and finally the string itself. We can measure object size like so:
    >>> import sys
    >>> sys.getsizeof("")
    49
    >>> sys.getsizeof("0.8031466080667982")
    67
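The same measurement can be turned into a small script that separates the fixed cost from the per-character cost. This is a minimal sketch; the exact size of an empty string depends on the CPython version and platform, so the numbers it prints may not match the ones above.

```python
import sys

# Fixed cost of an empty str on this interpreter.
base = sys.getsizeof("")

for s in ["", "a", "0.8031466080667982"]:
    size = sys.getsizeof(s)
    # For ASCII-only strings, each character adds one byte on top of the base.
    print(f"{len(s):2d} chars -> {size} bytes ({size - len(s)} bytes of overhead)")
```

The overhead column stays constant for ASCII strings, which is why it dominates when the strings are short.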
For a million strings, we expect memory usage to be the size of the strings themselves (67 × 1,000,000) plus the NumPy array with pointers (8 × 1,000,000):
    >>> (67 + 8) * 1_000_000
    75000000
That’s pretty close to the memory we measured above, 75270787.
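The same estimate can be checked against Pandas directly on a smaller series. This is a sketch under a few assumptions: a 64-bit platform (8-byte pointers), a hypothetical list of 10,000 short strings, and index=False so that the index is left out of the measurement.

```python
import sys
import pandas as pd

# Hypothetical data: 10,000 short numeric-looking strings.
strings = [str(i / 7) for i in range(10_000)]
series = pd.Series(strings, dtype=object)

# Estimate: size of every Python string, plus one 8-byte pointer per row.
estimate = sum(sys.getsizeof(s) for s in strings) + 8 * len(strings)

# What Pandas reports for the column data alone.
measured = series.memory_usage(deep=True, index=False)

print("estimate", estimate)
print("measured", measured)
```

The two figures should agree closely, since a deep measurement of an object column adds up the sizes of the individual Python objects.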
Notice that for each 18-character string, which can be represented in 18 bytes in ASCII or UTF-8 encodings, there’s 49 bytes of overhead. In contrast, the Arrow representation stores strings with far less overhead:
    >>> 22270791 / 1_000_000
    22.270791
That’s about 4 bytes of overhead per string when using Arrow, compared to 49 for normal string columns. For small strings, that’s a huge difference. As the strings get larger, the overhead from Python’s representation matters less. We can use a very large prefix to store much larger strings:
$ python string_dtype.py "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx" object 442269865 string 442269865 arrow 389269869
Arrow is still more efficient, but it makes much less of a difference.
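A back-of-the-envelope model shows why the gap shrinks. The sketch below plugs in the per-string overheads from the measurements above (49 bytes plus an 8-byte pointer for object, roughly 4 bytes for Arrow). The function name is made up for illustration, and 385 is an assumed average length that roughly matches the long-prefix run.

```python
# Per-string overheads in bytes, taken from the measurements above.
OBJECT_OVERHEAD = 49 + 8  # Python str overhead plus an 8-byte pointer
ARROW_OVERHEAD = 4        # approximate

def object_to_arrow_ratio(string_length):
    """Estimated object-dtype memory divided by Arrow-dtype memory."""
    return (string_length + OBJECT_OVERHEAD) / (string_length + ARROW_OVERHEAD)

for length in (18, 385):
    print(length, round(object_to_arrow_ratio(length), 2))
```

Under these assumptions the object dtype uses roughly 3.4 times as much memory for 18-character strings, but only about 1.1 times as much for 385-character strings, in line with the two runs above.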
Giving it a try
At the time of writing, this new column dtype is just a month old (it was released in 1.3.0, in early July 2021), and it’s marked as “experimental”. But the massive reduction in memory overhead is very useful, in particular when you have predominantly small strings.
If strings are a memory bottleneck in your program, do give it a try, and if you find problems, file a bug with the Pandas maintainers so that this can move from experimental to stable status.
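Trying it on an existing column can be as simple as a call to astype(). This is a minimal sketch with a hypothetical DataFrame; it assumes Pandas 1.3 or later, and falls back to the regular "string" dtype if the pyarrow package is not installed.

```python
import pandas as pd

# Hypothetical DataFrame with a column of short strings.
df = pd.DataFrame({"name": ["alice", "bob", "carol"] * 1000})

try:
    # Requires Pandas 1.3+ with the pyarrow package installed.
    df["name"] = df["name"].astype("string[pyarrow]")
except ImportError:
    # Assumed fallback for this sketch: the regular "string" dtype.
    df["name"] = df["name"].astype("string")

print(repr(df["name"].dtype))
print(df["name"].memory_usage(deep=True))

# String methods keep working through the usual .str accessor.
print(df["name"].str.upper().head(3).tolist())
```

Because the dtype is experimental, it is worth re-running your own tests after switching, since some operations may behave differently than with object columns.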
Learn even more techniques for reducing memory usage—read the rest of the Larger-than-memory datasets guide for Python.