Pandas vectorization: faster code, slower code, bloated memory
When you’re processing data with Pandas, so-called “vectorized” operations can significantly speed up your code. Or at least, that’s the theory.
In practice, in some situations Pandas vectorized operations can actually make your code slower, or at least no faster. And they can also significantly increase memory usage.
Let’s dig in and see what vectorization means in Pandas, when and why it helps, and when it’s harmful.
Vectorization: what it means, and how it speeds up your code
Vectorization can mean different things, as discussed in a more in-depth article on what vectorization means in Python. For our purposes there are two relevant meanings:
- Batch API: An API that can process multiple items of data at once.
- A native-code loop: In addition to exposing a batch API, the implementation runs quickly by not calling back into Python.
Importantly, in the context of Pandas’ documentation, vectorization only guarantees the first definition: the ability to run an operation across a whole Series, Index, or even DataFrame at once.
In some cases, though, the APIs also implement the second definition.
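To make the two meanings concrete, here is a minimal sketch. The function names `double_python_loop` and `double_native_loop` are made up for illustration and are not part of Pandas. Both accept a whole Series at once, so both are batch APIs, but only the second hands the loop over to compiled code.

```python
import pandas as pd


def double_python_loop(series):
    """Batch API only: takes a whole Series, but loops over it in Python."""
    return pd.Series([value * 2 for value in series], index=series.index)


def double_native_loop(series):
    """Batch API with a native loop: the arithmetic runs in compiled code."""
    return series * 2


numbers = pd.Series([1.5, 2.5, 4.0])
print(double_python_loop(numbers).tolist())
print(double_native_loop(numbers).tolist())
```

From the caller’s point of view the two functions look identical, which is why a batch API alone tells you nothing about speed.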
Consider the following semantically equivalent calculations:
# ... Vectorized operation:
df["ratio"] = 100 * (df["x"] / df["y"])
# ... Non-vectorized operation:
def calc_ratio(row):
    return 100 * (row["x"] / row["y"])
df["ratio2"] = df.apply(calc_ratio, axis=1)
If we measure how long each takes to run, the result is:
Vectorized: 0.0043 secs
Non-vectorized: 5.6435 secs
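A measurement along these lines can be sketched with `timeit`. The DataFrame contents and the row count here are assumptions chosen so the script finishes quickly, so the absolute numbers will not match the ones above.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed sample data: the original DataFrame's size and contents are not given.
rng = np.random.default_rng(0)
n_rows = 10_000
df = pd.DataFrame({"x": rng.random(n_rows), "y": rng.random(n_rows) + 1.0})


def vectorized():
    return 100 * (df["x"] / df["y"])


def calc_ratio(row):
    return 100 * (row["x"] / row["y"])


def non_vectorized():
    return df.apply(calc_ratio, axis=1)


# Run each variant once and report wall-clock time.
vectorized_secs = timeit.timeit(vectorized, number=1)
non_vectorized_secs = timeit.timeit(non_vectorized, number=1)
print(f"Vectorized: {vectorized_secs:.4f} secs")
print(f"Non-vectorized: {non_vectorized_secs:.4f} secs")
```

Timings vary by machine and Pandas version, so treat the output as a relative comparison only.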
The vectorized 100 * (df["x"] / df["y"]) is much faster because it avoids using Python code in the inner loop.
Internally, Pandas Series are often stored as NumPy arrays, in this case arrays of floats.
Pandas is smart enough to pass the multiplication and division on to the underlying arrays, which then run the loop in machine code.
No slow Python code is involved in doing the arithmetic.
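As a rough illustration of that delegation, the sketch below pulls the NumPy arrays out of two float columns and does the same arithmetic on them directly. The sample values are made up, and this shows an equivalent calculation rather than Pandas’ internal code path.

```python
import numpy as np
import pandas as pd

# Assumed sample data for illustration.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 8.0]})

# For plain float columns, to_numpy() exposes a float64 NumPy array.
x_array = df["x"].to_numpy()
y_array = df["y"].to_numpy()
print(type(x_array).__name__, x_array.dtype)

# The same arithmetic, once on the raw arrays and once on the Series.
ratio_from_arrays = 100 * (x_array / y_array)
ratio_from_series = 100 * (df["x"] / df["y"])
print(np.array_equal(ratio_from_arrays, ratio_from_series.to_numpy()))
```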
In contrast, the non-vectorized method calls a Python function for every row, and that Python function does additional operations. Eventually this devolves into low-level multiplication and division, but there is slow and expensive Python code being called repeatedly for every single row.
Our initial attempt at using vectorization appears to be a complete success: the code runs vastly faster.
Vectorization in strings
Let’s see how vectorization does on strings: Pandas provides a .str object on Series that lets you run various vectorized operations on strings.
As an example, we’re going to calculate how many words there are in each sentence in a Series:
# ... Vectorized operation:
df["sentence_length"] = df["sentences"].str.split().apply(
    len
)
# ... Non-vectorized operation:
def sentence_length(s):
    return len(s.split())
df["sentence_length2"] = df["sentences"].apply(
    sentence_length
)
The resulting run times:
Vectorized: 1.492 secs
Non-vectorized: 0.280 secs
The vectorized code is much slower! What’s going on?
Let’s measure the code with the Sciagraph profiler to find out. Here’s a timeline of execution:
It looks like in the vectorized path the code eventually devolves to a Python function that is applied to every row—just like our code:
def _str_split(
    self,
    pat: str | re.Pattern | None = None,
    n=-1,
    expand=False,
    regex: bool | None = None,
):
    # ...
    f = lambda x: x.split(pat, n)
    # ...
    return self._str_map(f, dtype=object)
So in this case, unlike the numeric calculations, the underlying implementation is in the end the same Python code, with a lot of overhead added on.
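Conceptually, this behaves something like the simplified stand-in below. The helper `map_over_strings` is hypothetical and much cruder than Pandas’ actual `_str_map`, which handles missing values and dtypes. The Series is created with `dtype=object` on the assumption that the strings are stored as ordinary Python objects.

```python
import pandas as pd


def map_over_strings(series, func):
    # Hypothetical helper: call a Python function once per element,
    # which is the essence of what happens for object-dtype strings.
    return pd.Series(
        [func(value) for value in series], index=series.index, dtype=object
    )


sentences = pd.Series(
    ["the quick brown fox", "hello world", "one"], dtype=object
)

split_by_hand = map_over_strings(sentences, lambda s: s.split())
split_by_pandas = sentences.str.split()

print(split_by_hand.tolist() == split_by_pandas.tolist())
print(split_by_pandas.apply(len).tolist())
```

Either way, a Python-level `split()` call runs once per row, so there is no native loop to gain speed from.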
This is why it’s so important to understand that vectorization has multiple meanings.
Vectorization in Pandas doesn’t necessarily mean the code will run faster; it just means the API lets you operate in batches, in this case on a whole Series.
Sometimes that means it’ll also be faster, by running the whole loop with fast native code—the second meaning of vectorization—as was the case with the numeric calculations above.
For strings, the code can be slower.
Note: Whether or not any particular tool or technique will speed things up depends on where the bottlenecks are in your software.
Need to identify the performance and memory bottlenecks in your own Python data processing code? Try the Sciagraph profiler, with support for profiling both in development and production on macOS and Linux, and with built-in Jupyter support.
The memory-use implications of vectorization
So far we’ve been focusing on performance, but memory usage can also be a bottleneck. Use too much memory, and your program might swap and start running slowly, or even crash.
How does vectorization impact memory use? Using the Fil memory profiler, I measured memory usage of each of the two methods we used to count the number of words in sentences. Here’s the vectorized code:
And here’s the non-vectorized code:
The vectorized code uses vastly more memory.
Inspecting the profiling result above, and reading the original code, we can see what’s going on: a temporary Series is being created, containing lists of words.
The code we wrote is equivalent to:
temporary_series = df["sentences"].str.split()
df["sentence_length"] = temporary_series.apply(len)
This temporary Series is massively increasing memory usage, by storing heavyweight Python objects (lists and strings), and we don’t even need it.
In the non-vectorized code we split a sentence string into a list, run len() on that, and then throw away the list.
So we only use temporary memory for the words of a single sentence at a time, rather than all sentences at once: O(1) instead of O(N) in the number of sentences.
This results in far less memory usage.
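If you don’t have Fil at hand, a rough version of this comparison can be sketched with the standard library’s `tracemalloc`. This swaps in a different tool from the one used for the measurements above, and `tracemalloc` only sees allocations made through Python’s allocators, so the numbers will not match Fil’s. The sample sentences and row count are assumptions.

```python
import tracemalloc

import pandas as pd

# Assumed sample data: the original sentences are not given.
n_sentences = 5_000
df = pd.DataFrame(
    {"sentences": ["the quick brown fox jumps over the lazy dog"] * n_sentences}
)


def count_vectorized():
    # Builds a temporary Series holding one list of words per sentence.
    return df["sentences"].str.split().apply(len)


def count_non_vectorized():
    # Each temporary list is discarded before the next sentence is split.
    return df["sentences"].apply(lambda s: len(s.split()))


def peak_traced_bytes(func):
    """Peak memory traced by tracemalloc while func runs."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Warm up both paths so one-time setup costs are not counted.
count_vectorized()
count_non_vectorized()

vectorized_peak = peak_traced_bytes(count_vectorized)
non_vectorized_peak = peak_traced_bytes(count_non_vectorized)
print(f"Vectorized peak: {vectorized_peak / 1e6:.2f} MB")
print(f"Non-vectorized peak: {non_vectorized_peak / 1e6:.2f} MB")
```

The gap should widen as the number of sentences grows, since the temporary Series scales with it.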
Takeaways
- Be aware of the multiple meanings of vectorization. In Pandas, it just means a batch API.
- Numeric code in Pandas often benefits from the second meaning of vectorization, a vastly faster native code loop.
- Vectorization in strings in Pandas can often be slower, since it doesn’t use native code loops.
- Vectorization can result in temporary Series, with a corresponding increase in memory usage proportional to the Series size.
More broadly: if you care about performance and memory usage, you need to measure it! Some results above I did not expect; some of the solutions I tried didn’t work and were therefore omitted.
Without measurement, you have no insight into where the bottlenecks are. For offline memory profiling, you can use the open source Fil profiler, and for offline performance profiling of generic Python programs you can use py-spy or Austin. For performance and memory profiling of Python data processing jobs, whether in development or production, try out the Sciagraph profiler.