Pandas vectorization: faster code, slower code, bloated memory
When you’re processing data with Pandas, so-called “vectorized” operations can significantly speed up your code. Or at least, that’s the theory.
In practice, in some situations Pandas vectorized operations can actually make your code slower, or at least no faster. And they can also significantly increase memory usage.
Let’s dig in and see what vectorization means in Pandas, when and why it helps, and when it’s harmful.
Vectorization: what it means, and how it speeds up your code
Vectorization can mean different things, as discussed in a more in-depth article on what vectorization means in Python. For our purposes there are two relevant meanings:
- Batch API: An API that can process multiple items of data at once.
- A native-code loop: In addition to exposing a batch API, the implementation runs quickly by not calling back into Python.
Importantly, in the context of Pandas’ documentation, vectorization only guarantees the first definition: the ability to run an operation across a whole Series, Index, or even DataFrame at once.
In some cases, though, the APIs also implement the second definition.
Consider the following semantically equivalent calculations:
```python
# Vectorized operation:
df["ratio"] = 100 * (df["x"] / df["y"])

# Non-vectorized operation:
def calc_ratio(row):
    return 100 * (row["x"] / row["y"])

df["ratio2"] = df.apply(calc_ratio, axis=1)
```
If we measure how long each takes to run, the result is:
```
Vectorized:     0.0043 secs
Non-vectorized: 5.6435 secs
```
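The exact DataFrame used for these timings isn’t shown; as a rough sketch, assuming a million rows of random floats, the comparison can be reproduced like this (the column values are made up, so your absolute numbers will differ):

```python
import time

import numpy as np
import pandas as pd

# Assumed setup: one million rows of random floats.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.random(1_000_000),
    "y": rng.random(1_000_000) + 0.5,  # offset to avoid division by ~0
})

# Vectorized operation:
start = time.time()
df["ratio"] = 100 * (df["x"] / df["y"])
print(f"Vectorized: {time.time() - start:.4f} secs")

# Non-vectorized operation, calling a Python function per row:
def calc_ratio(row):
    return 100 * (row["x"] / row["y"])

start = time.time()
df["ratio2"] = df.apply(calc_ratio, axis=1)
print(f"Non-vectorized: {time.time() - start:.4f} secs")
```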
`100 * (df["x"] / df["y"])` is much faster because it avoids using Python code in the inner loop.
Each numeric Series is stored as a NumPy array, in this case an array of floats.
Pandas is smart enough to pass the multiplication and division on to the underlying arrays, which then do a loop in machine code to do the multiplication.
No slow Python code is involved in doing the arithmetic.
In contrast, the non-vectorized method calls a Python function for every row, and that Python function does additional operations. Eventually this devolves into low-level multiplication and division, but there is slow and expensive Python code being called repeatedly for every single row.
Our initial attempt at using vectorization appears to be a complete success: the code runs vastly faster.
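To see that delegation concretely, here’s a minimal sketch (with made-up values) showing that a numeric Series is backed by a NumPy array, so arithmetic on it runs as a machine-code loop rather than per-element Python:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# A numeric Series is backed by a NumPy array:
arr = s.to_numpy()
print(type(arr))  # <class 'numpy.ndarray'>

# Arithmetic on the Series is handed down to NumPy's native loops;
# no Python-level code runs per element:
result = 100 * (s / 2)
print(result.tolist())  # [50.0, 100.0, 150.0]
```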
Vectorization in strings
Let’s see how vectorization does on strings. Pandas provides a `.str` object on Series that lets you run various vectorized operations on strings.
As an example, we’re going to calculate how many words there are in each sentence in a Series of sentences:
```python
# Vectorized operation:
df["sentence_length"] = df["sentences"].str.split().apply(len)

# Non-vectorized operation:
def sentence_length(s):
    return len(s.split())

df["sentence_length2"] = df["sentences"].apply(sentence_length)
```
The resulting run times:
```
Vectorized:     1.492 secs
Non-vectorized: 0.280 secs
```
The vectorized code is much slower! What’s going on?
Let’s measure the code with the Sciagraph profiler. Here’s a timeline of execution:
It looks like in the vectorized path the code eventually devolves to a Python function that is applied to every row—just like our non-vectorized code:
```python
def _str_split(
    self,
    pat: str | re.Pattern | None = None,
    n=-1,
    expand=False,
    regex: bool | None = None,
):
    # ...
    f = lambda x: x.split(pat, n)
    # ...
    return self._str_map(f, dtype=object)
```
So in this case, unlike the numeric calculations, the underlying implementation is in the end the same Python code, with a lot of overhead added on.
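A small sketch consistent with this (using made-up sentences): the “vectorized” `.str.split()` and a plain Python-level `apply()` produce identical results, since both end up splitting each string in a Python loop:

```python
import pandas as pd

s = pd.Series(["a b c", "d e"])

# Both paths split each string in Python; the results are identical:
vectorized = s.str.split()
plain = s.apply(lambda x: x.split())
print(vectorized.equals(plain))  # True
```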
This is why it’s so important to understand that vectorization has multiple meanings.
Vectorization in Pandas doesn’t necessarily mean the code will run faster; it just means the API lets you operate in batches, in this case on a whole Series.
Sometimes that means it’ll also be faster, by running the whole loop with fast native code—the second meaning of vectorization—as was the case with the numeric calculations above.
For strings, the code can be slower.
The memory-use implications of vectorization
So far we’ve been focusing on performance, but memory usage can also be a bottleneck. Use too much memory, and your program might swap and start running slowly, or even crash.
How does vectorization impact memory use? Using the Fil memory profiler, I measured memory usage of each of the two methods we used to count the number of words in sentences. Here’s the vectorized code:
And here’s the non-vectorized code:
The vectorized code uses vastly more memory.
Inspecting the profiling result above, and reading the original code, we can see what’s going on: a temporary Series is being created, containing lists of words.
The code we wrote is equivalent to:
```python
temporary_series = df["sentences"].str.split()
df["sentence_length"] = temporary_series.apply(len)
```
That temporary Series massively increases memory usage, by storing heavyweight Python objects (lists and strings), and we don’t even need it.
In the non-vectorized code we split a sentence string into a list, run `len()` on that, and then throw away the list. So we only use temporary memory for the words of a single sentence at a time, rather than all sentences at once: O(1) instead of O(N) temporary memory, where N is the number of rows. This results in far less memory usage.
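To make the difference visible without a full memory profiler, here’s a rough sketch using Pandas’ own `memory_usage(deep=True)` on made-up sentence data; the temporary Series of word lists dwarfs the Series of integer results:

```python
import pandas as pd

# Illustrative data; any column of sentences behaves similarly.
sentences = pd.Series(
    ["the quick brown fox jumps over the lazy dog"] * 10_000
)

# Vectorized path: materializes a temporary Series holding every
# list of words at once -- O(N) extra memory.
temporary = sentences.str.split()
print(temporary.memory_usage(deep=True))

# Non-vectorized path: each split list is discarded after len(),
# so only the integer results are kept.
lengths = sentences.apply(lambda s: len(s.split()))
print(lengths.memory_usage(deep=True))
```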
Takeaways
- Be aware of the multiple meanings of vectorization. In Pandas, it just means a batch API.
- Numeric code in Pandas often benefits from the second meaning of vectorization, a vastly faster native code loop.
- Vectorization in strings in Pandas can often be slower, since it doesn’t use native code loops.
- Vectorization can result in temporary Series, with a corresponding increase in memory usage proportional to the size of the data being processed.
More broadly: if you care about performance and memory usage, you need to measure it! Some results above I did not expect; some of the solutions I tried didn’t work and were therefore omitted.
Without measurement, you have no insight into where the bottlenecks are. For offline memory profiling, you can use the open source Fil profiler, and for offline performance profiling you can use py-spy or Austin. For production performance and memory profiling of Python batch jobs, try out Sciagraph.