Speed up production Python data-processing jobs with always-on profiling

You’re running a data processing batch job written in Python, and it’s far too slow, not to mention that your compute costs are way too high. Now you have questions:

  • Why is the code slow?
  • Why is it using so much memory?

Here’s a personal example: I once worked on a scientific image-processing pipeline that would take 8-12 hours to finish. This was too slow—and much too expensive. A quick back-of-the-envelope calculation showed we were going to spend 70% of the company’s projected revenue just on cloud computing.

Needless to say, I spent the next month optimizing the code!

And it turns out that identifying the underlying performance and memory problems in long-running batch jobs can be slow and difficult:

  • Insufficient tooling: You may be missing the necessary tooling to understand what’s going on inside the process.
  • Production is different from dev: Running in production is never quite the same as running on a developer’s laptop; the compute resources are different, network bottlenecks are different, and so on.
  • Data-specific problems: Often performance or memory problems are tied to specific datasets. So you need to identify the problem dataset, and then expose it to your development environment with the same access and performance characteristics as production.
  • Slow feedback loop: A slow-running batch process means your feedback loop is also slow. You deploy a potential fix, wait an hour (or a day!) for results from production, and if that didn’t work, you have to reproduce the problem again and figure out what went wrong.

Profiling production jobs—in production

Is there an easier way to profile your production jobs?

Here’s one idea: let’s pretend for a moment that you have access to a time machine. Whenever you realized a batch job was slow, you could go back in time, enable some profilers, and then travel back to the present. Then, when the batch job finished, you would have an exact record of performance and memory usage, exactly as it ran in production.

Only problem is, time machines don’t exist.

But there’s an alternative that can work: enabling a profiler on all your production jobs from the start, by default. Then you would have full performance and memory profiling for every batch job you ran. When a job turned out to be too slow, used too much memory, or pushed your compute budget a little too high, you could go back to those reports and immediately identify and then fix the problem.

But here we encounter another obstacle: most profilers have high performance overhead, and are not designed to be run in production. So enabling them from the start can slow down your code, and make it more likely to crash or fail.
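To get a sense of that overhead, here’s a small sketch using Python’s built-in cProfile, which instruments every function call. On call-heavy code, which is exactly what much real data-processing code looks like, the slowdown is easy to measure; the exact ratio will vary by machine:

```python
import cProfile
import time

def square(x):
    return x * x

def work():
    # Lots of small function calls, like much real data-processing code
    return sum(square(i) for i in range(200_000))

# Time the function on its own
start = time.perf_counter()
work()
plain = time.perf_counter() - start

# Time it again under cProfile, which hooks every function call
profiler = cProfile.Profile()
start = time.perf_counter()
profiler.runcall(work)
profiled = time.perf_counter() - start

print(f"plain: {plain:.3f}s, under cProfile: {profiled:.3f}s "
      f"({profiled / plain:.1f}x slower)")
```

That kind of multiplier is tolerable for a one-off debugging session, but not something you’d want running on every production job.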

Fil4prod: an always-on, continuous profiler for data science and scientific computing

This is where the Fil4prod profiler can help.

  • For Python batch jobs: Specifically designed for long-running data processing Python batch jobs, like data pipelines or scientific computing.
  • Performance profiling: Identifies where time is being spent, and whether that time is spent running or waiting.
  • Memory profiling: Reports the sources of allocations at peak memory usage, so you can optimize the relevant bottlenecks.
  • Low overhead: For many data processing jobs the overhead should be negligible.
  • Designed for production: Designed from the very start to be safe and reliable when running in the real world.

Gain performance insight into your code, as it runs in production

Let’s take a look at some examples of what Fil4prod can tell you.

Here we can see two Python threads fighting over CPython’s Global Interpreter Lock, which prevents more than one Python thread from running Python code at a time; wider and redder frames mean more time taken. You can click on a frame to get a traceback.
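This kind of GIL contention is easy to reproduce yourself: two CPU-bound Python threads take about as long as running the same work sequentially, because only one thread can execute Python bytecode at a time. A minimal sketch (exact timings will vary by machine):

```python
import threading
import time

def cpu_bound(n):
    # Pure-Python arithmetic holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 2_000_000

# Run the work twice, sequentially
start = time.perf_counter()
cpu_bound(N)
cpu_bound(N)
sequential = time.perf_counter() - start

# Run the same work in two threads "in parallel"
start = time.perf_counter()
t1 = threading.Thread(target=cpu_bound, args=(N,))
t2 = threading.Thread(target=cpu_bound, args=(N,))
t1.start()
t2.start()
t1.join()
t2.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, two threads: {threaded:.2f}s")
```

If the threads could truly run in parallel you’d expect the threaded version to take roughly half the time; because of the GIL, it doesn’t.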

And here we can see where peak memory usage is coming from in a different program; again, wider and redder means more memory usage. You can click on a frame to get a traceback. Since this is a more complex flamegraph, I recommend viewing it full screen using the button below.
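You can get a rough, manual version of this kind of peak-memory reporting with the standard library’s tracemalloc module: it tracks overall peak traced memory, and a snapshot shows which lines allocated the memory that is still live. This is just to illustrate the idea; it is not how Fil4prod’s reporting works:

```python
import tracemalloc

tracemalloc.start()

def process():
    # The temporary list is freed before returning,
    # but it still contributes to *peak* memory usage
    temp = [x * 2 for x in range(1_000_000)]
    result = [x + 1 for x in temp]
    del temp
    return result

data = process()

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")

# Which lines allocated the memory that's still live right now
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```

Notice that peak is well above current: the temporary list was alive at the same time as the result, which is exactly the kind of hidden spike a peak-memory flamegraph surfaces.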

Get access to Fil4prod

Fil4prod is not yet generally available, but you can sign up below to be added to the early access list.

Get early access to Fil4prod

Sign up to get notified when you can try Fil4prod out.

In the interim, you'll join my newsletter and get weekly articles covering Python performance, memory optimization, Docker packaging, and other practical Python skills.