Speed up your Python data processing workflows with the Sciagraph profiler

Whether it’s detecting disease, modeling the electric grid, or whatever data processing or analysis you do with Python, inefficient code is a cost you can’t afford to pay:

  • If it takes 30 minutes to run your code, debugging minor changes can waste your whole afternoon.
  • Run out of memory, and your program is dead; you’re not getting any results until you fix that.
  • Once you’re running in production at scale, inefficient software means throwing money at your cloud provider. You probably need that money more than they do.

On the other hand, the faster your software, the easier it will be for you to iterate and improve. And the faster your software, the happier your users (and accountant) will be.

Profilers can help you find speed and memory bottlenecks in your code; instead of guessing, you can quickly fix the problem. Unfortunately, profilers that work well for web applications don’t necessarily work as well when it comes to data processing. You need a profiler designed for your kind of software.

Sciagraph is a Linux and macOS profiler that gives you deep visibility into your Python code’s speed and memory usage—with a focus on data science, scientific computing, and data analysis software. It’s designed specifically for the needs of people like you, from measurements to visualizations to integrations (Jupyter, MLFlow, and Celery, with more coming soon).

What users are saying:

“Sciagraph both gave us excellent CPU flamegraphs that we have come to love, but also a memory flamegraph! Within about 15 minutes of installing Sciagraph, we had a memory profile that made it blatantly obvious where our suboptimal memory usage was. Within about an hour, we had things fixed and deployed.

Perhaps the best bang for buck I’ve spent on a dev tool license in a long time.”

—Luke Hsiao, Numbers Station

What’s new:
  • June 21, 2023: Fixed bug where non-Python subprocesses would fail with an undefined symbol error.
  • June 21, 2023: Visualize differences between two profiling reports with the new sciagraph-diff tool.

See the full release notes for details.

Ready to speed up your code? Get started with Sciagraph’s free plan!

Identify performance bottlenecks in calculations, data loading, and more

Sciagraph gives you a timeline showing where your threads spent their time: both running on the CPU and waiting for locks, network communication, filesystem reads and writes, and so on.

Note: You’ll have an easier time viewing this on a computer, with the window maximized; the output is not designed for phones!

Above you can see the profiling report for a program that reads in some text, splits it into words, filters out certain words, and writes the result to JSON. This is a typical structure for data processing workflows: load the inputs → process the data → write out the output.

Wider and redder frames mean more time was spent in that part of the program. Hover your mouse over a frame to see the text; real reports also include zoom functionality.

In this example, you can see that:

  1. Reading the data was fast enough not to show up in the profiling.
  2. Processing the data, in this case filtering the words, was pretty CPU-intensive.
  3. Writing the data to disk was slow, but not because of CPU: it involved a lot of waiting. In this case, it’s because the program was writing to a remote filesystem.
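
If you want a feel for the kind of program being profiled here, the following is a minimal sketch, not the actual example program. The stop-word list, the function names, and the `timed` helper are all assumptions made up for illustration. It follows the load → process → write structure and compares wall-clock time with CPU time for each stage using only the standard library. A stage where wall-clock time is much larger than CPU time is mostly waiting, which is the pattern described for the write step above.

```python
import json
import time
from pathlib import Path

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of"}


def load_text(path):
    """Load the inputs: read the whole file as text."""
    return Path(path).read_text(encoding="utf-8")


def filter_words(text, stop_words=STOP_WORDS):
    """Process the data: split into words and drop the unwanted ones."""
    return [word for word in text.split() if word.lower() not in stop_words]


def write_json(words, path):
    """Write out the output as a JSON list."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(words, f)


def timed(stage, *args):
    """Run stage(*args) and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = stage(*args)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, wall, cpu


def run_pipeline(input_path, output_path):
    """Run the three stages; return the words and per-stage (wall, cpu) times."""
    text, load_wall, load_cpu = timed(load_text, input_path)
    words, filter_wall, filter_cpu = timed(filter_words, text)
    _, write_wall, write_cpu = timed(write_json, words, output_path)
    timings = {
        "load": (load_wall, load_cpu),
        "filter": (filter_wall, filter_cpu),
        "write": (write_wall, write_cpu),
    }
    return words, timings
```

Per-stage timers like this only give you totals for the stages you thought to wrap. A profiler's value is that it shows the breakdown for every function without you having to guess where to measure.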

Discover where and why you’re using too much memory

Sciagraph also reports peak memory usage, the high water mark:

You can click on a frame to zoom in on a stack trace; wider and redder frames mean more memory usage.

This example shows the memory usage report for the same program. There are three main sources of memory usage, the most significant being parsing the input file into words.
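
To make "peak memory" concrete, here is a rough sketch using Python's standard-library `tracemalloc` module. This is not how Sciagraph measures memory, and the helper names `peak_memory_bytes` and `split_words` are made up for illustration. `tracemalloc` only sees allocations that go through Python's own allocators and reports a single number rather than a per-stack-trace breakdown, but it does show what a high water mark is: the largest amount allocated at any one moment, even if some of it has been freed by the time the function returns.

```python
import tracemalloc


def peak_memory_bytes(func, *args):
    """Run func(*args); return (result, peak bytes of traced Python allocations)."""
    tracemalloc.start()
    try:
        result = func(*args)
        # get_traced_memory() returns (current, peak) in bytes since start().
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def split_words(text):
    """Parse text into a list of words (the memory-heavy step in this sketch)."""
    return text.split()


if __name__ == "__main__":
    text = "word " * 100_000
    words, peak = peak_memory_bytes(split_words, text)
    print(f"{len(words)} words, peak of about {peak / 1_000_000:.1f} MB")
```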

Profile on your laptop—and in production!

Reproducing production performance problems on your laptop is often difficult, and sometimes impossible. And even if you can measure in production, reproducing a problem after the fact is not easy.

Unfortunately, many profilers are simply not designed to run in production. And those that are designed to run in production focus on—you guessed it!—web applications.

That’s why you can use Sciagraph both during development, and optionally in production, with always-on continuous profiling:

  • Runs with low overhead, so you can leave it always on by default. No more after-the-fact scrambling to reproduce the problem!
  • Designed to be easy to set up and highly reliable with real workloads.
  • Securely store profiling reports in the cloud so you don’t need to configure storage yourself.

More features

  • Multiprocessing support: Profile single-threaded, multi-threaded, and multiprocessing workflows.
  • Fast setup: For simple Python processes, using Sciagraph may be as easy as setting an environment variable. And with Jupyter, MLFlow, and Celery integration built-in, and other framework support planned, Sciagraph is designed to work out of the box with the frameworks you use. Need support for another framework? Send me an email to get it prioritized.

Optional support for continuous, always-on profiling in production:

  • Fast and robust enough to run in production: Sciagraph is designed to minimize impact on your job’s performance, so you won’t even notice it’s there until you need it.
  • Cloud storage for reports: No need to spend time figuring out how to store profiling reports: they can be automatically and securely stored in the cloud. That means you can easily access performance reports even if your runtime environment is ephemeral, like a container.
  • Data privacy: Profiling reports never include user data, and when they’re uploaded to Sciagraph’s cloud storage they are encrypted end-to-end, so no one but you can access them.
