Finding performance bottlenecks in Celery tasks

When your Celery tasks are too slow, and you want them to run faster, you usually want to find and then fix the performance bottleneck. It’s true, you can architect a solution where slow tasks don’t impact faster ones, and you may sometimes need to. But if you can manage to make all your tasks fast, that is ideal.

To speed up your code, you need to identify the bottlenecks: which task is slow, and why it’s slow. In this article we’ll go through the process for finding the performance bottleneck in your Celery tasks, explaining some of the tradeoffs and options along the way:

  1. Identifying slow tasks.
  2. Finding predictable-in-advance performance bottlenecks with code instrumentation.
  3. Profiling to find harder-to-predict bottlenecks, either:
    • On a development machine.
    • Or, in production.

1. Identifying slow tasks

If you are running multiple different tasks, you may need to figure out which particular task is the slow one, depending on how you discovered you have a problem. If you’re running a mixture of tasks A, B, and C, and the initial symptom you spotted was high CPU on the worker machine, that doesn’t tell you which task is the problem, just that you have one or more that are slow.

To figure out which tasks are slow you can, to begin with, look at the Celery worker’s logs, specifically start and end times of tasks. Make sure you’ve started the worker with --loglevel INFO or higher, so that it records this information.

[2022-07-21 14:53:06,457: INFO/MainProcess] Task celery_tasks.mytask[d5707ad7-5640-4034-b39a-b038e0754281] received
[2022-07-21 14:53:11,100: INFO/MainProcess] Task celery_tasks.mytask[d5707ad7-5640-4034-b39a-b038e0754281] succeeded in 4.642702493991237s: None

Some sort of tracing or “application performance monitoring” (APM) service can also come in handy here, as a more sophisticated form of logging.

Other options include:

  • Using the Flower monitoring tool for Celery.
  • Looking at the upstream logs of whatever code is scheduling the tasks.

Once you’ve identified which task is slow, the next step is finding out why it’s slow.

2. Using logs and tracing to spot predictable performance bottlenecks

In some cases, logging, tracing, or APMs can help you find bottlenecks. However, the former two can only ever give you information about code that was specifically instrumented, and that is true in most cases for APMs as well.

If the performance bottleneck happens to be captured by a place where your code is logging or otherwise instrumented, you’ve found the problem! You can now switch to fixing it. For example, APMs will usually auto-instrument commonly used libraries for remote interactions, like requests for HTTP calls or sqlalchemy for SQL. So if your performance problem is a slow SQL query, that sort of instrumentation will find it pretty quickly.

However, it’s difficult—and often impractical for performance or readability reasons—to instrument every single operation in your software, let alone every line of code. This means you’ll have logging or tracing instrumentation only in places someone has decided in advance are likely to be a source problems.

Sometimes, unfortunately, your code is slow for harder to predict reasons. And that means you probably won’t have sufficient logging or tracing to identify the bottleneck. That’s where profiling becomes useful.

3. Profiling Celery tasks

Profiling involves inspecting potentially any part of the code that’s running, with no need for advanced instrumentation. Some profilers like cProfile and VizTracer will inspect every single function and line of code, but that adds significant performance overhead. So usually you’ll be using sampling profilers, which inspect running code at fixed intervals. Slow tasks tend to show up more in the resulting samples, so this works well in practice.

You can profile code on your development machine or laptop, or in production.

Profiling on a development machine

Profiling on a non-production machine has some benefits:

  • You don’t have to worry about breaking production, slowing it down, or otherwise interfering with real usage.
  • You have full control of how code runs.

But there are downsides as well:

  • Running the code may require a variety of resources that might be more difficult to set up locally.
  • Performance problems may only happen with real inputs, or real data, e.g. the production database may have different performance characteristics because it has far more data than a test setup.
  • Other environmental differences may also hide performance bottlenecks, for example network latency may be different enough to skew results. Downloading lots of records from S3 will take a different amount of time in a cloud datacenter than it will from your home or office, and both will be different than the time it takes to download from a S3-emulating service you’re running on your laptop.

Let’s see how you can profile locally.

Typically, when calling your Celery tasks you might use something like delay():

from celery_tasks import mytask

def yourcode():
    async_result = mytask.delay(10_000)
    # ...
    result = async_result.get()

This schedules the task to run in a separate process, the worker.

You can however run a task directly in the current process, which makes profiling easier. For example, we can create a file

from celery_tasks import mytask

When we run this file, mytask() will run in the same process, rather than in a separate worker. We can then easily use a profiler like py-spy to profile it:

$ pip install py-spy
$ py-spy record python 
py-spy> Sampling process 100 times a second. Press Control-C to exit. 
py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'python-2022-07-21T13:43:50-04:00.svg'. Samples: 522 Errors: 0

The result looks like this:

This is a flamegraph; left/right positions don’t matter, but width does. The wider the frame, the higher a percentage of runtime was spent there. Here we can see that JSON encoding was the bulk of the time, with some time spent in a function called deepcopy_list() (click on the right-most stack to zoom in and see the details).

Profiling in production

Profiling in production is trickier than on your development machine.

  • You don’t want to break production!
  • Slowing down production even more isn’t great either.
  • It’s harder or impossible to control when specific tasks are run.
  • A worker using the gevent or eventlet pool can be running multiple tasks interleaved in a single thread.
  • A worker using the threading pool can be running multiple tasks in the same process, so process-level profiling might mix different tasks.

There are multiple options you can take to profiling in production.

Option #1: Attach to a running process

Assuming you’re using a solo or prefork/processes pool, each worker process will only be running a single task at a time. With some preconditions, you can then get profiling info from an already running worker:

  1. You know the task you want to profile has started, or that is about to start (perhaps you can trigger it?)
  2. You know, or can check, which worker will be running it (perhaps you’ve made sure there’s enough work queued so only one worker is free?)
  3. You have terminal access to the machine or container running the worker, using SSH, docker exec, kubectl exec, or some other mechanism.
  4. You have root, or, in containerized workloads, you’ve configured the container with the CAP_SYS_PTRACE capability.

Then, you can use py-spy to attach to the relevant, already running process around the time the task starts. You let it run until the task ends, at which point you hit Ctrl-C and get a profiling report:

$ pip install py-spy
$ ps xa | grep celery
 576600 pts/7    S+     0:00 celery worker --pool prefork --concurrency 4
 576602 pts/7    S+     0:00 celery worker --pool prefork --concurrency 4
 576603 pts/7    S+     0:00 celery worker --pool prefork --concurrency 4
 576604 pts/7    S+     0:00 celery worker --pool prefork --concurrency 4
 576605 pts/7    S+     0:00 celery worker --pool prefork --concurrency 4
 578258 pts/8    S+     0:00 grep --color=auto celery
$ py-spy record --pid 576602
Permission Denied: Try running again with elevated permissions by going 'sudo env "PATH=$PATH" !!'
$ sudo env "PATH=$PATH" py-spy record --pid 576602
py-spy> Sampling process 100 times a second. Press Control-C to exit.

py-spy> Stopped sampling because Control-C pressed
py-spy> Wrote flamegraph data to '576602-2022-07-21T14:04:05-04:00.svg'. Samples: 89 Errors: 0

You can also similarly use the Austin profiler to attach to running processes.

Option #2: Instrumentation inside the process

The pyinstrument profiler lets you profile only a particular part of your code. Instead of attaching to a running process, then, you can modify you task in advance to run with profiling, and then whenever it runs in production it will dump profiling info to disk.

You would modify you tasks Python module to profile the specific task you care about; notice that unlike logging, you don’t have to tell it what to profile in between start() and stop(); whatever slow functions run in between will end up being profiled.

# ... pre-existing imports, creation of app, etc. ...
from pyinstrument import Profiler

def mytask(length):
    profiler = Profiler()
    # ... your actual code ...

Now, if you redeploy and then run your celery worker as normal in production, the logs will include profiling information, for example:

4.293 mytask  ./sandbox/celery-tasks/
├─ 2.518 dumps  json/
│     [7 frames hidden]  json, ..
│        2.418 iterencode  json/
├─ 0.858 [self]  
├─ 0.480 deepcopy_list  ./sandbox/celery-tasks/
│  └─ 0.480 <listcomp>  ./sandbox/celery-tasks/
├─ 0.211 TextIOWrapper.write  ./sandbox/celery-tasks/<built-in>:0
│     [2 frames hidden]  ..
├─ 0.151 list.append  ./sandbox/celery-tasks/<built-in>:0
│     [2 frames hidden]  ..
└─ 0.075 fsync  ./sandbox/celery-tasks/<built-in>:0
      [2 frames hidden]  ..

Again, we can see that JSON encoding is using much of the time. You can also output HTML reports to disk, see the pyinstrument documentation for details.

The problem with this approach is that pyinstrument, like most profilers, is not designed for production usage; using it in production might have a significant performance impact, depending on where your code’s bottlenecks are.

Bonus option: Add lots more logging

If neither of the above approaches works, you also have the option of adding more and more logging or tracing. Eventually based on production logs you will manage to isolate the part of the code that is the bottleneck, at which point you can switch to local machine profiling if the issue is CPU and you haven’t been able to isolate it further.

Measurement is hard, but often necessary

Some performance problems are so obvious you won’t spend any time findings them.

Other, less obvious but still predictable in advance, can be caught with good logging or tracing. Slow SQL queries are a classic example.

Then, there are problems that will only be caught by measuring performance with a profiler. If you’re lucky, you’ll be able to do profiling on your development machine. But if not, you’ll need to measure performance in production, where the code is running. As middle option, a staging or testing environment that is as close as possible to production may also suffice.

If none of the above options for measuring performance in production are attractive, you can also try out Sciagraph, a performance observability service for Python batch jobs. It’s designed to run in production (with no need for root access), it has low performance overhead so it won’t impact runtime, and it has built-in Celery support.