The best way to find performance bottlenecks: observing production
Your customers are complainin’, your monitors are alertin’, your thumbs are a-twiddlin’—whatever the symptom, the problem is that your application is too slow. And you want to find out why, so you can fix it.
You could spin up your application on your laptop, do some benchmarking, and try to find the bottleneck. Sometimes, that’s all it takes, but quite often, local testing tells you nothing useful. For many performance bottlenecks, the only way to identify the problem is to measure in production.
To understand why, we’ll need to go over the many ways production might differ from your laptop. In this article, we’ll be focusing on server-based applications where your organization has set up the production environment (including cloud hosting); applications run by third parties involve even more difficulties.
Some of the differences between development and production that we’ll mention include:
- CPU, memory, and disk speeds.
- Network latency.
- Data inputs.
- Database contents and configuration.
- Just a hint of the complexities involved with cloud services, and more.
- Software versions.
Why you can’t reproduce performance problems locally: a very partial list
There are many different reasons why performance problems in production might not show up in local testing. Being able to identify these reasons can be helpful in debugging performance problems, both as potential problems to check on, and as ways to make local testing more realistic.
Keep in mind, however, that what follows is only a partial list.
Hardware resources: CPU
Your local computer typically has a different CPU and core count than your production server. This can result in a variety of differences, some more expected than others:
- Different core speeds: A 12th-generation Intel desktop, or any of Apple’s M1/M2 machines, probably has 1.75×-2× the single core speed of most (all?) servers you are likely to run in production.
- Different number of CPU cores: Your development machine might only have 4 cores, where your server has 32. Or perhaps it’s the other way around, with your server running on a 4-core VM and your local machine having 32 cores. The performance effects can be unexpected; for example, there is a fun story about how a 24-core server was slower than a laptop with 4 cores.
- Different CPU architectures: The performance characteristics of an ARM M2 chip from Apple are different than those of an x86_64 server; sometimes this won’t matter, sometimes it will matter a lot.
- Different CPU micro-architectures: Even if everything is x86_64, your local machine might not support the same SIMD operations as the server you’re running on, which matters for some data processing applications that can use SIMD to significantly speed up calculations.
- Thermal and power throttling: Your laptop might stop running at full speed in order to prevent overheating, or conserve power when unplugged. A server in a cooled data-center doesn’t have those issues.
To maximize realism of local testing, you’ll want a desktop machine with as similar a CPU as possible (architecture, number of cores, etc.) to the one you’re running in production. If you still want to use a laptop, make sure it’s plugged in!
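When comparing machines, it can help to record the basic CPU facts from both environments. The following is a minimal sketch using only the Python standard library; the function name `describe_cpu` is hypothetical, and it does not detect micro-architecture features such as SIMD support or CPU quotas imposed by a container runtime.

```python
import os
import platform


def describe_cpu():
    """Collect basic CPU facts worth comparing between dev and production."""
    info = {
        "architecture": platform.machine(),  # e.g. "x86_64" or "arm64"
        "logical_cores": os.cpu_count(),
    }
    # On Linux, the process may be pinned to a subset of cores (for example
    # inside a container); sched_getaffinity() reports that subset.
    if hasattr(os, "sched_getaffinity"):
        info["usable_cores"] = len(os.sched_getaffinity(0))
    else:
        info["usable_cores"] = info["logical_cores"]
    return info


if __name__ == "__main__":
    for key, value in describe_cpu().items():
        print(f"{key}: {value}")
```

Running this on your laptop and on a production server gives you a quick first check on whether the two are even roughly comparable.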
Hardware resources: Memory and disk
Your local computer may well have a very different amount of RAM than your server. Run out of RAM, and your program might run slowly due to swapping, or crash in a variety of ways, so hitting low memory at different RAM levels can significantly impact performance.
Your local disk may also run at different speeds. For example, depending on what you provision, an EBS disk on AWS might range from 250 IOPS to 64,000 IOPS.
Again, for maximum realism you can try to match the hardware specs of production.
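One rough way to compare disks is to time small synchronous writes in each environment. This is only a sketch: the function name `time_synced_writes` and the default sizes are assumptions chosen for illustration, and a real benchmark would need to account for caching, filesystem type, and concurrent load.

```python
import os
import tempfile
import time


def time_synced_writes(directory, count=50, size=4096):
    """Return the average seconds per write+fsync of `size` bytes."""
    payload = b"x" * size
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        start = time.perf_counter()
        for _ in range(count):
            f.write(payload)
            f.flush()
            # Ask the OS to push the data to the underlying device.
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    return elapsed / count


if __name__ == "__main__":
    average = time_synced_writes(tempfile.gettempdir())
    print(f"average synced write: {average * 1000:.3f} ms")
```

Point `directory` at the volume your application actually writes to; the system temporary directory may live on a different device, or in memory.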
Network latency
Your server might talk to a remote service in the same datacenter: a database, an object store like AWS S3, and so on. When doing local development, it’s easy enough to emulate this by either:
- Running it locally on your development machine, for example by running the postgres container image to get a local PostgreSQL server.
- Connecting from your local machine to a remote service in the cloud, to match production more closely, especially in cases where it is not easy to run the service locally.
While these strategies might be fine for development or finding bugs, they might not provide realistic performance data because of network latency: the time it takes messages to travel to a different server.
Let’s say your code sends 100,000 queries sequentially to a remote service like a database; how long will that take if you’re using one of these two strategies?
| Testing strategy | Latency | Overhead of 100K sequential queries |
|---|---|---|
| Local machine | 0.05 ms | 5,000 ms |
| Remote datacenter | 50 ms | 5,000,000 ms |
That’s a huge span, and it’s not clear whether either of those would actually match the network latency within the datacenter. A local network might only have a 1 ms latency, but cloud datacenters have much more complex networking architectures; the AWS S3 documentation suggests a latency of 100-200 ms per query, though not all of that is network overhead.
If you want an accurate measure of network latency, you need to measure in production, or at least in a staging environment similar to production.
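The arithmetic behind the latency table is simple multiplication, and the same approach works with a latency you have measured yourself. Here is a minimal sketch; both function names are hypothetical, and the `time.sleep()` call is a stand-in for a real round trip to your service.

```python
import statistics
import time


def measure_latency_ms(operation, samples=20):
    """Call `operation` repeatedly and return the median duration in ms."""
    durations = []
    for _ in range(samples):
        start = time.perf_counter()
        operation()
        durations.append((time.perf_counter() - start) * 1000)
    return statistics.median(durations)


def sequential_overhead_ms(latency_ms, query_count):
    """Total waiting time if queries are sent one after another."""
    return latency_ms * query_count


if __name__ == "__main__":
    # Stand-in for a real query; replace with an actual call in staging.
    def fake_query():
        time.sleep(0.001)

    latency = measure_latency_ms(fake_query)
    total = sequential_overhead_ms(latency, 100_000)
    print(f"median latency: {latency:.2f} ms")
    print(f"100K sequential queries: {total:,.0f} ms")
```

The median is used rather than the mean so that a few unusually slow round trips don’t dominate the estimate, though in production the slow outliers may be exactly what you care about.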
Data inputs
The specifics of data inputs can make a huge difference to performance outcomes. For example:
- Quadratic algorithms are easy to create by mistake, and might run quickly with test data and extremely slowly with only slightly-larger real-world data.
- Real-world inputs might be the same size as test inputs, but with very different noise or data distribution.
- Real-world inputs might vary significantly, with some inputs running quickly and others running slowly.
As a result, performance problems may only be reproducible with real inputs, or at least semi-realistic inputs. This means you need some way to obtain them from your production environment. If only some inputs cause problems, you will also need to identify which inputs are problematic, which you can only do in production.
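To make the quadratic case concrete, here is a small illustrative example: a duplicate check that compares every pair of items, next to a set-based version that makes a single pass. The function names are made up for this sketch, and counting comparisons stands in for measuring time.

```python
def has_duplicates_quadratic(items):
    """Compare every pair; returns (found, number_of_comparisons)."""
    comparisons = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            comparisons += 1
            if items[i] == items[j]:
                return True, comparisons
    return False, comparisons


def has_duplicates_linear(items):
    """Set-based check; requires hashable items."""
    return len(set(items)) != len(items)


if __name__ == "__main__":
    for size in (100, 1_000):
        _, comparisons = has_duplicates_quadratic(list(range(size)))
        print(f"{size:>6} items -> {comparisons:>9,} comparisons")
```

With 100 unique items the quadratic version does 4,950 comparisons; with 1,000 items it does 499,500. Ten times the data means roughly a hundred times the work, which is why a test dataset can hide the problem entirely.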
Database contents and configuration
If your server relies on a database, the contents of the database may significantly impact performance. To give a common example, imagine a simple query that can use an index if it’s available, but will otherwise require a table scan (i.e. looking at every record in the table):
| Table size | Ops with index | Ops without index |
|---|---|---|
If you’re using a small test database, the lack of an index simply won’t show up as a performance problem. In production with a large database, however, it can have a massive performance impact.
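You can see the difference between an index lookup and a table scan by asking the database for its query plan. The sketch below uses SQLite from the Python standard library purely as an illustration; the table, column, and index names are invented, and other databases such as PostgreSQL have their own `EXPLAIN` output formats.

```python
import sqlite3


def build_demo_db(row_count=1000):
    """Create an in-memory table of users with no index on email."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)"
    )
    connection.executemany(
        "INSERT INTO users (email, name) VALUES (?, ?)",
        ((f"user{i}@example.com", f"User {i}") for i in range(row_count)),
    )
    return connection


def query_plan(connection, sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the human-readable detail.
    return "; ".join(row[-1] for row in rows)


if __name__ == "__main__":
    db = build_demo_db()
    lookup = "SELECT * FROM users WHERE email = ?"
    args = ("user500@example.com",)
    print("without index:", query_plan(db, lookup, args))
    db.execute("CREATE INDEX idx_users_email ON users (email)")
    print("with index:   ", query_plan(db, lookup, args))
```

Before the index exists the plan reports a scan of the whole table; afterwards it reports a search using the index. On a table this small both are fast, which is exactly why the missing index goes unnoticed in testing.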
Even if the table is the same size, different contents, tuning, and configurations might also impact performance.
Cloud services’ contractual performance limits
When you use services from a cloud provider, you will get performance behavior that, while documented, does not necessarily match what you would get from a normal computer. For example, some services in some configurations will include bursty performance:
- Some AWS EBS disk configurations will allow short bursts of faster I/O, but will then revert to a slower baseline. Your local disk, in contrast, will typically give you consistent behavior over time.
- CPU usage might also be bursty for some configurations: this is fine for short-lived queries in a web server, but less so for long-running computations. While local CPUs might have this behavior with things like “turboboost”, the constraint is typically related to temperature, rather than a contractually agreed upon and enforced-in-software lower level of service (typically in return for paying a lower price).
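To see why bursty limits matter for benchmarking, consider a deliberately simplified credit model. Everything here is an assumption for illustration: the function name, the model itself, and the numbers are hypothetical and do not describe any specific provider’s actual accounting.

```python
def seconds_to_complete(total_ops, baseline_iops, burst_iops, burst_credits):
    """Time to finish `total_ops` under a simplified burst-credit model.

    Assumed model: bursting drains credits at (burst_iops - baseline_iops)
    per second; once credits run out, throughput drops to baseline_iops.
    Requires burst_iops > baseline_iops.
    """
    burst_seconds = burst_credits / (burst_iops - baseline_iops)
    ops_during_burst = burst_iops * burst_seconds
    if total_ops <= ops_during_burst:
        return total_ops / burst_iops
    remaining_ops = total_ops - ops_during_burst
    return burst_seconds + remaining_ops / baseline_iops


if __name__ == "__main__":
    # Hypothetical limits: 100 IOPS baseline, 3,000 IOPS burst.
    limits = dict(baseline_iops=100, burst_iops=3000, burst_credits=290_000)
    for ops in (30_000, 3_000_000):
        seconds = seconds_to_complete(ops, **limits)
        print(f"{ops:>9,} ops: {seconds:>8,.0f} s ({ops / seconds:,.0f} ops/s)")
```

Under these made-up numbers, a short 30,000-operation benchmark finishes in 10 seconds at the full burst rate, while a 3,000,000-operation job takes 27,100 seconds, most of it at the baseline. A quick test would overestimate sustained throughput by more than an order of magnitude.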
Cloud services’ empirical performance behavior
While some performance behavior is part of your contract when you choose a particular cloud service configuration, other behaviors are empirical. That is, while you can observe them, and might design around them, these performance characteristics are not guaranteed and might change, as they are tied to internal implementation details you have no insight into.
To give some examples:
- Object stores like AWS S3 map keys, a string like "myprefix/myobject", to a binary blob. S3’s performance documentation implies that the limits on parallel queries are per-prefix. This is an implementation detail, not necessarily tied to the abstract API; if you’re testing with a local S3-like system like MinIO you will have different performance characteristics, and specific production limits might change over time in ways you don’t control.
- Different filesystems have different performance characteristics. If you’re using a network filesystem like AWS EFS, or an automatically-mounted filesystem like Fly.io’s volumes, the underlying filesystem is not something you get to pick. You might be able to ask, or figure it out empirically, but the specific performance characteristics (can you have 10,000 files in a folder without slowing down?) are not necessarily guaranteed, and might change.
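Questions like the one about many files in a folder can be checked empirically on the filesystem in question. The following is a minimal sketch with a hypothetical function name; it only times directory listing, and a fuller test would also cover file creation, lookup by name, and deletion.

```python
import os
import tempfile
import time
from pathlib import Path


def time_directory_listing(directory, file_count):
    """Create `file_count` empty files, then time a full listing."""
    for i in range(file_count):
        (Path(directory) / f"file_{i:06d}").touch()
    start = time.perf_counter()
    with os.scandir(directory) as entries:
        listed = sum(1 for _ in entries)
    return listed, time.perf_counter() - start


if __name__ == "__main__":
    for count in (100, 10_000):
        # Replace the temporary directory with a path on the volume you
        # actually want to measure.
        with tempfile.TemporaryDirectory() as scratch:
            listed, seconds = time_directory_listing(scratch, count)
            print(f"{listed:>6} files listed in {seconds * 1000:.2f} ms")
```

Since these characteristics are not guaranteed, a result from today is an observation, not a promise about next month.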
Software versions
Reproducible builds can ensure that your software is the same across environments. But that’s just part of the software involved in running an application: there’s the operating system kernel, hypervisors and/or container runtimes, container orchestration where relevant, third-party services like databases… different software versions in any of these might, at times, have a performance impact.
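One cheap habit is to log the versions you can see from inside the process at startup, so you can compare environments later. This sketch uses only the standard library; the function name `environment_fingerprint` is hypothetical, and it cannot see hypervisors, orchestration layers, or remote services.

```python
import platform
from importlib import metadata


def environment_fingerprint(packages=()):
    """Return versions of Python, the OS kernel, and selected packages."""
    info = {
        "python": platform.python_version(),
        "os": platform.system(),
        "kernel": platform.release(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    # The package list is an example; use the ones your application needs.
    for key, value in environment_fingerprint(["numpy", "pip"]).items():
        print(f"{key}: {value}")
```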
Performance should be measured in production
Given all of the above, measuring performance in production is far more likely to give you accurate information than testing locally. You can extract performance information from trace-based logging, or from normal logging if you can’t manage that. You can also use generic always-on profiling tools like Pyroscope or Prodfiler or, if you’re running Python data processing jobs, a more specialized service like Sciagraph. Ideally you will have access to both trace-based logging and profiling.
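If normal logging is all you have, even a small timing helper can tell you which operations are slow in production. The following is a minimal sketch, not a substitute for a tracing library; the `timed` helper, the logger name, and the log format are all assumptions made for this example.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("timing")


@contextmanager
def timed(operation, **fields):
    """Log how long the wrapped block took, plus any extra fields."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failures are timed too.
        elapsed_ms = (time.perf_counter() - start) * 1000
        extra = " ".join(f"{k}={v}" for k, v in sorted(fields.items()))
        logger.info(
            "operation=%s elapsed_ms=%.1f %s", operation, elapsed_ms, extra
        )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    with timed("load_records", source="example"):
        time.sleep(0.05)  # stand-in for real work
```

Including fields such as an input size or a customer identifier makes it possible to find out later which inputs were the slow ones.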
At the very minimum, observation in production will often be critical in narrowing down your focus to a solution you can then simulate in a development environment. Beyond that minimum, with good observability you might be able to pinpoint the exact problem immediately.
And if you’re running Python-based data pipelines and want to understand their performance and memory bottlenecks in production, with the real environment, real data, and real inputs, check out the Sciagraph profiler. It supports profiling both in development and in production.