Faster hardware is a bad first solution to slow software
Your data pipeline is too slow, or uses too much memory. How should you speed it up?
One obvious solution is purchasing better hardware. With cloud computing, switching to a computer with more cores, or adding more RAM, can be done in minutes or seconds. Given that developer time is expensive, switching to more powerful hardware is often seen as a cheap first solution to slow software.
But there are longer-term costs involved that aren’t immediately visible. If your first solution to any performance problem is spending more money on hardware, you may eventually end up with software that is unnecessarily slow, hard to speed up, and extremely expensive.
So how do you decide if faster hardware is the correct solution to your software performance problems? In this article we’ll discuss:
- What money can buy you in terms of hardware.
- Why hardware won’t always help.
- Why faster hardware shouldn’t always be your first solution even when it does help.
- Changing the tradeoff by making it easier to create efficient software from the start.
Thanks to Moshe Zadka, Nelson Elhage, and Alex Gaynor for inspiring this article.
Speeding up your code with faster hardware
If your code is running too slowly, or using too much memory, you can spend some money to get access to more powerful hardware. There are two basic approaches:
- Renting hardware in the cloud: You can pay, by the minute, for access to virtual machines or even dedicated machines. You can choose the hardware configuration you want (CPU, RAM, disk, etc.) on a case-by-case basis, and you can spin resources up and down on demand.
- Buying hardware: If you expect to be using a single computer extensively, it can be cheaper to buy a computer rather than renting one. As an individual, when you buy a desktop machine you can also get far better performance for the same amount of money compared to a laptop. Some organizations also choose to build their own clusters or data centers.
Programmer time is expensive: in the US, programmers’ time might cost their employer $100/hour (or far more once you add in opportunity cost, since the expectation is that the value an employee produces is much higher than their cost). So when you can buy a very capable computer for $2000, or rent a virtual machine with 384GiB RAM and 96 vCPUs for $8.60/hour, it may seem cheaper to just solve performance problems by paying for faster hardware.
Unfortunately, there are significant caveats to using more expensive hardware as your first solution to performance problems.
Hardware isn’t always enough
Whether or not faster hardware will help depends on where the bottleneck is. Cloud machines can have as much as 24 TiB RAM (that’s 24576GiB!), and as many as 192 cores. So if you just need more RAM, or you’re dealing with software that can easily run in parallel, you can go pretty far by using more expensive hardware.
In other situations, there are hard limits on how fast your hardware can go.
Single-core speed is a hard limit
In some cases, your processing speed is tied to the speed of a single CPU core, because your software doesn’t parallelize well:
- This is a common occurrence in Python programs, due to the Global Interpreter Lock.
- It’s not just Python, though: on Linux, most linkers don’t take full advantage of multiple cores. The mold linker fixes this, and has some informative performance comparisons.
- Some algorithms are very difficult to parallelize.
- Even in software that can take advantage of multiple cores, some algorithms may parallelize only up to a certain number of cores, and then stop scaling due to the nature of the algorithm or the data.
This is a problem, because unlike the increasing number of cores available on modern CPUs, single-core performance hasn’t increased much over the past few years. Comparing a 2014-era 4-core Xeon CPU to what is available 8 years later:
- For just US$450, you can buy a CPU with 7× the multi-core performance. Spend enough money, and you can get as high as 20× the multi-core speed of this computer, and likely more in the near future.
- In contrast, after 8 years of R&D by Intel, AMD, and Apple, the CPU with the fastest single-core performance is only 2.5× faster on single-core performance.
Overall, single-core performance is going up much more slowly than multi-core performance. If single-core performance is your bottleneck, there’s a very hard limit on what performance improvements you can get from hardware, no matter how much money you have to spend.
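To get a feel for why the non-parallel part of a program caps what extra cores can buy, here is a rough sketch based on Amdahl’s law. Amdahl’s law is not part of the discussion above; it is a standard idealized model brought in for illustration, and the 90% parallel fraction is an assumed figure, not a measurement.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Theoretical speedup when only part of a program can use extra cores.

    parallel_fraction is the share of single-core runtime that parallelizes
    perfectly; the remainder stays serial no matter how many cores you add.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed workload: 90% of the runtime parallelizes, 10% does not.
for cores in (4, 16, 96, 192):
    print(f"{cores:>3} cores: {amdahl_speedup(0.9, cores):.1f}x speedup")
```

Under this model the speedup can never exceed 1 / 0.1 = 10×, however many cores you rent. Past that point only faster single-core hardware, or faster code, helps.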
The downsides of the hardware-first approach
Even if faster hardware helps, using it as your first solution to performance problems can have longer-term costs above and beyond paying for the hardware itself. These include:
- A culture of inefficiency.
- Horizontal scaling costs.
- Vertical scaling costs.
- Greenhouse emissions.
A culture of inefficiency
If you provide developers with a cluster that scales significantly, or just let them access VMs of any size, they will have little motivation to learn how to write fast code. For a small group of people, this may be fine. But with many people over time, you will end up with software that is far slower than it has to be.
Horizontal scaling: multiplicative costs
If you’re running a data processing job only a few times, paying an extra $5 for cloud computing is no big deal. But if you’re running the same job 1,000 times a month, that extra cost adds up to $60,000/year.
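The arithmetic is simple, but it can be useful to put it next to the developer cost quoted earlier. The sketch below uses only the figures from this article ($5 extra per run, 1,000 runs a month, $100/hour of developer time); the function name is made up for illustration.

```python
def yearly_extra_cost(extra_per_run: float, runs_per_month: int) -> float:
    """Extra cloud spend per year for a job that runs every month."""
    return extra_per_run * runs_per_month * 12


extra = yearly_extra_cost(extra_per_run=5.0, runs_per_month=1_000)

# How many hours of developer time would the same money pay for,
# assuming the $100/hour figure used earlier in the article?
developer_hours = extra / 100.0

print(f"Extra cloud cost: ${extra:,.0f}/year")
print(f"Equivalent developer time: {developer_hours:,.0f} hours")
```

Seen this way, an optimization that takes a few days of work can pay for itself many times over once the job runs often enough.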
Vertical scaling: hitting architectural breakpoints sooner
Once you hit the point where scaling on a single machine is a problem, if you want to keep scaling with hardware you need to make the leap to a distributed system. Switching to a distributed system may require significant changes to your software, and potentially a significant jump in complexity of debugging.
At this point the promise of hardware-only improvement is immediately undermined, since you probably also have to spend time changing the software. You might also see a regression in performance per machine, in which case the increases in hardware costs as you scale will be proportionally higher than they were when you were scaling a single machine.
Slower code forces this expensive architectural shift to happen sooner.
Consider two algorithms that solve the same problem, one requiring O(N) CPU time and one requiring O(N²) CPU time.
- Because the O(N) algorithm’s runtime grows so much more slowly, it will be usable within the same architectural paradigm for much larger input sizes.
- The O(N²) solution will not only use more resources while it still fits in a single computer, it will also force you to change paradigms much sooner once you need to start handling larger input sizes.
Switching to the O(N) algorithm is probably a better approach, and using it from the start would have been even better.
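One way to make the breakpoint concrete is to assume a single machine can perform a fixed budget of basic operations in the time you are willing to wait. The sketch below is an idealized model: constant factors are ignored, and the budget of 10¹² operations is an arbitrary assumption.

```python
import math


def max_input_linear(budget_ops: int) -> int:
    """Largest N an O(N) algorithm can handle within the operation budget."""
    return budget_ops


def max_input_quadratic(budget_ops: int) -> int:
    """Largest N an O(N^2) algorithm can handle within the operation budget."""
    return math.isqrt(budget_ops)


base_budget = 10**12  # assumed capacity of the current machine

for hardware_multiplier in (1, 4, 16):
    budget = base_budget * hardware_multiplier
    print(
        f"{hardware_multiplier:>2}x hardware: "
        f"O(N) handles N={max_input_linear(budget):,}, "
        f"O(N^2) handles N={max_input_quadratic(budget):,}"
    )
```

In this model, 16× the hardware lets the linear algorithm handle 16× the input, but the quadratic one only 4×, so the quadratic version runs out of single-machine headroom far earlier.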
Greenhouse emissions
Data centers are creating an increasing percentage of global greenhouse emissions. This is a negative side-effect that is not taken into account in the pricing of computing, but you still have an obligation to reduce it.
Shifting the tradeoff towards efficient software
So far we’ve been assuming a static tradeoff: you can spend money on hardware, or you can spend developer time to write more efficient code. You estimate the two costs, pick whichever is lower in your situation, and revisit when the situation changes.
But there’s another way to consider the problem. All other things being equal, a more efficient program is better than a less efficient program. So while faster hardware will continue to be the appropriate solution in many cases, it’s also worth considering how you can make your software more efficient by default.
1. Don’t assume hardware is the only solution
It’s very easy to assume that slow or inefficient software is inevitable and unavoidable. And if you believe that, you might not even consider how to make your software faster.
But in many cases it’s quite possible to get massive speedups in runtime. For example, here’s a scientist who sped up a computation by 50×, from 8 hours to 10 minutes. Only 2× of that was due to parallelism, so that’s a 25× single-core improvement. I give another example of a 50× speedup in my article on the difference between optimizing and parallelizing.
Similarly, software doesn’t have to use a lot of memory to process large datasets. Switching to a streaming or batch-based approach can move your memory usage from scaling linearly with data size to a small, fixed amount of memory, often with no impact on run time.
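A minimal sketch of the difference, assuming a non-empty text source with one number per line (the data layout and function names are made up for illustration):

```python
import io
from typing import TextIO


def mean_load_everything(f: TextIO) -> float:
    """Reads every value into a list first: memory grows with input size."""
    values = [float(line) for line in f]
    return sum(values) / len(values)


def mean_streaming(f: TextIO) -> float:
    """Keeps only a running total and count: memory use stays small and fixed."""
    total = 0.0
    count = 0
    for line in f:  # file-like objects yield one line at a time
        total += float(line)
        count += 1
    return total / count


# io.StringIO stands in for a real file opened with open(path).
data = "1\n2\n3\n4\n"
print(mean_load_everything(io.StringIO(data)))
print(mean_streaming(io.StringIO(data)))
```

Both functions do the same amount of arithmetic. Only the second keeps memory use independent of how many lines the input has.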
2. Reduce the costs of writing efficient software
Once you’re willing to accept that faster, more efficient software is possible, the question is how to do so without incurring extra development costs. How can you write more efficient software in the same amount of time? And how can you reduce the costs of optimizing existing software?
Improve your skills
Your ability to write efficient code is not fixed—you have the ability to improve it.
Focusing on memory usage as an example: if you’re parsing a large JSON file, using streaming JSON parsing will reduce memory usage significantly. And in many cases it’s not really any more work! You swap out two lines of code for two other lines of code, and structure your code very slightly differently. Mostly you just need to know the solution exists.
And to be clear, before I wrote that article, I did not know that a streaming JSON parsing library existed.
But I knew exactly what to look for, because batched/streaming data processing is one of the basic techniques for processing large datasets in a memory-efficient way.
The same high-level technique applies to populating Pandas dataframes from SQL queries, for example, just with different details and APIs.
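As a sketch of that batched pattern for SQL results, here is a version that uses only the standard library’s sqlite3 module rather than Pandas. The table name, column, and batch size are hypothetical, and the in-memory database exists only to make the example self-contained.

```python
import sqlite3
from typing import Iterator


def iter_batches(cursor: sqlite3.Cursor, batch_size: int) -> Iterator[list]:
    """Yield query results in fixed-size batches instead of all at once."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        yield rows


# Hypothetical table, populated in memory for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?)", [(float(i),) for i in range(10)]
)

cursor = conn.execute("SELECT value FROM readings")
total = 0.0
batch_sizes = []
for batch in iter_batches(cursor, batch_size=4):
    batch_sizes.append(len(batch))  # at most 4 rows are held at a time
    total += sum(value for (value,) in batch)

print(total, batch_sizes)
conn.close()
```

With Pandas, passing a `chunksize` argument to `read_sql` gives a similar iterator of smaller dataframes. The details differ, but the structure of the loop is the same.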
And as an example from CPU runtime, being able to identify quadratic algorithms can help you avoid a common performance pitfall, with very little effort.
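For instance, here is a sketch of one common accidental quadratic, order-preserving deduplication, next to a linear alternative. The function names are made up, and the linear version assumes the items are hashable.

```python
def dedupe_quadratic(items):
    """O(N^2) in the worst case: each `in` check scans the result list."""
    result = []
    for item in items:
        if item not in result:  # linear scan of a list
            result.append(item)
    return result


def dedupe_linear(items):
    """O(N) on average: set membership checks take constant time on average."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:  # hash lookup instead of a scan
            seen.add(item)
            result.append(item)
    return result


sample = [3, 1, 3, 2, 1]
print(dedupe_quadratic(sample))
print(dedupe_linear(sample))
```

The two functions return the same result and look almost identical. The tell-tale sign is a linear-time operation (here, `in` on a list) inside a loop over the same data.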
Elsewhere on this site you’ll find articles I’ve written on performance optimization and reducing memory usage, and there are plenty of other resources available to improve your skills. A little time investment now can result in significant time and money savings in the future.
Increase runtime visibility
Beyond improving your skills, there’s also the need for visibility into why your code is slow or using too much memory. Partially this is about using the appropriate tools during development. For Python programs, for example:
- VizTracer and other tools allow you to measure performance in different ways.
- Fil allows you to measure peak memory, and memory-profiler can give you line-by-line allocations and deallocations.
- The Sciagraph performance and memory profiler is designed for Python data processing jobs.
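Even before installing any of these tools, the standard library offers a rough first look. The sketch below uses `time.perf_counter` and `tracemalloc` as stand-ins; it is not how the profilers listed above work. The `measure` helper and the sample workload are made up for illustration.

```python
import time
import tracemalloc


def measure(func, *args):
    """Run func and report its result, elapsed seconds, and peak traced bytes."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


def build_squares(n):
    """Sample workload: materializes a list of n integers."""
    return [i * i for i in range(n)]


result, elapsed, peak_bytes = measure(build_squares, 100_000)
print(f"{elapsed:.4f}s, peak {peak_bytes / 1024:.0f} KiB")
```

`tracemalloc` only sees memory allocated through Python’s allocator and slows the code down while tracing, so treat the numbers as a rough guide rather than a replacement for a dedicated profiler.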
You also want to increase performance visibility in production, since many performance problems are only visible with real data or in the actual environment where your code is running. The Sciagraph profiler is designed to support production profiling of data processing batch jobs; for other domains, like web applications, you might reach for APM or observability tools, or other continuous profilers.
Given better information and improved skills, you can spend the same amount of time coding and produce software that runs faster and uses fewer resources. You can also optimize your software much more quickly, if you need to.
That doesn’t mean you won’t end up spending money on renting or buying computer hardware. But writing efficient, fast software is a skill you can learn, and it doesn’t necessarily require a huge investment. An hour spent learning a new skill might be applied to many future software projects. And the corresponding increased efficiency gives you benefits that scale in a positive way:
- If you’re scaling horizontally, the lower costs from efficient software are multiplicative.
- It will take longer to hit architectural breakpoints.
- Your software will produce fewer greenhouse emissions.