Speeding up software with faster hardware: tradeoffs and alternatives
If you’re writing software to process data, you will often hit performance problems: batch jobs that run too slowly, or use too much memory. One potential solution is purchasing better hardware. With cloud computing, switching to a computer with more cores, or adding more RAM, can be done in a few minutes, or even just a few seconds.
But as with any solution, there are tradeoffs involved. If your first solution to any performance problem is spending more money on hardware, you may eventually end up with software that is unnecessarily slow, hard to speed up, and extremely expensive.
So how do you decide if faster hardware is the correct solution to your software performance problems? In this article we’ll discuss:
- What money can buy you in terms of hardware.
- The limits of hardware as a performance solution.
- The downsides of spending money on hardware.
- Changing the tradeoff by making it easier to create efficient software from the start.
Thanks to Moshe Zadka, Nelson Elhage, and Alex Gaynor for inspiring this article.
Speeding up your code with faster hardware
If your code is running too slowly, or using too much memory, you can spend some money to get access to more powerful hardware. There are two basic approaches:
- Renting hardware in the cloud: You can pay, by the minute, for access to virtual machines or even dedicated machines. You can choose the hardware configuration you want (CPU, RAM, disk, etc.) on a case-by-case basis, and you can spin resources up and down on demand.
- Buying hardware: If you expect to be using a single computer extensively, it can be cheaper to buy a computer rather than renting one. As an individual, when you buy a desktop machine you can also get far better performance for the same amount of money compared to a laptop. Some organizations also choose to build their own clusters or data centers.
Programmer time is expensive: in the US, programmers’ time might cost their employer $150-$600/hour. So when you can buy a very capable computer for $2000, or rent a virtual machine with 384GiB RAM and 96 vCPUs for $8.60/hour, it’s often much cheaper to just solve performance problems by paying for faster hardware.
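To make that comparison concrete, here is a rough break-even sketch. The function name and the scenario (8 hours of optimization work) are assumptions for illustration; the hourly figures come from the ranges quoted above.

```python
def break_even_machine_hours(dev_hours: float, dev_rate: float, machine_rate: float) -> float:
    """Hours of machine rental that cost as much as the optimization work would."""
    return (dev_hours * dev_rate) / machine_rate


# Hypothetical scenario: 8 hours of developer time at $150/hour,
# compared with renting the $8.60/hour virtual machine mentioned above.
hours = break_even_machine_hours(dev_hours=8, dev_rate=150, machine_rate=8.60)
print(f"Optimizing only pays off after about {hours:.0f} rented machine-hours")
```

If the job will never use that many machine-hours, renting is the cheaper option; if it runs constantly, the balance shifts.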
The limits of hardware
With a faster CPU, more cores, a faster disk, or more memory, you can run your program faster—depending on where the bottleneck is. The specifics of your bottleneck are important, because money only gets you so far. How far depends on where the more expensive hardware helps.
- Parallelism and RAM can scale pretty well on a single computer.
- A single CPU core has a hard limit on how fast it can run.
Flexible limits: parallelism and RAM
In some cases more powerful hardware is readily available. Cloud machines can have as much as 24 TiB RAM (that’s 24576GiB!), and as many as 192 cores.
- If you need to process 48GB of data in RAM, you can rent or buy 64GB of RAM.
- If your processing can easily be made parallel, you can keep scaling, up to 192 tasks.
For these sorts of bottlenecks, more powerful hardware is readily available… up to a point. Once you hit the point where scaling on a single machine is a problem, you need to make the leap to a distributed system.
Switching to a distributed system may require significant changes to your software, and potentially a significant jump in the complexity of debugging. You might also see a regression in performance per machine, in which case the increases in hardware costs as you scale will be proportionally higher than they were when you were scaling a single machine.
Tighter limit: single-core speed
In some cases, your processing speed is tied to the speed of a single CPU core, because your software doesn’t parallelize well:
- This is a common occurrence in Python programs, due to the Global Interpreter Lock.
- It’s not just Python, though: on Linux, most linkers don’t take full advantage of multiple cores; the new mold linker fixes this, and has some informative performance comparisons.
- Even in software that can take advantage of multiple cores, some algorithms may parallelize up to a certain number of cores, and then stop scaling due to the nature of the algorithm or the data.
This is a problem, because unlike the increasing number of cores available on modern CPUs, single-core performance hasn’t increased much over the past few years. I’m writing this article on a computer with a 2014-era 4-core Xeon CPU, and 8 years later:
- For just US$450, you can buy a CPU with 7× the multi-core performance. Spend enough money, and you can get as high as 20× the multi-core speed of my computer.
- The CPU with the fastest single-core performance is only 2× faster on single-core performance.
Overall, single-core performance is going up much more slowly than multi-core performance. If single-core performance is your bottleneck, there’s a very hard limit on what performance improvements you can get from hardware, no matter how much money you have to spend.
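One standard way to reason about this limit is Amdahl's law, which the discussion above does not name but which captures the same point: if only part of a program can run in parallel, the serial part caps the overall speedup. The sketch below assumes a hypothetical program where 90% of the work parallelizes perfectly.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Theoretical speedup when only parallel_fraction of the work scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed workload: 90% parallelizable, 10% stuck on a single core.
for cores in (4, 16, 64, 192):
    print(f"{cores:>3} cores: {amdahl_speedup(0.9, cores):.2f}x faster")
```

Under that assumption the speedup can never reach 10×, however many cores you pay for, because the serial 10% still runs at single-core speed.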
The downsides of the faster-hardware approach
Beyond the immediate monetary cost of paying for faster hardware, there are some longer term costs you need to take into account:
- Horizontal scaling costs.
- Vertical scaling costs.
- Greenhouse emissions.
Horizontal scaling: multiplicative costs
If you’re running a data processing job only a few times, paying an extra $5 for cloud computing is no big deal. But if you’re running the same job 1,000 times a month, that additional cost now adds up to $60,000/year.
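The arithmetic is simple enough to check for your own jobs. This is a minimal sketch; the function name is made up for illustration, and the inputs are the figures from the paragraph above.

```python
def yearly_extra_cost(extra_cost_per_run: float, runs_per_month: int) -> float:
    """Extra spend per year when every run costs a little more."""
    return extra_cost_per_run * runs_per_month * 12


# $5 extra per run, at a few different (hypothetical) job frequencies.
for runs_per_month in (5, 100, 1_000):
    cost = yearly_extra_cost(5, runs_per_month)
    print(f"{runs_per_month:>5} runs/month: ${cost:,.0f}/year")
```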
Vertical scaling: hitting architectural breakpoints sooner
As we discussed above, eventually you will hit the limits of your hardware, and once you do you will need to make some changes. Depending on how complex your software is and what your design assumptions were, this may require a significant rewrite, for example switching from a single computer to a distributed system. And this architectural shift can be quite expensive.
If optimizing your software is not an option, slower code forces this architectural shift to happen sooner.
Consider two algorithms that solve the same problem, one requiring O(N) CPU time and one requiring O(N²) CPU time.
- Because the O(N) algorithm’s runtime grows so much more slowly, it will be usable within the same architectural paradigm for much larger input sizes.
- The O(N²) solution will not only use more resources while it still fits in a single computer, it will also force you to change paradigms much sooner once you need to start handling larger input sizes.
Switching to the O(N) algorithm is probably a better approach, and using it from the start would have been even better.
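As an illustration of the difference, here are two ways to check a list for duplicates. The problem and the function names are chosen for this sketch and are not from any particular codebase; the second version assumes the items are hashable.

```python
import time


def has_duplicates_quadratic(items):
    """Compare every pair of items: roughly N*N/2 comparisons, so O(N²)."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    """Remember items seen so far in a set: one pass, O(N) on average."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False


# Worst case for both: no duplicates, so every item must be examined.
data = list(range(2_000))
for func in (has_duplicates_quadratic, has_duplicates_linear):
    start = time.perf_counter()
    func(data)
    elapsed = time.perf_counter() - start
    print(f"{func.__name__}: {elapsed:.4f} seconds")
```

Doubling the input size roughly doubles the runtime of the linear version, but roughly quadruples the runtime of the quadratic one.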
Greenhouse emissions
Data centers are creating an increasing percentage of global greenhouse emissions. This is a negative side-effect that is not taken into account in the pricing of computing, but you still have an obligation to reduce it.
Shifting the tradeoffs towards efficient software
So far we’ve been assuming a static tradeoff: you can spend money on hardware, or you can spend developer time to write more efficient code. You estimate the two costs, pick whichever is lower in your situation, and revisit when the situation changes.
But there’s another way to consider the problem. All other things being equal, a more efficient program is better than a less efficient program. So while faster hardware will continue to be the appropriate solution in many cases, it’s also worth considering how you can make your software more efficient by default.
- Don’t assume hardware is the only solution.
- Reduce the costs of writing efficient software by improving your skills and improving visibility into runtime performance.
Don’t assume hardware is the only solution
It’s very easy to assume that slow or inefficient software is inevitable and unavoidable. And if you believe that, you might not even consider how to make your software faster.
But in many cases it’s quite possible to get massive speedups in runtime. For example, here’s a scientist who sped up a computation by 50×, from 8 hours to 10 minutes. Only 2× of that was from parallelism, so that’s a 25× single-core improvement. I give another example of a 50× speedup in my article on the difference between optimizing and parallelizing.
Similarly, software doesn’t have to use a lot of memory to process large datasets. Switching to a streaming or batch-based approach can change your memory usage from scaling linearly with data size to a small, fixed amount, often with no impact on runtime.
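Here is a minimal sketch of that difference, assuming a hypothetical text file with one number per line. Both functions compute the same mean, but only the first holds the whole file in memory.

```python
import os
import tempfile


def mean_loading_everything(path):
    """Read every value into a list first: memory use grows with file size."""
    with open(path) as f:
        values = [float(line) for line in f.readlines()]
    return sum(values) / len(values)


def mean_streaming(path):
    """Process one line at a time: memory use stays small and roughly fixed."""
    total = 0.0
    count = 0
    with open(path) as f:
        for line in f:  # file objects yield lines lazily
            total += float(line)
            count += 1
    return total / count


# Demo on a small temporary file containing the numbers 1 to 1000.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines(f"{i}\n" for i in range(1, 1001))
    demo_path = tmp.name

loaded = mean_loading_everything(demo_path)
streamed = mean_streaming(demo_path)
os.remove(demo_path)
print(loaded, streamed)
```

The streaming version does the same amount of arithmetic, which is why the runtime impact is often negligible.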
Reduce the costs of writing efficient software
Once you’re willing to accept that faster, more efficient software is possible, the question is how to get there without incurring extra development costs. How can you write more efficient software in the same amount of time? And how can you reduce the costs of optimizing existing software?
Improve your skills
Your ability to write efficient code is not fixed—you have the ability to improve it.
Focusing on memory usage as an example: if you’re parsing a large JSON file, using streaming JSON parsing will reduce memory usage significantly. And in many cases it’s not really any more work! You swap out two lines of code for two other lines of code, and structure your code very slightly differently. Mostly you just need to know the solution exists.
And to be clear, before I wrote that article, I did not know about the existence of the streaming JSON parser that it uses. But I knew exactly what to look for, because batched/streaming data processing is one of the basic techniques for processing large datasets in a memory-efficient way.
The same high-level technique applies to populating Pandas dataframes from SQL queries, for example, just with different details and APIs.
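To show the same idea applied to SQL, here is a sketch using only the standard library's sqlite3 module rather than Pandas, fetching rows in fixed-size batches with `cursor.fetchmany()`. The `payments` table and its contents are invented for the demo. With Pandas the details differ; for instance, `pandas.read_sql_query()` accepts a `chunksize` argument that yields one dataframe per batch.

```python
import sqlite3


def total_in_batches(conn, batch_size=1000):
    """Sum a column while holding at most batch_size rows in memory at once."""
    cursor = conn.execute("SELECT amount FROM payments")
    total = 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:  # an empty list means the query is exhausted
            break
        total += sum(amount for (amount,) in rows)
    return total


# Demo with a hypothetical in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (amount INTEGER)")
conn.executemany("INSERT INTO payments VALUES (?)", [(i,) for i in range(10_000)])
batched_total = total_in_batches(conn, batch_size=500)
print(batched_total)
```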
And as an example from CPU runtime, being able to identify quadratic algorithms can help you avoid a common performance pitfall, with very little effort.
Elsewhere on this site you’ll find articles I’ve written on performance optimization and reducing memory usage, and there are plenty of other resources available to improve your skills. A little time investment now can result in significant time and money savings in the future.
Increase runtime visibility
Beyond improving your skills, there’s also the need for visibility into why your code is slow or using too much memory. Partially this is about using the appropriate tools during development. For Python programs, for example:
- VizTracer and other tools allow you to measure performance in different ways.
- Fil allows you to measure peak memory, and memory-profiler can give you line-by-line allocations and deallocations.
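For a quick first look without installing anything, Python's standard library includes `tracemalloc`, which can report peak traced memory. The sketch below is a stand-in for illustration, not the API of any of the tools named above, and the helper and workload functions are made up for the demo.

```python
import tracemalloc


def build_squares(n):
    """Example workload: allocates a list of n integers."""
    return [i * i for i in range(n)]


def peak_memory_bytes(func, *args):
    """Run func(*args) and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()  # also clears the traces, so each call starts fresh
    return result, peak


_, small_peak = peak_memory_bytes(build_squares, 1_000)
_, large_peak = peak_memory_bytes(build_squares, 100_000)
print(f"peak for 1,000 items: {small_peak:,} bytes")
print(f"peak for 100,000 items: {large_peak:,} bytes")
```

Note that `tracemalloc` only sees allocations made through Python's memory allocator, so dedicated profilers may report different numbers.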
You also want to increase performance visibility in production, since many performance problems are only visible with real data or in the actual environment where your code is running. For Python data processing batch jobs, I’ve created the Sciagraph profiler; for other domains you might reach for APM or observability tools, and continuous profilers.
Given better information and improved skills, you can spend the same amount of time coding and produce software that runs faster and uses fewer resources. You can also optimize your software much more quickly, if you need to.
That doesn’t mean you won’t end up spending money on renting or buying computer hardware. But writing efficient, fast software is a skill you can learn, and it doesn’t necessarily require a huge investment. An hour spent learning a new skill might be applied to many future software projects. And the corresponding increased efficiency gives you benefits that scale in a positive way:
- If you’re scaling horizontally, the lower costs from efficient software are multiplicative.
- It will take longer to hit architectural breakpoints.
- Your software will produce fewer greenhouse emissions.