Two kinds of threads pools, and why you need both
When you’re doing large scale data processing with Python, threads are a good way to achieve parallelism. This is especially true if you’re doing numeric processing, where the global interpreter lock (GIL) is typically not an issue. And if you’re using threading, thread pools are a good way to make sure you don’t use too many resources.
But how many threads should your thread pool have? And do you need just one thread pool, or more than one?
In this article we’ll see that for data processing batch jobs:
- There are two kinds of thread pools, each for different use cases.
- Each kind requires a different configuration.
- You might need both.
Setting the scene: performance architecture is situational
There is no such thing as performance in the abstract. If we want to choose a software architecture that will be fast, we need to be explicit about what “fast” actually means, and what kind of software we’re discussing.
First, we’re focusing on batch jobs: programs that load some data, processes it, output a result, and then exit. The underlying principles apply to other kinds of applications, but you are likely to come up with different architectures given different requirements.
Second, the performance goal is to have the whole program finish as quickly as possible. Processing intermediate results in a timely manner is not a priority. Put another way, the goal is throughput, and internal tasks’ latency only matters if it impacts throughput. This is in contrast to a web application, for example, where tail latency of individual requests is often very important.
Third, we’re going to assume that your application is the only one running on the machine. That is, all the computer’s or virtual machine’s resources are dedicated to your application’s exclusive use. In a world of cloud computing, virtual machines, and containers, this is a reasonable assumption to make for batch jobs.
Thread pools for CPU-bound tasks
Let’s assume for the moment your program is completely bottlenecked on CPU: it never waits for external resources. Given your goal is to have the overall program finish as quickly as possible, you want to max out all CPU resources. So first, you make sure your code can run in parallel. Then, you have to figure out how many threads you want running.
If your computer has 8 CPU cores, you want a minimum of 8 threads running at any given time so that you are utilizing all 8 cores. What happens when you have more than 8 threads running purely CPU-bound tasks?
- Only 8 threads can run at a given time. The operating system will therefore switch between threads, putting extraneous threads to sleep.
- Context switching between threads can actually slow things down, for example due to worse use of CPU memory caches.
In short, if we have N CPU cores, and we’re only running CPU-bound tasks, we want exactly N threads running at any given time. And a thread pool is a great way to make sure that happens.
Thread pools for network-bound tasks
Next, let’s consider the other extreme, and assume our program is completely network-bound, waiting for responses from network servers. For example, you might be querying a database, or sending metrics to a remote tracking server. If you’re dealing with network-bound tasks, the number of CPU cores is irrelevant, because most of your time is spent waiting, not processing. But you still need some way to handle multiple concurrent operations.
One way to implement this is with an async event loop like
asyncio and corresponding libraries.
If you can stick to async libraries, you don’t need threads at all.
But sometimes either you don’t want to use async, or you’re dealing with client libraries that block.
For an example of the latter, consider
requests.get("https://example.com"): the function won’t return until it gets a response from the server.
In the interim, the thread calling this function can’t do anything else.
Lacking an async event loop, if you want concurrency you need multiple threads. How many threads do you need?
As a first approximation, the number of threads should be at least the number of concurrent blocking operations you want to do. You can do 5 concurrent network requests to a remote server with 5 threads, or 50 concurrent requests with 50 threads, or 500 concurrent requests with 500 threads. If you have more threads hanging around doing nothing, that’s fine, up to a point to at least.
Once you have enough concurrency you will start hitting resource limits, of course: the number of database connections, or memory, or operating system limits on file descriptors, or the remote server’s customer limits. So it’s useful to have an upper bound on the number of concurrent connections, and a thread pool is a fine way to do that. But the pool’s size can be much bigger than your typical concurrency level with no ill effects.
Reading and writing to disk is another kind of task. Usually it’s fast, but sometimes it can be slow if you saturate disk bandwidth, so sometimes using threads may be helpful.
Comparing the two kinds of thread pools
Here’s a summary of the two kinds of tasks and their corresponding thread pool sizes and goals:
|Thread pool goal
|Thread pool size
|Utilize all cores
|Number of CPU cores
|Prevent hitting resource limits
|Higher than desired concurrency, but sufficiently low that you don’t hit other resource limits
Handling a mixture of CPU-bound and network-bound tasks
Many data processing programs will contain a mixture of the two kinds of tasks: CPU and I/O. If you’re using an async event loop for all I/O, your thread pool can be used solely for CPU tasks.
But if you are doing a mixture of CPU-intensive and blocking network operations, having a single thread pool will make your program run more slowly:
- If you size your thread pool based on CPU cores, some of those threads will end up blocking on network operations. As a result, you won’t fully utilize your CPUs. In addition, your concurrency for network tasks will also be limited. This means your code will end up running more slowly for two different reasons!
- If you size your thread pool based on concurrency for I/O operations, you will end up with situations where you’re trying to run more CPU intensive tasks than cores. Sometimes way more tasks than you have cores. As discussed above, this can result in slower computation, and can significantly increase memory usage for some applications.
Instead of trying to have a single thread pool which is incorrectly sized no matter what, this suggests you want (at least) two thread pools:
- A thread pool for CPU-bound tasks, sized by number of cores.
- A thread pool for network-bound tasks, sized by the desired level of concurrency for these tasks.
First, as I discuss in another article, some libraries will have their own internal thread pools, mostly used for computation. If you’re managing application-level thread pools, you probably want to disable those.
Second, if you’re using process pools, similar considerations apply: networking and CPU have different constraints and therefore different needs.
Third, the suggestions in this article are situation-specific (data processing batch jobs), but also somewhat generic. Your application may have different concurrency requirements for different kinds of network tasks, for example a database connection pool with a limited size. So take this advice as a starting point, not as the final answer.