Making pip installs a little less slow
Installing your Python application’s dependencies can be surprisingly slow. Whether you’re running tests in CI, building a Docker image, or installing an application, downloading and installing dependencies can take a while.
So how do you speed up installation with pip?
In this article I’ll cover:
- Avoiding the slow path of installing from source.
- pip download speed, and the alternatives: Pipenv and Poetry.
- A useful pip option that can, sometimes, speed up installation significantly.
Avoiding installs from source
When you install a Python package, there are typically two ways it can be installed:
- A packaged-up source distribution, often built around a setup.py. In this case, installing will often require running Python code (a little slow), and sometimes compiling large amounts of C/C++/Rust code (potentially extremely slow).
- A wheel (a .whl file) that can just be unpacked straight onto the filesystem, with no need to run code or compile native extensions.
If at all possible, you want to install wheels, because installing from source will be slower. If you need to compile significant amounts of C code, installing from source will be much slower; instead of relying on precompiled binaries, you’ll need to compile it all yourself.
To ensure you’re installing wheels as much as possible:
- Make sure you’re using the latest version of pip before installing dependencies. Binary wheels sometimes require newer versions of pip than the one packaged by default by your current Python.
- Don’t use Alpine Linux; stick to Linux distributions that use glibc, e.g. Debian, Ubuntu, or RedHat. Standard Linux wheels require glibc, but Alpine uses the musl C library. Wheels for musl-based distributions like Alpine are starting to become available, but they’re still not as common.
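To make the wheel versus source distinction concrete, here is a minimal sketch that classifies a distribution filename and, for wheels, reports whether the platform tag targets glibc (manylinux) or musl (musllinux). The helper name `describe_distribution` and the example filenames are hypothetical; the only thing assumed about real packages is the standard wheel naming layout, where the platform tag is the last dash-separated field.

```python
def describe_distribution(filename: str) -> str:
    """Classify a distribution filename (hypothetical helper, for illustration)."""
    if filename.endswith(".whl"):
        # Wheel names end with: ...-{python tag}-{abi tag}-{platform tag}.whl
        platform_tag = filename[: -len(".whl")].split("-")[-1]
        if "musllinux" in platform_tag:
            return "wheel (musl, e.g. Alpine)"
        if "manylinux" in platform_tag:
            return "wheel (glibc, e.g. Debian/Ubuntu/RedHat)"
        if platform_tag == "any":
            return "wheel (pure Python, any platform)"
        return "wheel (other platform)"
    if filename.endswith((".tar.gz", ".zip")):
        # Source archives must be built, and possibly compiled, at install time.
        return "source distribution"
    return "unknown"


if __name__ == "__main__":
    # Made-up filenames that follow the standard naming conventions.
    for name in [
        "example_pkg-1.0-cp311-cp311-manylinux_2_17_x86_64.whl",
        "example_pkg-1.0-cp311-cp311-musllinux_1_1_x86_64.whl",
        "example_pkg-1.0-py3-none-any.whl",
        "example_pkg-1.0.tar.gz",
    ]:
        print(f"{name}: {describe_distribution(name)}")
```

Running something like this over the filenames pip reports downloading is one quick way to spot packages that are falling back to the slow source path.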
Comparing installation speed between pip, Pipenv, and Poetry
Installing Python packages involves two steps:
- Downloading the package.
- Installing the already downloaded package.
By default, Python package managers will cache downloaded packages on disk, so if you install them a second time in a different virtualenv the package won’t need to be re-downloaded. I therefore measured both variants: a cold cache where the package had to be downloaded, and a warm cache where the package was already available locally.
In all cases I made sure to create the virtualenvs in advance, and for pip I made sure to use hashes in the requirements.txt, to match the hash validation that the other two package managers do by default.
I used the transitive dependencies for installing matplotlib, resulting in the installation of 12 different packages in total.
Here’s how long each installation took, measuring both wallclock and CPU time:
| Tool | Cache | Wallclock time | CPU time |
Some things to notice:
- pip is the slowest by wallclock time when the cache is cold.
- Wallclock time isn’t really that different between any of them when the cache is warm, i.e. the packages are already downloaded.
- Both Pipenv and Poetry use parallelism, as we can see from CPU time that is higher than wallclock time; pip is currently single-threaded.
- Pipenv uses quite a lot of CPU compared to the other two; Poetry is a bit better, but still higher than pip.
This example was run with 12 packages being installed; with a larger number of dependencies, it’s possible that Poetry’s parallel installation would have more of an impact.
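If you want to take similar measurements yourself, the following is a rough sketch of one way to capture both wallclock and CPU time for a command. It is not the harness used for the numbers above; the `measure` helper is a hypothetical name, and the CPU figure relies on `os.times()` reporting child-process time, which only works on Unix-like systems.

```python
import os
import subprocess
import sys
import time


def measure(command: list[str]) -> tuple[float, float]:
    """Run a command; return (wallclock seconds, CPU seconds used by children)."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(command, check=True)
    wallclock = time.perf_counter() - start
    after = os.times()
    # children_user/children_system cover child processes that have been waited
    # for. On Windows these fields are always zero.
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wallclock, cpu


if __name__ == "__main__":
    # Stand-in workload; swap in a real install command such as
    # [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"].
    wall, cpu = measure([sys.executable, "-c", "sum(range(10**6))"])
    print(f"wallclock={wall:.2f}s cpu={cpu:.2f}s")
```

CPU time above wallclock time is the sign of parallelism mentioned in the list above: more than one core was busy at once.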
Keeping the cache warm
Notice that in all cases you get a speedup from having a warm cache, i.e. reusing already downloaded packages. On your local machine, that happens automatically. In most CI services, your cache will start out empty.
To work around that, most CI systems will have some way to store a cache directory at the end of the run, and then load it at the beginning of the next run. If you’re using GitHub Actions, you can use the built-in caching support in the action used to set up Python.
This is still not as fast as running on a dedicated machine, however: storing and loading the cache also takes time.
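CI caches are usually looked up by a key, and a common pattern is to derive that key from the dependency list so the cache is refreshed only when dependencies change. The sketch below illustrates that idea under stated assumptions: the `pip_cache_key` helper, the `pip` prefix, and the 16-character truncation are all hypothetical choices, not something any particular CI service requires.

```python
import hashlib
from pathlib import Path


def pip_cache_key(requirements_file: Path, prefix: str = "pip") -> str:
    """Build a cache key that changes whenever the requirements file changes."""
    digest = hashlib.sha256(requirements_file.read_bytes()).hexdigest()
    # A shortened digest is enough for illustration; the length is arbitrary.
    return f"{prefix}-{digest[:16]}"


if __name__ == "__main__":
    # Assumes a requirements.txt in the current directory.
    print(pip_cache_key(Path("requirements.txt")))
```

The directory worth saving under that key is pip’s download cache, which recent versions of pip will print if you run `pip cache dir`.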
Going (very slightly) faster by disabling the version check
pip may check if you’re running the latest version or not, and print a warning if you’re not.
You can disable this check like so:
pip --disable-pip-version-check install ...
This saves me about 0.2-0.3s, not a very significant improvement; the actual improvement probably depends on your network speed and other factors.
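If you drive pip from a script, the same flag can be passed programmatically. This is a small sketch, not an official API: `pip_install` is a hypothetical wrapper that simply builds the command line shown above and runs it with the current interpreter.

```python
import subprocess
import sys


def pip_install(*args: str, run: bool = True) -> list[str]:
    """Build (and by default run) a pip install command without the version check."""
    command = [
        sys.executable, "-m", "pip",
        "--disable-pip-version-check",  # global option, placed before the subcommand
        "install",
        *args,
    ]
    if run:
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # run=False only prints the command instead of installing anything.
    print(" ".join(pip_install("-r", "requirements.txt", run=False)))
```

pip also reads its options from environment variables of the form `PIP_<OPTION>`, so setting `PIP_DISABLE_PIP_VERSION_CHECK=1` in your CI environment should have the same effect without changing any commands.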
Going faster (sometimes) with disabled compilation
Can we do better? In some cases, yes.
After packages are downloaded (if they’re not cached locally) and installed onto the filesystem, package managers do one final step: they compile the .py source files into .pyc bytecode files, and store them in __pycache__ directories.
This is not the same as compiling a C extension; it is just an optimization to make loading Python code faster on startup.
Instead of having to compile the .pyc at import time, the .pyc is already there.
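To see what this step produces, here is a minimal sketch that byte-compiles a throwaway module with the standard library’s `py_compile` and checks that the result lands where the import system would look for it. The module name `example.py` and the `compile_example` helper are made up for illustration.

```python
import importlib.util
import py_compile
import tempfile
from pathlib import Path


def compile_example() -> Path:
    """Compile a throwaway module and return the path of the resulting .pyc file."""
    workdir = Path(tempfile.mkdtemp())
    source = workdir / "example.py"  # hypothetical module, for illustration only
    source.write_text("VALUE = 42\n")
    # py_compile writes the bytecode to the location the import system checks.
    pyc_path = py_compile.compile(str(source), doraise=True)
    assert pyc_path == importlib.util.cache_from_source(str(source))
    return Path(pyc_path)


if __name__ == "__main__":
    print(compile_example())
```

When the .pyc is missing, Python does this same compilation the first time the module is imported, which is why skipping it at install time only defers the cost.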
It turns out that bytecode compilation takes a significant amount of the time spent by pip.
But you can disable this step by calling pip install --no-compile.
Here’s a comparison of how long it takes to install packages both with and without .pyc compilation, in both cases when the cache is warm so no downloads are needed:
| Installation method | Cache | Wallclock time | CPU time |
So should you always use this option?
Just because pip install is faster doesn’t mean you’ve saved time overall.
Any module you import will still need to be compiled into a .pyc; it’s just that the work will happen at Python run time, instead of at package installation time.
So if you’re importing all or most modules, overall you might not save any time at all; you’ve just moved the work to a different place.
In other cases, however, --no-compile will save you time.
For example, in your testing setup you might be installing many third-party packages for integration testing, but only using a small amount of those libraries’ code.
As such, there’s no point in compiling lots of modules you won’t be using.
Neither Pipenv nor Poetry seems to support this option at this time.
Package installation could be much faster
Given how many people use Python, slow package installations add up.
It’s difficult to estimate how many pip installs are happening in the world, but pip itself was downloaded 100 million times in the month previous to writing this article, so we can take that as a lower bound.
If you could shave just 1 second off of every one of those 100 million installs, that would be 3.17 years of waiting saved every month.
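The arithmetic behind that figure is simple enough to check. The sketch below assumes a year of 365.25 days; the variable names are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumes an average year of 365.25 days

installs_per_month = 100_000_000  # lower bound quoted above
seconds_saved_per_install = 1

years_saved = installs_per_month * seconds_saved_per_install / SECONDS_PER_YEAR
print(f"{years_saved:.2f} years of waiting saved per month")  # prints 3.17 years ...
```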
There is clearly a lot of room for improvement in package installation in the Python world:
- Poetry already implements parallelism to some extent, but it doesn’t seem to be as efficient as one might hope, given higher CPU usage than pip. But it may already be faster on a wallclock basis for larger numbers of dependencies.
- Pipenv’s CPU usage is even worse.
- In a world where multiple CPUs are the default, and single-core speed increases have stalled, pretty much every CPU-based task pip does could benefit from parallelism.
- Parallel downloads and version verification would also be helpful; for small package sizes, network latency is the likely bottleneck, something parallelism can help with.
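To illustrate why latency-bound downloads benefit from parallelism, here is a toy sketch that replaces real network requests with a fixed sleep. Nothing here reflects how pip, Pipenv, or Poetry are actually implemented; `fake_download`, `download_all`, and the 50ms latency are all assumptions made up for the demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SIMULATED_LATENCY = 0.05  # seconds per request; an arbitrary illustrative value


def fake_download(name: str) -> str:
    """Stand-in for a network request: sleeps instead of touching the network."""
    time.sleep(SIMULATED_LATENCY)
    return name


def download_all(names: list[str], workers: int) -> float:
    """Fetch every name using `workers` threads and return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_download, names))
    return time.perf_counter() - start


if __name__ == "__main__":
    packages = [f"package-{i}" for i in range(12)]
    print(f"one at a time: {download_all(packages, workers=1):.2f}s")
    print(f"in parallel:   {download_all(packages, workers=12):.2f}s")
```

With one worker the waits add up; with one worker per package they overlap, so the total is close to a single round trip.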
If you’re interested in helping, the pip repository has a number of issues and in-progress PRs covering various aspects.
Finally, if you maintain open source Python packages: since wheels install faster, make sure to provide wheels for your package, even if it’s pure Python.