Reproducible and upgradable Conda environments: dependency management with conda-lock
If your application uses Conda to manage dependencies, you face a dilemma.
On the one hand, you want to pin all your dependencies to specific versions, so you get reproducible builds.
On the other hand, once you’ve pinned everything, upgrades become difficult: you’ll start encountering the infamous "The following specifications were found to be incompatible with each other" error.
Ideally you’d be able to both have a consistent, reproducible build, and still be able to quickly change your dependencies. And you can do this—with a little understanding, and a bit more work.
In this article you’ll learn:
- Three ways of specifying your dependencies, and how they impede and/or enable reproducibility and upgrades.
- Why in practice you want to have two different dependency files.
- How to use a third-party tool, conda-lock, to easily maintain these two different files.
Three kinds of dependency specification
We have two goals, reproducibility and upgradability; we will limit our discussion to just dependencies, keeping in mind that both goals require more work than the limited focus of this article.
Focusing just on dependencies:
- Reproducibility: If we reinstall the code, we should get the same libraries.
- Upgradability: We should be able to change versions of our dependencies without having to fight the packaging system.
Let’s see how different ways of specifying dependencies can achieve these goals.
Let’s say your application depends on Python and Pandas.
You create an environment.yml for your dependencies:

    name: example
    channels:
      - conda-forge
    dependencies:
      - python
      - pandas
You install Python and Pandas with conda env create, write some code, it runs correctly, all is well with the world.
Now, time passes, and you want to rerun the same analysis, with just a minor change to the code. Unfortunately, there’s been a new release of Pandas in the interim—and with a new release, there are differences:
- It may have dropped some old APIs.
- Some APIs might behave differently on purpose, if the old behavior was a bug.
- New bugs might have been introduced.
If you have to recreate the environment, it will install the latest version of Pandas: you might get different results, or maybe your code won’t run at all. Similarly, there might be a new version of Python, which can also cause problems.
On the flip side, from the perspective of the packaging infrastructure upgrades are trivial to achieve: they happen automatically every time you recreate your environment.
Versioned direct dependencies
Given how bad our environment.yml is at reproducible installs, we can constrain what it installs a little.
We can add a version specifier to Python, Pandas, or both.

    name: example
    channels:
      - conda-forge
    dependencies:
      - python=3.8
      - pandas=1.0
This is a versioned direct dependency list; “direct” meaning “this package is something I directly import or run.”
This is an improvement in terms of reproducibility, but it still has issues. If you create a new environment with this file, you’ll see that many other packages are installed, dependencies of dependencies, and those are dependencies whose versions you didn’t specify:

    $ conda env create
    ...
    $ conda activate example
    (example) $ conda list
    # packages in environment at /home/itamarst/.conda/envs/example:
    #
    # Name                    Version          Build             Channel
    _libgcc_mutex             0.1              conda_forge       conda-forge
    _openmp_mutex             4.5              1_gnu             conda-forge
    ca-certificates           2020.11.8        ha878542_0        conda-forge
    certifi                   2020.11.8        py38h578d9bd_0    conda-forge
    ld_impl_linux-64          2.35.1           hed1e6ac_0        conda-forge
    libblas                   3.9.0            2_openblas        conda-forge
    libcblas                  3.9.0            2_openblas        conda-forge
    libffi                    3.2.1            he1b5a44_1007     conda-forge
    libgcc-ng                 9.3.0            h5dbcf3e_17       conda-forge
    # ... etc.
All of those dependencies might change out from under you whenever you recreate the environment. For example, if you installed NumPy 1.19 this time, the next time you install you might get NumPy 1.20, which in theory could have a bug that changes Pandas’ results.
On the other hand, upgrades are still pretty easy: to switch to Python 3.9, just change the version of Python.
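To make the difference between the two styles concrete, here is a minimal sketch that flags direct dependencies lacking a version specifier. The helper name `unversioned` is made up for illustration, and it only understands the simple `name=version` strings used in this article, not Conda's full spec syntax.

```python
def unversioned(dependencies):
    """Return the dependencies that carry no version constraint.

    Only handles simple specs such as "pandas" or "pandas=1.0";
    Conda's real spec syntax is richer than this.
    """
    missing = []
    for spec in dependencies:
        spec = spec.strip()
        # Treat any spec containing a comparison character as versioned.
        if not any(ch in spec for ch in "=<>"):
            missing.append(spec)
    return missing


print(unversioned(["python", "pandas"]))          # ['python', 'pandas']
print(unversioned(["python=3.8", "pandas=1.0"]))  # []
```

A check like this could run in CI to stop an unversioned entry from slipping into environment.yml.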
Transitively-pinned dependencies, aka locked dependencies
Once you have that environment with all the dependencies installed, you can create a new environment.yml that has the exact versions of all dependencies, including dependencies of dependencies.
This is the “transitively-pinned” or “locked” dependency list, which you can create with conda env export:

    (example) $ conda env export > environment.lock.yml
    (example) $ cat environment.lock.yml
    name: example
    channels:
      - conda-forge
      - defaults
    dependencies:
      - _libgcc_mutex=0.1=conda_forge
      - _openmp_mutex=4.5=1_gnu
      - ca-certificates=2020.11.8=ha878542_0
      - certifi=2020.11.8=py38h578d9bd_0
      - ld_impl_linux-64=2.35.1=hed1e6ac_0
      - libblas=3.9.0=2_openblas
      - libcblas=3.9.0=2_openblas
      - libffi=3.2.1=he1b5a44_1007
      - libgcc-ng=9.3.0=h5dbcf3e_17
      # ... etc.
In practice there are some technical issues with using conda env export, but we’ll put those aside for now.
For now, we can just notice that every time you create a new environment with this locked dependency file, you will get the exact same packages installed. As far as reproducibility is concerned, this is ideal.
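Each entry in the exported file has the shape name=version=build, which is what makes the install exact. As a rough illustration, the sketch below splits such an entry into its three parts; `Pin` and `parse_pin` are hypothetical names, not part of Conda.

```python
from typing import NamedTuple


class Pin(NamedTuple):
    name: str
    version: str
    build: str


def parse_pin(entry):
    """Split a 'name=version=build' entry from an exported environment."""
    parts = entry.split("=")
    if len(parts) != 3:
        # A spec like "pandas=1.0" is versioned but not fully pinned.
        raise ValueError(f"not a fully pinned entry: {entry!r}")
    return Pin(*parts)


pin = parse_pin("certifi=2020.11.8=py38h578d9bd_0")
print(pin.name, pin.version, pin.build)
```

The build string is the part that ties a package to a particular Python version or platform, which is one reason a locked file is so rigid.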
But there’s a problem: upgrades are going to be hard. Let’s say we want to switch to Python 3.9, so we edit the YAML file to say that. Then we try to install:

    $ conda env create -n example2 -f environment.lock.yml
    Collecting package metadata (repodata.json): done
    Solving environment: \
    Found conflicts! Looking for incompatible packages.
    This can take several minutes. Press CTRL-C to abort.
    UnsatisfiableError: The following specifications were found to be incompatible with each other:
    Output in format: Requested package -> Available versions
    Package pip conflicts for:
    setuptools==49.6.0=py38h924ce5b_2 -> python[version='>=3.8,<3.9.0a0'] -> pip
    pytz==2020.4=pyhd8ed1ab_0 -> python[version='>=3'] -> pip
    python=3.9 -> pip
    python-dateutil==2.8.1=py_0 -> python -> pip
    certifi==2020.11.8=py38h578d9bd_0 -> python[version='>=3.8,<3.9.0a0'] -> pip
    pip==20.2.4=py_0
    # ... at this point there are pages and pages of output ...
So we now have reproducibility, but upgrades are quite difficult, perhaps even impossible without starting from scratch.
Choosing how to specify dependencies
Let’s summarize what we’ve learned about the three kinds of dependency specifications:
|Kind of specification|Reproducibility|Upgradability|
|---|---|---|
|Direct|❌ Awful|✓ Automatic|
|Versioned direct|😐 OKish|✓ Easy|
|Transitively pinned|✓ Great|❌ Awful|
None of these options are ideal. But we can get both reproducibility and upgradability by having two files.
- You use the versioned direct file to generate the locked dependency file.
- When creating an environment, you use the locked dependency file.
This gives you the best of both worlds: most of the time you are just creating a new, reproducible environment from the locked dependency file. When you want to upgrade, you regenerate the locked file, and since you’re starting with a versioned direct dependency list, the hope is that the changes of dependencies-of-dependencies won’t be too bad.
And even if something breaks, at least it’ll break at a time of your choosing, rather than every time you recreate the environment.
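One practical risk of keeping two files is forgetting to regenerate the locked file after editing the versioned direct one. The sketch below shows one possible guard, under the assumption that you save a digest of the direct file whenever you regenerate the lock. This convention and the function names are invented for illustration and are not something Conda itself does.

```python
import hashlib


def spec_digest(spec_text):
    """Hash the contents of the versioned direct dependency file."""
    return hashlib.sha256(spec_text.encode("utf-8")).hexdigest()


def lock_is_stale(spec_text, recorded_digest):
    """True if the direct file changed since the lock was generated."""
    return spec_digest(spec_text) != recorded_digest


spec = "dependencies:\n  - python=3.8\n  - pandas=1.0\n"
recorded = spec_digest(spec)  # saved when the lock file is regenerated

edited = spec.replace("python=3.8", "python=3.9")
print(lock_is_stale(spec, recorded))    # False
print(lock_is_stale(edited, recorded))  # True
```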
Some technical difficulties with conda env export
In the example above we used conda env export to generate the locked file from the current environment; the environment in turn was created from the versioned direct environment.yml.
This has some issues:
- You may have manually installed some packages without adding them to environment.yml; the export will grab those too, so now your locked file contains packages that environment.yml does not list.
- The export file has a prefix entry at the end, holding the environment's path on your machine, which you probably want to delete before using.
- Conda has a bug where channels are exported in random order, instead of the sort order in the original environment.yml.
- Different operating systems might install different packages; if you conda env export on macOS, for example, the resulting lock file won’t work in Docker.
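If you do stay with conda env export, the machine-specific entry can at least be cleaned up automatically. The export ends with a top-level `prefix:` line recording where the environment lives on disk; the sketch below drops it. The function name `strip_prefix` is hypothetical, and this addresses only that one issue, not the others in the list.

```python
def strip_prefix(exported_yaml):
    """Drop the top-level 'prefix:' line from `conda env export` output."""
    kept = [
        line
        for line in exported_yaml.splitlines()
        if not line.startswith("prefix:")
    ]
    return "\n".join(kept) + "\n"


exported = """\
name: example
channels:
  - conda-forge
  - defaults
dependencies:
  - libffi=3.2.1=he1b5a44_1007
prefix: /home/itamarst/.conda/envs/example
"""
print(strip_prefix(exported), end="")
```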
Luckily, there’s a tool that solves these issues.
Rather than creating an environment.yml, conda-lock creates a “lock file”, which is basically a set of URLs to download.
This has the benefits of:
- Speeding up installs, since you don’t have to wait for the Conda package resolver.
- Allowing for reproducible builds, by transitively pinning the dependencies.
In addition, you can specify which operating system you want to build the lock file for, so you can create a Linux lock file on other operating systems. By default it generates lock files for Linux, macOS, and 64-bit Windows, which is very convenient.
As a reminder, our environment.yml looks like this:

    name: example
    channels:
      - conda-forge
    dependencies:
      - python=3.8
      - pandas=1.0
Here’s how you run conda-lock (you can also install it with pip install conda-lock):

    $ conda install -c conda-forge conda-lock
    $ conda-lock
    ...
    generating lockfile for osx-64
    generating lockfile for linux-64
    generating lockfile for win-64
    To use the generated lock files create a new environment:
      conda create --name YOURENV --file conda-linux-64.lock
    $ ls
    conda-linux-64.lock  conda-osx-64.lock  conda-win-64.lock  environment.yml
As explained in that message, you can now create an environment from the lock files, with a slightly different syntax than a normal conda env create.
In my case, I’ll use the Linux version:

    $ conda create --name fromlock --file conda-linux-64.lock
    ...
    $ conda activate fromlock
    (fromlock) $ python
    Python 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05)
    [GCC 7.5.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pandas
    >>> pandas.__version__
    '1.0.1'
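Because there is one lock file per platform, a setup script has to pick the right one. Here is a minimal sketch that maps the value of `platform.system()` to the three file names generated above. The function name is hypothetical, and the mapping ignores CPU architecture, so treat it as a starting point only.

```python
import platform

# File names as generated by conda-lock in the run shown above.
LOCK_FILES = {
    "Linux": "conda-linux-64.lock",
    "Darwin": "conda-osx-64.lock",
    "Windows": "conda-win-64.lock",
}


def lock_file_for(system=None):
    """Return the lock file name for an OS (default: the current one)."""
    system = system or platform.system()
    try:
        return LOCK_FILES[system]
    except KeyError:
        raise RuntimeError(f"no lock file generated for {system!r}") from None


print(lock_file_for("Linux"))   # conda-linux-64.lock
print(lock_file_for("Darwin"))  # conda-osx-64.lock
```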
And here’s what the conda-linux-64.lock file looks like:

    # platform: linux-64
    # env_hash: 1c5ab33dc2ffb2cdf714d63bec414d84ea77143c13958b1743d374052986892f
    @EXPLICIT
    https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2#d7c89558ba9fa0495403155b64376d81
    https://conda.anaconda.org/conda-forge/linux-64/ca-certificates-2020.11.8-ha878542_0.tar.bz2#f9cdccd43ac20a0d1637d84d58c6ff5c
    https://conda.anaconda.org/conda-forge/linux-64/ld_impl_linux-64-2.35.1-hed1e6ac_0.tar.bz2#d0cf77c331382475133dc6c34e7461d7
    ...
As requested, the new environment has Python 3.8 and Pandas 1.0.
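Since the lock file is plain text with one URL per line, it is easy to inspect programmatically, for example to review what changed after an upgrade. The sketch below extracts the URL and the value after `#` from each line; it assumes that value is an MD5 checksum, which its 32 hex digits suggest. `parse_explicit_lock` is a hypothetical helper, not part of conda-lock.

```python
def parse_explicit_lock(text):
    """Return (url, checksum) pairs from an @EXPLICIT-style lock file."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments such as "# platform: ...", and the marker.
        if not line or line.startswith("#") or line == "@EXPLICIT":
            continue
        url, _, checksum = line.partition("#")
        entries.append((url, checksum))
    return entries


LOCK = """\
# platform: linux-64
# env_hash: 1c5ab33dc2ffb2cdf714d63bec414d84ea77143c13958b1743d374052986892f
@EXPLICIT
https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2#d7c89558ba9fa0495403155b64376d81
https://conda.anaconda.org/conda-forge/linux-64/ca-certificates-2020.11.8-ha878542_0.tar.bz2#f9cdccd43ac20a0d1637d84d58c6ff5c
"""

for url, checksum in parse_explicit_lock(LOCK):
    # Print just the package file name and its checksum.
    print(url.rsplit("/", 1)[-1], checksum)
```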
One caveat to keep in mind: pip dependencies won’t be included in the lockfile.
To deal with those you can just have a two-stage install, where you pip install once the environment has been created.
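The two-stage install can be sketched as a pair of commands: create the environment from the lock file, then run pip inside it. The helper below only builds the commands and does not execute anything. The file name `requirements.txt` and the function name are assumptions for illustration, and `conda run` is just one way to invoke pip in the new environment.

```python
import shlex


def install_commands(env_name, lock_file, requirements):
    """Build the two install commands; nothing is executed here."""
    return [
        # Stage 1: Conda packages, exactly as listed in the lock file.
        ["conda", "create", "--name", env_name, "--file", lock_file],
        # Stage 2: pip packages, installed into the new environment.
        ["conda", "run", "--name", env_name,
         "python", "-m", "pip", "install", "-r", requirements],
    ]


for cmd in install_commands("fromlock", "conda-linux-64.lock", "requirements.txt"):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` if you want a script to perform the install.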
There are a number of tools for managing lock files for pip dependencies.
Since that was a lot, here’s what we’ve learned:
- You want both reproducibility and easy updates.
- Versioned direct dependency files (“the packages my code imports”) give you easy updates; locked dependency files give you reproducibility.
- In practice, you want both. conda-lock lets you turn a direct dependency environment.yml into a lock file listing specific versions of the transitive dependencies.
Not using a lock file? It’s quick and easy: go and do it right now, and your builds will be reproducible going forward.