Reproducible and upgradable Conda environments: dependency management with conda-lock

If your application uses Conda to manage dependencies, you face a dilemma. On the one hand, you want to pin all your dependencies to specific versions, so you get reproducible builds. On the other hand, once you’ve pinned everything, upgrades become difficult: you’ll start encountering the infamous “The following specifications were found to be incompatible with each other” error.

Ideally you’d be able to both have a consistent, reproducible build, and still be able to quickly change your dependencies. And you can do this—with a little understanding, and a bit more work.

In this article you’ll learn:

  • Three ways of specifying your dependencies, and how they impede and/or enable reproducibility and upgrades.
  • Why in practice you want to have two different dependency files.
  • How to use a third-party tool, conda-lock, to easily maintain these two different files.

Three kinds of dependency specification

We have two goals: reproducibility and upgradability. Both take more work than this article covers, so we will limit the discussion to dependencies, where the goals come down to:

  1. Reproducibility: If we reinstall the code, we should get the same libraries.
  2. Upgradability: We should be able to change versions of our dependencies without having to fight the packaging system.

Let’s see how different ways of specifying dependencies can achieve these goals.

Direct dependencies

Let’s say your application depends on Python and Pandas. You create an environment.yml for your dependencies:

name: example
channels:
  - conda-forge
dependencies:
  - python
  - pandas

You install Python and Pandas with conda env create, write some code, it runs correctly, all is well with the world.

Now, time passes, and you want to rerun the same analysis, with just a minor change to the code. Unfortunately, there’s been a new release of Pandas in the interim—and with a new release, there are differences:

  • It may have dropped some old APIs.
  • Some APIs might behave differently on purpose, if the old behavior was a bug.
  • New bugs might have been introduced.

If you have to recreate the environment, it will install the latest version of Pandas: you might get different results, or maybe your code won’t run at all. Similarly, there might be a new version of Python, which can also cause problems.

On the flip side, from the perspective of the packaging infrastructure upgrades are trivial to achieve: they happen automatically every time you recreate your environment.

Versioned direct dependencies

Given how bad our environment.yml is at reproducible installs, we can constrain what it installs a little. We can add a version specifier to Python, Pandas, or both. For example:

name: example
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pandas=1.0

This is a versioned direct dependency list; “direct” meaning “this package is something I directly import or run.”

This is an improvement in terms of reproducibility, but it still has issues. If you create a new environment with this file, you’ll see that many other packages are installed, dependencies of dependencies, and those are dependencies whose versions you didn’t specify:

$ conda env create
...
$ conda activate example
(example) $ conda list
# packages in environment at /home/itamarst/.conda/envs/example:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
ca-certificates           2020.11.8            ha878542_0    conda-forge
certifi                   2020.11.8        py38h578d9bd_0    conda-forge
ld_impl_linux-64          2.35.1               hed1e6ac_0    conda-forge
libblas                   3.9.0                2_openblas    conda-forge
libcblas                  3.9.0                2_openblas    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 9.3.0               h5dbcf3e_17    conda-forge
# ... etc.

All of those dependencies might change out from under you whenever you recreate the environment. For example, if you installed NumPy 1.19 this time, the next time you install you might get NumPy 1.20, which in theory could have a bug that changes Pandas’ results.

On the other hand, upgrades are still pretty easy: to switch to Python 3.9, just change the version of Python.
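To make the difference between the first two kinds of specification concrete, here is a rough Python model of how a spec selects a version. It is a sketch, not Conda's actual solver: the pick function and the lists of available versions are made up for illustration, and it assumes that a single = in a spec such as pandas=1.0 matches 1.0 and any 1.0.x release.

```python
def version_key(version):
    """Turn "1.0.5" into (1, 0, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def pick(spec, available):
    """Pick the newest available version that satisfies a simplified spec.

    Simplified model (an assumption, not Conda's real solver): "name" accepts
    any version, and "name=1.0" accepts 1.0 and any 1.0.x release.
    """
    _name, _, wanted = spec.partition("=")
    candidates = [
        version
        for version in available
        if not wanted or version == wanted or version.startswith(wanted + ".")
    ]
    return max(candidates, key=version_key)


# Illustrative lists of releases visible at two different points in time.
last_year = ["0.25.3", "1.0.0", "1.0.1"]
today = last_year + ["1.0.5", "1.1.4"]

# Unversioned spec: follows whatever is newest.
print(pick("pandas", last_year), pick("pandas", today))          # 1.0.1 1.1.4
# Versioned spec: stays within the 1.0 series.
print(pick("pandas=1.0", last_year), pick("pandas=1.0", today))  # 1.0.1 1.0.5
```

In this model the unversioned spec drifts across a minor release, while the versioned one only drifts within the 1.0 series. Neither says anything about dependencies of dependencies such as NumPy.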

Transitively-pinned dependencies, aka locked dependencies

Once you have that environment with all the dependencies installed, you can create a new environment.yml that has the exact versions of all dependencies, including dependencies of dependencies. This is the “transitively-pinned” or “locked” dependency list, which you can create with conda env export:

(example) $ conda env export > environment.lock.yml
(example) $ cat environment.lock.yml
name: example
channels:
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=1_gnu
  - ca-certificates=2020.11.8=ha878542_0
  - certifi=2020.11.8=py38h578d9bd_0
  - ld_impl_linux-64=2.35.1=hed1e6ac_0
  - libblas=3.9.0=2_openblas
  - libcblas=3.9.0=2_openblas
  - libffi=3.2.1=he1b5a44_1007
  - libgcc-ng=9.3.0=h5dbcf3e_17
# ... etc.

In practice there are some technical issues with using conda env export, but we’ll put those aside for now.

What matters here is that every time you create a new environment with this locked dependency file, you will get the exact same packages installed. As far as reproducibility is concerned, this is ideal.

But there’s a problem: upgrades are going to be hard. Let’s say we want to switch to Python 3.9, so we edit the YAML file to say that. Then we try to install:

$ conda env create -n example2 -f environment.lock.yml
Collecting package metadata (repodata.json): done
Solving environment: \ 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
              
UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package pip conflicts for:
setuptools==49.6.0=py38h924ce5b_2 -> python[version='>=3.8,<3.9.0a0'] -> pip
pytz==2020.4=pyhd8ed1ab_0 -> python[version='>=3'] -> pip
python=3.9 -> pip
python-dateutil==2.8.1=py_0 -> python -> pip
certifi==2020.11.8=py38h578d9bd_0 -> python[version='>=3.8,<3.9.0a0'] -> pip
pip==20.2.4=py_0

# ... at this point there are pages and pages of output ...

So we now have reproducibility, but upgrades are quite difficult, perhaps even impossible without starting from scratch.
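The error output above hints at why editing a single line is not enough: several pins carry a build string such as py38h924ce5b_2, and those builds only work with Python 3.8. Here is a small sketch of how you might spot such pins before attempting an upgrade. The function name is hypothetical, the pins are taken from the output above and written in the name=version=build form used by the exported file, and treating a build string that starts with py followed by digits as a Python version tag is a heuristic I am assuming, not a documented rule.

```python
import re


def python_tied_pins(locked_specs, target):
    """Return locked specs whose build string names a different Python version.

    Heuristic (an assumption): a build string starting with "py" plus digits,
    such as "py38h924ce5b_2", was built for that specific Python version.
    """
    tag = "py" + target.replace(".", "")  # "3.9" -> "py39"
    conflicts = []
    for spec in locked_specs:
        _name, _version, build = spec.split("=")
        match = re.match(r"py\d+", build)
        if match and match.group(0) != tag:
            conflicts.append(spec)
    return conflicts


# Pins from the error output above, in name=version=build form.
locked = [
    "setuptools=49.6.0=py38h924ce5b_2",
    "pytz=2020.4=pyhd8ed1ab_0",
    "python-dateutil=2.8.1=py_0",
    "certifi=2020.11.8=py38h578d9bd_0",
]

for spec in python_tied_pins(locked, "3.9"):
    print("tied to another Python version:", spec)
```

With these pins the sketch flags setuptools and certifi, the same two packages whose lines in the error output require a Python version of at least 3.8 and below 3.9.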

Choosing how to specify dependencies

Let’s summarize what we’ve learned about the three kinds of dependency specifications:

Dependency specification    Reproducibility    Upgradability
Direct                      ❌ Awful           ✓ Automatic
Versioned direct            😐 OKish           ✓ Easy
Transitively pinned         ✓ Great            ❌ Awful

None of these options are ideal. But we can get both reproducibility and upgradability by having two files.

  1. You use the versioned direct file to generate the locked dependency file.
  2. When creating an environment, you use the locked dependency file.

This gives you the best of both worlds: most of the time you are just creating a new, reproducible environment from the locked dependency file. When you want to upgrade, you regenerate the locked file, and since you’re starting with a versioned direct dependency list, the hope is that the changes of dependencies-of-dependencies won’t be too bad.

And even if something breaks, at least it’ll break at a time of your choosing, rather than every time you recreate the environment.
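When you do regenerate the locked file, it helps to see exactly which pins moved. The following is a minimal sketch of such a comparison, assuming both lists use the name=version=build form shown earlier. The function names and the pins in the example are invented for illustration and do not come from a real environment.

```python
def pins_by_name(specs):
    """Map each package name to its "version=build" pin."""
    pins = {}
    for spec in specs:
        name, _, pin = spec.partition("=")
        pins[name] = pin
    return pins


def lock_diff(old_specs, new_specs):
    """Return {name: (old_pin, new_pin)} for every package whose pin changed.

    A pin of None means the package is missing from that list.
    """
    old_pins = pins_by_name(old_specs)
    new_pins = pins_by_name(new_specs)
    changes = {}
    for name in sorted(old_pins.keys() | new_pins.keys()):
        before, after = old_pins.get(name), new_pins.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


# Hypothetical pins before and after regenerating the locked file.
old_lock = ["python=3.8.6=0", "numpy=1.19.4=py38_0", "libffi=3.2.1=0"]
new_lock = ["python=3.9.0=0", "numpy=1.19.4=py39_0", "libffi=3.2.1=0", "tzdata=2020d=0"]

for name, (before, after) in lock_diff(old_lock, new_lock).items():
    print(f"{name}: {before} -> {after}")
```

Reviewing a summary like this as part of the upgrade makes it easier to trace a later change in behavior back to a specific package.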

Some technical difficulties with conda env export

In the example above we used conda env export to generate the locked file from the current environment; the environment in turn was created from the versioned direct environment.yml. This has some issues:

  1. You may have manually installed some packages without adding them to environment.yml; the export will grab those too, so now your environment.yml and environment.lock.yml won’t match.
  2. The export file has a prefix entry at the end, containing the path of the environment on your machine, which you probably want to delete before using.
  3. Conda has a bug where channels are exported in random order, instead of the sort order in the original environment.yml.
  4. Different operating systems might install different packages; if you conda env export on macOS, for example, the resulting lock file won’t work in Docker.
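If you do stay with conda env export, the second issue at least is easy to script around. Here is a minimal sketch that drops the top-level prefix line from the exported text. The function name is hypothetical, the sample is a shortened version of the export shown earlier, and the sketch does nothing about the other three issues.

```python
def strip_prefix(exported_yaml):
    """Remove the machine-specific top-level "prefix:" line from an export."""
    kept = [
        line
        for line in exported_yaml.splitlines()
        if not line.startswith("prefix:")
    ]
    return "\n".join(kept) + "\n"


# Shortened sample of what conda env export writes.
exported = """\
name: example
channels:
  - conda-forge
  - defaults
dependencies:
  - libffi=3.2.1=he1b5a44_1007
prefix: /home/itamarst/.conda/envs/example
"""

print(strip_prefix(exported), end="")
```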

Luckily, there’s a tool that solves these issues.

Locking with conda-lock

Rather than creating an environment.yml, conda-lock creates a “lock file”, which is basically a set of URLs to download. This has the benefit of speeding up installs, since you don’t have to wait for the Conda package resolver.

In addition, you can specify which operating system you want to build the lock file for, so you can create a Linux lock file on other operating systems. By default it generates lock files for Linux, macOS, and 64-bit Windows, which is very convenient.
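As a sketch of what choosing a platform might look like from a build script, the helper below assembles a conda-lock command line and the lock file names to expect. The helper names are hypothetical. The -f and -p options are the file and platform options as I understand them from the conda-lock README, so check conda-lock --help for your version. The file name pattern is taken from the output shown in this article, and other versions of the tool may name their files differently.

```python
import shlex


def conda_lock_command(env_file, platforms):
    """Build the argument list for running conda-lock on chosen platforms.

    Assumes conda-lock accepts -f for the input file and a repeatable -p
    for each target platform; verify with conda-lock --help.
    """
    command = ["conda-lock", "-f", env_file]
    for platform in platforms:
        command += ["-p", platform]
    return command


def expected_lock_files(platforms):
    """File names following the conda-<platform>.lock pattern seen in this article."""
    return [f"conda-{platform}.lock" for platform in platforms]


command = conda_lock_command("environment.yml", ["linux-64"])
print(shlex.join(command))
print(expected_lock_files(["linux-64"]))
```

The list returned by conda_lock_command can be passed to subprocess.run on a machine where conda-lock is installed.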

As a reminder, our environment.yml looks like this:

name: example
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pandas=1.0

I’ve had some trouble running conda-lock directly—hopefully it’ll work for you—but it worked fine in a Docker container:

$ docker run -v ${PWD}:/data -w /data python:3.8-slim-buster \
    bash -c "pip install conda-lock && conda-lock"
...
generating lockfile for osx-64
generating lockfile for linux-64
generating lockfile for win-64
To use the generated lock files create a new environment:

     conda create --name YOURENV --file conda-linux-64.lock
$ ls
conda-linux-64.lock  conda-osx-64.lock  conda-win-64.lock  environment.yml

As explained in that message, you can now create an environment from the lock files, with a slightly different syntax than a normal conda env create. In my case, I’ll use the Linux version:

$ conda create --name fromlock --file conda-linux-64.lock
...
$ conda activate fromlock
(fromlock) $ python
Python 3.8.6 | packaged by conda-forge | (default, Oct  7 2020, 19:08:05) 
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.__version__
'1.0.1'

And here’s what the conda-linux-64.lock file looks like:

# platform: linux-64
# env_hash: 1c5ab33dc2ffb2cdf714d63bec414d84ea77143c13958b1743d374052986892f

@EXPLICIT

https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2#d7c89558ba9fa0495403155b64376d81
https://conda.anaconda.org/conda-forge/linux-64/ca-certificates-2020.11.8-ha878542_0.tar.bz2#f9cdccd43ac20a0d1637d84d58c6ff5c
https://conda.anaconda.org/conda-forge/linux-64/ld_impl_linux-64-2.35.1-hed1e6ac_0.tar.bz2#d0cf77c331382475133dc6c34e7461d7
...

The Python session above confirms that we got what we requested: Python 3.8 and Pandas 1.0.
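Because the lock file is just a list of URLs, it is easy to inspect with a few lines of Python. The sketch below is based only on the excerpt shown in this article, and it makes two assumptions: that package file names follow a name-version-build pattern, and that the text after the # is a checksum. The function name is hypothetical.

```python
def parse_lock(text):
    """Extract name, version, build and checksum from explicit lock file text."""
    packages = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and the @EXPLICIT marker.
        if not line or line.startswith("#") or line.startswith("@"):
            continue
        url, _, checksum = line.partition("#")
        filename = url.rsplit("/", 1)[-1]
        for extension in (".tar.bz2", ".conda"):
            if filename.endswith(extension):
                filename = filename[: -len(extension)]
        # Package names may contain dashes, so split from the right.
        name, version, build = filename.rsplit("-", 2)
        packages.append(
            {"name": name, "version": version, "build": build, "checksum": checksum}
        )
    return packages


lock_text = """\
# platform: linux-64
# env_hash: 1c5ab33dc2ffb2cdf714d63bec414d84ea77143c13958b1743d374052986892f

@EXPLICIT

https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2#d7c89558ba9fa0495403155b64376d81
https://conda.anaconda.org/conda-forge/linux-64/ca-certificates-2020.11.8-ha878542_0.tar.bz2#f9cdccd43ac20a0d1637d84d58c6ff5c
https://conda.anaconda.org/conda-forge/linux-64/ld_impl_linux-64-2.35.1-hed1e6ac_0.tar.bz2#d0cf77c331382475133dc6c34e7461d7
"""

for package in parse_lock(lock_text):
    print(package["name"], package["version"], package["build"])
```

Splitting from the right matters for names such as ld_impl_linux-64 and ca-certificates, which contain dashes themselves.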

One caveat to keep in mind: pip dependencies won’t be included in the lock file. To deal with those you can use a two-stage install, where you pip install once the environment has been created. There are a number of tools for managing lock files for pip/PyPI packages.

Recap

Since that was a lot, here’s what we’ve learned:

  1. You want both reproducibility and easy updates.
  2. Versioned direct dependency files (“the packages my code imports”) give you easy updates; locked dependency files give you reproducibility.
  3. In practice, you want both.
  4. conda-lock lets you turn a versioned direct dependency environment.yml into a lock file listing specific versions of the transitive dependencies.

Not using a lock file? It’s quick and easy: go and do it right now, and your builds will be reproducible going forward.

