Pip vs Conda: an in-depth comparison of Python’s two packaging systems

If you’re using Python in the world of data science or scientific computing, you will soon discover that Python has two different packaging systems: pip and Conda. Which raises some questions:

  • How are they different?
  • What are the tradeoffs between the two?
  • Which should you use?

While it’s not possible to answer these questions for every situation, in this article you will learn the basic differences, with the scope constrained to:

  • Python only; Conda has support for other languages but I won’t go into that.
  • Linux, including running on Docker, though with some mention of macOS and Windows.
  • Focusing on the Conda-Forge package repository; Conda has multiple package repositories, or “channels”.

By the end you should understand why Conda exists, when you might want to use it, and the tradeoffs between choosing each one.

The starting point: which kind of dependencies?

The fundamental difference between pip and Conda packaging is what they put in packages.

  • Pip packages are Python libraries like NumPy or matplotlib.
  • Conda packages include Python libraries (NumPy or matplotlib), C libraries (libjpeg), and executables (like C compilers, and even the Python interpreter itself).

Pip: Python libraries only

For example, let’s say you want to install Python 3.9 with NumPy, Pandas, and the gnuplot rendering tool, a tool that is unrelated to Python. Here’s what the pip requirements.txt would look like:

numpy
pandas

Installing Python and gnuplot is out of scope for pip. You as a user must deal with this yourself. You might, for example, do so with a Docker image:

FROM ubuntu:20.04
RUN apt-get update && apt-get install -y gnuplot python3.9
COPY requirements.txt .
RUN pip install -r requirements.txt

Both the Python interpreter and gnuplot need to come from system packages, in this case Ubuntu’s packages.
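
Since pip cannot install a tool like gnuplot, a project that needs one has to check for it some other way. Here is a minimal sketch of such a check using the standard library's shutil.which; the helper name missing_tools is hypothetical and only for illustration:

```python
import shutil


def missing_tools(required):
    """Return the names in `required` that cannot be found on PATH."""
    return [name for name in required if shutil.which(name) is None]


if __name__ == "__main__":
    # gnuplot comes from the system package manager (apt-get above), not pip.
    missing = missing_tools(["gnuplot"])
    if missing:
        print("Install with your system package manager:", ", ".join(missing))
    else:
        print("All external tools found.")
```

A check like this only reports the problem; installing the missing tool is still up to you and your operating system.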

Conda: Any dependency can be a Conda package (almost)

With Conda, Python and gnuplot are just more Conda packages, no different than NumPy or Pandas. The environment.yml that corresponds (somewhat) to the requirements.txt we saw above will include all of these packages:

name: myenv
channels:
  - conda-forge
dependencies:
  - python=3.9
  - numpy
  - pandas
  - gnuplot

Conda only relies on the operating system for basic facilities, like the standard C library. Everything above that is Conda packages, not system packages.

We can see the difference in the corresponding Dockerfile; there is no need to install any system packages:

FROM continuumio/miniconda3
COPY environment.yml .
RUN conda env create

This base image ships with Conda pre-installed, but we’re not relying on any existing Python install, we’re installing a new one in the new environment.
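
If you want to confirm which interpreter you ended up with, you can ask Python itself. The sketch below relies on a heuristic, not an official API: Conda environments keep their package records in a conda-meta directory under the environment prefix. The function name is_conda_prefix is hypothetical.

```python
import os
import sys


def is_conda_prefix(prefix):
    """Heuristic: Conda environments store package records in <prefix>/conda-meta."""
    return os.path.isdir(os.path.join(prefix, "conda-meta"))


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Environment prefix:", sys.prefix)
    print("Looks like a Conda environment:", is_conda_prefix(sys.prefix))
```

Run inside the activated environment, the interpreter path should point into that environment and not at a system-wide Python.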

Note: Outside any specific best practice being demonstrated, the Dockerfiles in this article are not examples of best practices, since the added complexity would obscure the main point of the article.

Why Conda packages everything

Why did Conda make the decision to package everything, Python interpreter included? How does this benefit you? In part it’s about portability and reproducibility.

  1. Portability across operating systems: Instead of installing Python in three different ways on Linux, macOS, and Windows, you can use the same environment.yml on all three.
  2. Reproducibility: It’s possible to pin almost the whole stack, from the Python interpreter upwards.
  3. Consistent configuration: You don’t need to install system packages and Python packages in two different ways; (almost) everything can go in one file, the environment.yml.

But it also addresses another problem: how to deal with Python libraries that require compiled code. That’s a big enough topic that it gets a whole new section, next.

Beyond pure Python: Packaging compiled extensions

In the early days of Python packaging, a package included just the source code that needed to be installed. For pure Python packages, this worked fine, and still does. But what happens when you need to compile some Rust or C or C++ or Fortran code as part of building the package?

Solution #1: Compile it yourself

The original solution was to have each user compile the code themselves at install time. This can be quite slow, wastes resources, is often painful to configure, and still doesn’t solve a big part of the problem: shared library dependencies.

The Pillow image graphics library, for example, relies on third party shared libraries like libpng and libjpeg. In order to compile Pillow yourself, you have to install all of them, plus their development headers. On Linux or macOS you can install the system packages or the Homebrew packages; for Windows this can be more difficult. But you’re going to have to write different configuration for every single OS and even Linux distribution.

Solution #2: Pip wheels

The way pip solves this problem is with packages called “wheels” that can include compiled code. In order to deal with shared library dependencies like libpng, any shared library external dependencies get bundled inside the wheel itself.

For example, let’s look at a Pillow wheel for Linux; a wheel is just a ZIP file so we can use standard ZIP tools:

$ zipinfo Pillow.whl
...
Pillow.libs/libpng16-213e245f.so.16.37.0
Pillow.libs/libjpeg-183418da.so.9.4.0
...
PIL/FpxImagePlugin.py
PIL/PalmImagePlugin.py
...
PIL/_imagingcms.cpython-39-x86_64-linux-gnu.so
...

The wheel includes Python code, a compiled Python extension, and third-party shared libraries like libpng and libjpeg. This can sometimes make packages larger, as multiple copies of third-party shared libraries may be installed, one per wheel.
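
Because a wheel is just a ZIP file, the same inspection can be done from Python with the standard zipfile module. This is a minimal sketch that assumes the Linux layout shown in the listing above, where bundled libraries sit in a top-level directory whose name ends in ".libs"; wheels for other platforms may be laid out differently. The function name bundled_libraries is hypothetical.

```python
import re
import zipfile

# Matches file names ending in ".so" plus an optional version, e.g. ".so.16.37.0".
_SHARED_LIB = re.compile(r"\.so(\.\d+)*$")


def bundled_libraries(wheel):
    """Return shared libraries stored under a top-level "*.libs/" directory.

    `wheel` may be a path to a .whl file or a file-like object.
    """
    with zipfile.ZipFile(wheel) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.split("/")[0].endswith(".libs") and _SHARED_LIB.search(name)
        )
```

Calling bundled_libraries("Pillow.whl") on the wheel above would return the entries under Pillow.libs/, while skipping the compiled extension that lives in PIL/.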

Solution #3: Conda packages

Conda packages take a different approach to third-party shared libraries. libjpeg and libpng are packaged as additional Conda packages:

$ conda install -c conda-forge pillow
...
The following NEW packages will be INSTALLED:

...
  jpeg               conda-forge/linux-64::jpeg-9d-h36c2ea0_0
...
  libpng             conda-forge/linux-64::libpng-1.6.37-h21135ba_2
...
  pillow             conda-forge/linux-64::pillow-7.2.0-py38h9776b28_2
  zstd               conda-forge/linux-64::zstd-1.5.0-ha95c52a_0
...

Those installed libjpeg and libpng can then be depended on by other installed packages. They’re not wheel-specific, they’re available to any package in the Conda environment.

Conda can do this because it’s not a packaging system only for Python code; it can just as easily package shared libraries or executables.
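
You can see this sharing from inside an environment. The sketch below assumes that each installed Conda package has a JSON record under the environment's conda-meta directory with "name" and "depends" keys; that is how Conda environments I have looked at are laid out, but treat the exact format as an assumption. The function name dependents_of is hypothetical.

```python
import json
from pathlib import Path


def dependents_of(prefix, library):
    """Return names of installed Conda packages that depend on `library`."""
    found = []
    for record_path in Path(prefix, "conda-meta").glob("*.json"):
        record = json.loads(record_path.read_text())
        # Each entry in "depends" looks like "libpng >=1.6.37,<1.7.0a0".
        for spec in record.get("depends", []):
            if spec.partition(" ")[0] == library:
                found.append(record["name"])
                break
    return sorted(found)
```

For example, dependents_of(sys.prefix, "libpng") inside an activated environment lists every installed package sharing that single copy of libpng.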

Summary: pip vs Conda

                            pip               Conda
Installs Python             No                Yes, as package
3rd-party shared libraries  Inside the wheel  Yes, as package
Executables and tools       No                Yes, as package
Python source code          Yes, as package   Yes, as package

PyPI vs. Conda-Forge

Another fundamental difference between pip and Conda is less about the tools themselves, and more about the package repositories they rely on and how they work. In particular, most Python programs will rely on open source libraries, and these need to be downloaded from somewhere. For these, pip relies on PyPI, whereas Conda supports multiple different “channels” hosted on Anaconda.

The default Conda channel is maintained by Anaconda Inc, the company that created Conda. It tends to have limited package selection and be somewhat less up-to-date, with some potential benefits regarding stability and GPU support. Beyond that I don’t know that much about it.

But there’s also the Conda-Forge community channel, which offers far more packages, tends to be up-to-date, and is where you probably want to get your Conda packages most of the time. You can mix packages from the default channel and Conda-Forge, if you want the default channel’s GPU packages.

Let’s compare PyPI with Conda-Forge.

PyPI

Packages on PyPI are typically uploaded by the author of the Python package. For example, I am the author of the Fil memory profiler, and I also created the PyPI package.

Each package maintainer might compile or build their packages in their own idiosyncratic way, maintaining their own build infrastructure, choosing their own compilation options, and so on.

For example, NumPy can rely on multiple different BLAS libraries for fast linear algebra operations. The maintainers have chosen to build their PyPI packages with OpenBLAS; if you want another option, like Intel’s (maybe?) faster MKL, you’re out of luck unless you’re willing to compile the code yourself.

Conda-Forge

Conda-Forge is a community project where package maintainers can be different than the original author of the package. For example, I have commit access to the typeguard Conda-Forge recipe even though I am not a maintainer of the typeguard library.

Instead of custom builds done differently by each package maintainer, Conda-Forge has centralized build systems that recompile libraries, update recipe repositories, and in general automate everything massively. When a new version of Python 3 comes out, for example, a centralized update will happen: all the individual package maintainers will get PRs adding new packages. On PyPI this is up to individual maintainers to figure out.

Because the packaging infrastructure is centralized, Conda-Forge is able to let you choose which BLAS to use, and it will be used for NumPy and SciPy and whatever other packages you use that rely on BLAS.

Dealing with PyPI-only packages in Conda

While Conda-Forge has many packages, it doesn’t have all of them; many Python packages can only be found on PyPI. You can deal with the lack of these packages in a number of ways.

Install pip packages in a Conda environment

Conda environments play the same role as virtualenvs, and the Python interpreter inside one can run pip like any other; as such you can just call pip install yourself. If you’re using an environment.yml to install your Conda packages, you can also add pip packages:

name: myenv
channels:
  - conda-forge
dependencies:
  - python=3.9
  - numpy
  - pandas
  - gnuplot
  - pip:
      # Package that is only on PyPI
      - sandu 
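
Once an environment mixes both kinds of packages, it can be useful to see which tool installed what. Python packaging metadata includes an INSTALLER file where the installing tool records its name; pip writes "pip" there, and my understanding is that Conda-built Python packages usually record "conda", though you should verify that in your own environment. A minimal sketch using the standard importlib.metadata module (the function name installers is hypothetical):

```python
from importlib import metadata


def installers(path=None):
    """Map each distribution name to the tool recorded in its INSTALLER file.

    `path` is an optional list of directories to search; the default is sys.path.
    """
    kwargs = {} if path is None else {"path": path}
    result = {}
    for dist in metadata.distributions(**kwargs):
        name = dist.metadata["Name"]
        if name is None:
            continue  # skip entries without usable metadata
        installer = dist.read_text("INSTALLER")  # None if the file is absent
        result[name] = (installer or "unknown").strip()
    return result
```

In the environment above, installers().get("sandu") should report pip, since that package came from the pip section of the environment.yml.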

Package it for Conda-Forge yourself

Because Conda-Forge does not require maintainers of the code to do the packaging, anyone can volunteer to add a package to Conda-Forge. That includes you!

For many Python packages it’s a surprisingly easy process, and it’s quite automated, so handling new releases is often as easy as approving an automatically-created PR.

Summary: PyPI vs. Conda-Forge

                              PyPI                           Conda-Forge
Who creates package?          Author of code                 Anyone
Build infrastructure          Maintained by author           Centralized
Open source Python libraries  Essentially all                Many
Other open source tools       None                           Many
Windows/Linux/macOS packages  Usually, but up to maintainer  Almost always

Additional tooling for Pip and Conda

Here’s a quick summary of some of the additional tooling you might want to use with either one:

                      Pip                         Conda
Reproducible builds   pip-tools, pipenv, Poetry   conda-lock
Virtual environments  python -m venv, virtualenv  Built-in
Security scanning     Most security scanners      Jake
Alternatives          Poetry, pipenv              Mamba; much faster, highly recommended

To reiterate: if you do use Conda, I highly recommend using Mamba as a replacement. It supports the same command-line options and is much faster.

Which should you use?

So which should you use, pip or Conda? For general Python computing, pip and PyPI are usually fine, and the surrounding tooling tends to be better.

For data science or scientific computing, however, Conda’s ability to package third-party libraries, and the centralized infrastructure provided by Conda-Forge, means setup of complex packages will often be easier. In the end, which works best for you will depend on your situation and requirements; quite possibly both will be fine.