
Reproducible Notebooks with Pixi

Written by Wolf Vollprecht 2 months ago

Data scientists and researchers love to work with Jupyter Notebooks. It's a great tool for several reasons: it provides a way to interactively explore data, make plots, and share the contents in the form of "literate programming" (a term coined by Donald Knuth!).

Many Jupyter Notebook users also like to use conda packages to set up their environment. Before pixi, it was relatively tricky to get a reproducible conda environment set up. You could do it with community projects like conda-lock or conda-pack, but that was by no means easy.

With pixi, we learned a lot of lessons from other tools, not only from the conda ecosystem but also from other ecosystems like npm, pip, and Docker. We wanted to make it easy to create a reproducible environment and to share it with others.

Let us walk through the process of creating a shareable, reproducible pixi environment to work with JupyterLab.

First, we start with a pixi.toml file:

[project]
name = "data-explorations"
version = "0.1.0"
description = "A notebook environment to explore data."
authors = ["The user <user@email.com>"]
channels = ["conda-forge"]
platforms = ["linux-64", "osx-arm64", "osx-64", "win-64"]

[tasks]
start = "jupyter lab"

[dependencies]
jupyterlab = "4.*"

As you can see, we have a few sections in the pixi.toml file:

  • project: This section contains metadata about the project, such as the name, version, description, authors, and the channels to use for the environment.
  • tasks: This section contains a list of tasks that can be run with pixi. In this case, we have a single task called start that runs jupyter lab. Tasks are written in a syntax that is very similar to bash and that runs on all platforms.
  • dependencies: This section contains a list of dependencies that should be installed. These dependencies are installed from the channels specified in the project section. Later we will see how to add PyPI dependencies.

To create the environment, we can run pixi install. But it can be even simpler: we can just run pixi run start to execute the start task and pixi will take care of everything: resolving, downloading and installing the environment.

Reproducibility with lockfiles

If you execute pixi install or pixi run start you might see multiple progress bars. That is because pixi is resolving the environment for all four platforms at once. This is a powerful feature of pixi: it can create a lockfile that contains the exact versions of all dependencies for all platforms. This lockfile can be shared with others, and they can use it to recreate the same environment that you have. The lockfile is written as pixi.lock and lives right next to the pixi.toml file. For most projects we advise checking the lockfile into the git repository that hosts the rest of the code, so that you know which versions of the packages were used at the time of the last commit.

version: 4
environments:
  default:
    channels:
    - url: https://conda.anaconda.org/conda-forge/
    packages:
      linux-64:
      - conda: https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
      - conda: https://conda.anaconda.org/conda-forge/linux-64/_openmp_mutex-4.5-2_gnu.tar.bz2
      - conda: https://conda.anaconda.org/conda-forge/linux-64/alsa-lib-1.2.10-hd590300_0.conda
      - conda: https://conda.anaconda.org/conda-forge/linux-64/argon2-cffi-bindings-21.2.0-py311h459d7ec_4.conda
      - conda: https://conda.anaconda.org/conda-forge/linux-64/attr-2.5.1-h166bdaf_1.tar.bz2
      - conda: https://conda.anaconda.org/conda-forge/linux-64/brotli-1.1.0-hd590300_1.conda
      ...

If you check in your lockfile, your coworkers will also have a very easy time getting started: no need to wait for any dependency resolutions! Pixi will just install the environment as specified in the lockfile.
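
Each conda entry in the lockfile excerpt is a full download URL, and conda package filenames follow the pattern name-version-build. The following sketch, assuming only that URL layout, shows how the exact pinned version can be read back from an entry. The parse_conda_url function is a hypothetical helper, not a pixi API, and the lockfile format itself may change between lockfile versions.

```python
from urllib.parse import urlparse


def parse_conda_url(url: str) -> dict:
    """Split a conda package URL into platform, name, version and build string."""
    path = urlparse(url).path
    # The last two path segments are the platform subdirectory and the filename.
    platform, filename = path.rstrip("/").split("/")[-2:]
    # Strip the archive extension (.conda or the older .tar.bz2).
    for ext in (".conda", ".tar.bz2"):
        if filename.endswith(ext):
            filename = filename[: -len(ext)]
            break
    # Package names may contain dashes, so split from the right.
    name, version, build = filename.rsplit("-", 2)
    return {"platform": platform, "name": name, "version": version, "build": build}


# Two entries copied from the lockfile excerpt above.
urls = [
    "https://conda.anaconda.org/conda-forge/linux-64/alsa-lib-1.2.10-hd590300_0.conda",
    "https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2",
]
for url in urls:
    print(parse_conda_url(url))
```

Because the version and build string are both part of the URL, two people installing from the same lockfile fetch the very same package files.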

Adding more dependencies

It wouldn't be science if you did not use more than just JupyterLab and the Python standard library! To add more packages to your environment you can either edit the toml file or add them via the pixi CLI:

pixi add numpy

This will add the numpy package to the dependencies section of the pixi.toml file and install it from conda-forge (or any other channel you specified).

Sometimes there may be a dependency that is not yet on conda-forge. In that case, it is just as easy to add it from PyPI, the Python Package Index:

pixi add --pypi matplotlib

This will add the matplotlib package to the pypi-dependencies section of the pixi.toml file and install it from PyPI. The package is also recorded in the lockfile. However, our advice is to use conda packages from conda-forge wherever possible because they are safer and faster to install.
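
After adding packages, a first notebook cell can confirm that the environment actually provides them. The sketch below uses only the standard library; missing_packages is a hypothetical helper name, and it checks top-level import names, which happen to match the package names for numpy and matplotlib but do not for every package.

```python
from importlib.util import find_spec


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Import names for the two packages added above.
required = ["numpy", "matplotlib"]
missing = missing_packages(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All packages are available.")
```

If something is reported as missing, the notebook kernel is probably not running inside the pixi environment; starting JupyterLab through pixi run start avoids that.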

Adding more tasks

If you want to prepare your data, download some datasets from the internet or perform any other repetitive tasks, you can easily add more tasks to your pixi.toml file:

[tasks]
start = "jupyter lab"
prepare = "python prepare_data.py"
index_data = { cmd = "python index_data.py", depends_on = ["prepare"] }

In this example, we have added two new tasks: prepare and index_data. The index_data task depends on the prepare task, so pixi will make sure that prepare is executed before index_data.
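
The ordering guarantee can be pictured as a topological sort over the task graph. The sketch below illustrates the idea with the standard-library graphlib module (Python 3.9 or newer); it is not pixi's actual implementation, and execution_order is a hypothetical helper written for this example.

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9

# The tasks table from above, written as a plain dictionary.
TASKS = {
    "start": {"cmd": "jupyter lab"},
    "prepare": {"cmd": "python prepare_data.py"},
    "index_data": {"cmd": "python index_data.py", "depends_on": ["prepare"]},
}


def execution_order(tasks: dict, target: str) -> list[str]:
    """Return the target task preceded by everything it depends on."""
    graph = {}
    pending = [target]
    # Walk the dependency chain so that unrelated tasks are left out.
    while pending:
        name = pending.pop()
        if name in graph:
            continue
        deps = tasks[name].get("depends_on", [])
        graph[name] = set(deps)
        pending.extend(deps)
    # static_order() yields each task only after all of its dependencies.
    return list(TopologicalSorter(graph).static_order())


print(execution_order(TASKS, "index_data"))  # ['prepare', 'index_data']
```

Note that start does not appear in the result: only the tasks that index_data needs are scheduled.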

Read more

We hope this gives you a good overview of how to use pixi to create a reproducible environment for your Jupyter Notebooks. If you want to learn more, check out the pixi documentation. It also describes advanced features such as creating multiple environments with optional dependencies, e.g. for testing or building documentation.