What linking means when installing a Conda package

Understanding Package Linking in Conda

Reflinks, Hardlinks, and Copy Operations

Package managers face a fundamental challenge: how to efficiently place files from a package cache into multiple environments without excessive disk usage or compromising isolation. Pixi (and Conda) address this through three distinct linking strategies, each with specific trade-offs and file system requirements.

The Linking Problem

When Pixi installs a package, it must transfer files from its central package cache to the target environment (you can find the cache location with pixi info; on my system it is /Users/wolfv/Library/Caches/rattler/cache). A straightforward file copy would work but becomes problematic at scale. Consider a system with 20 environments, each containing PyTorch (~2.5GB). Copying would consume 50GB of disk space for identical files.

The solution involves three linking methods, applied in order of preference: reflinking, hardlinking, and copying.
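
To make the order of preference concrete, here is a minimal sketch of the fallback chain for a single file. The link_file helper is hypothetical and only illustrates the idea; it is not how Pixi implements it. It also assumes GNU cp, since --reflink is a GNU flag (elsewhere the first step simply fails and the next one is tried).

```shell
# Hypothetical helper: place one file using the first strategy that works.
link_file() (
    src=$1
    dst=$2
    # 1. Reflink (copy-on-write clone). Fails where the flag or the
    #    file system does not support it, and we fall through.
    if cp --reflink=always "$src" "$dst" 2>/dev/null; then
        echo "reflink"
        exit 0
    fi
    rm -f "$dst"   # remove any partial file left by the failed attempt
    # 2. Hardlink: only works within one file system.
    if ln "$src" "$dst" 2>/dev/null; then
        echo "hardlink"
        exit 0
    fi
    # 3. Plain copy: always works, but duplicates the data.
    cp "$src" "$dst" && echo "copy"
)

demo_dir=$(mktemp -d)
echo "package data" > "$demo_dir/cached_file"
method=$(link_file "$demo_dir/cached_file" "$demo_dir/env_file")
echo "linked via: $method"
```

Which method gets printed depends on the file system that holds the temporary directory.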

Copy Operations

File copying represents the fallback mechanism when other methods aren't available. The operation duplicates data blocks from source to destination:

cp /path/to/conda/pkgs/numpy-1.24.3/lib/python3.11/site-packages/numpy/* \
   /path/to/env/lib/python3.11/site-packages/numpy/

This approach guarantees compatibility across all file systems and provides complete isolation between environments. However, it consumes disk space linearly with the number of environments and requires time proportional to file size.
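
The linear growth is easy to observe with a small experiment. The sketch below uses a 1 MiB random file as a stand-in for a package file and five made-up environment directories; all paths are illustrative.

```shell
copy_dir=$(mktemp -d)
mkdir "$copy_dir/cache"
# 1 MiB of random data stands in for a package file
head -c 1048576 /dev/urandom > "$copy_dir/cache/lib.bin"

# "Install" the file into five environments by copying it
for i in 1 2 3 4 5; do
    mkdir "$copy_dir/env$i"
    cp "$copy_dir/cache/lib.bin" "$copy_dir/env$i/lib.bin"
done

sync   # flush pending writes so du sees the allocated blocks

# du -sk reports usage in KiB
cache_kib=$(du -sk "$copy_dir/cache" | awk '{print $1}')
total_kib=$(du -sk "$copy_dir" | awk '{print $1}')
echo "cache alone: ${cache_kib} KiB, cache + 5 envs: ${total_kib} KiB"
```

The total should come out at roughly six times the size of the cache alone: one cached file plus five copies.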

Hardlinks

Hardlinks create additional directory entries pointing to the same inode. The file system maintains a reference count, deleting the underlying data only when all links are removed.

ln /path/to/conda/pkgs/numpy-1.24.3/lib/libnpymath.so \
   /path/to/env/lib/libnpymath.so

Key characteristics of hardlinks:

  • Source and destination must reside on the same file system

  • Modifications to the file content affect all hardlinked copies

  • Most common file systems support hardlinks (ext4, NTFS, HFS+, APFS)

  • Cannot hardlink directories
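
These properties can be checked in a scratch directory. In this small sketch, cache_file and env_file are made-up names standing in for a cached package file and its counterpart in an environment.

```shell
hl_dir=$(mktemp -d)
echo "original" > "$hl_dir/cache_file"
ln "$hl_dir/cache_file" "$hl_dir/env_file"

# Both names show the same inode number and a link count of 2
ls -li "$hl_dir"
if [ "$hl_dir/cache_file" -ef "$hl_dir/env_file" ]; then
    echo "same inode"
fi

# Appending through the environment path writes to the shared data ...
echo "patched for debugging" >> "$hl_dir/env_file"
# ... so the change is visible through the cache path as well
cat "$hl_dir/cache_file"

# Removing one name leaves the data reachable through the other
rm "$hl_dir/env_file"
cat "$hl_dir/cache_file"
```

Note that this only holds for in-place writes. Tools that save by writing a new file and renaming it over the old one break the link instead of modifying the shared data.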

Conda does not yet use reflinks (unlike Pixi). The main drawback of hardlinks is that modifying a file in one environment (e.g. while debugging) changes it in every other environment that references the file, and in the package cache itself. For this reason, Pixi prefers reflinks wherever the file system supports them.

Reflinks

Reflinks leverage file system-level copy-on-write semantics. Initially, both source and destination share the same data blocks. When either file is modified, the file system automatically creates new blocks for the changed portions only.

# Creating a reflink with GNU cp on supported systems (on macOS, cp -c clones the file instead)
cp --reflink=always /path/to/source /path/to/dest

The implementation varies by file system but generally involves:

  • Metadata pointing both files to shared data blocks

  • Block-level copy-on-write triggered by modifications

  • Transparent to applications - files appear independent

This fixes the main drawback of hardlinks: a file modified in one environment stays unchanged in all other environments, because the affected blocks are copied before they are written to.
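
The same experiment as with hardlinks shows the difference. This sketch tries the GNU flag first, then the macOS clone flag, and finally a plain copy, so it runs everywhere. Whether blocks are actually shared depends on the file system, but the isolation it demonstrates holds either way. The file names are again made up.

```shell
cow_dir=$(mktemp -d)
echo "original" > "$cow_dir/cache_file"

# Try a GNU-style reflink first, then the macOS clone flag, then a plain copy
cp --reflink=auto "$cow_dir/cache_file" "$cow_dir/env_file" 2>/dev/null \
    || cp -c "$cow_dir/cache_file" "$cow_dir/env_file" 2>/dev/null \
    || cp "$cow_dir/cache_file" "$cow_dir/env_file"

# Modify the file in the "environment" only
echo "patched for debugging" >> "$cow_dir/env_file"

cat "$cow_dir/cache_file"   # still contains only "original"
cat "$cow_dir/env_file"     # contains the original line plus the patch
```

Unlike a hardlink, the two paths are separate files with separate inodes; only the unmodified data blocks are shared behind the scenes.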

In Pixi we can use the excellent reflink-copy crate.

File System Support Matrix

Reflink support depends on the underlying file system:

  • Btrfs: Full support since inception. Reflinks are created using cp --reflink or the FICLONE ioctl.

  • XFS: Reflink support must be enabled at file system creation time (recent versions of mkfs.xfs enable it by default):

mkfs.xfs -m reflink=1 /dev/device

  • APFS: Apple's file system supports cloning through the clonefile() system call, functionally equivalent to reflinks. APFS is the default on all modern macOS devices.

  • ZFS: Implements similar functionality through dataset clones, though with different semantics than file-level reflinks. OpenZFS 2.2 and later add block cloning, which provides file-level copy-on-write.

  • ext4: No reflink support. The ext4 development team has prioritized stability over copy-on-write features.

  • NTFS: Windows' primary file system lacks reflink support, though ReFS provides similar functionality.

  • Windows Dev Drive: On Windows, it's highly recommended to set up a "Dev Drive", a high-speed ReFS volume that supports reflinks.
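
If you are unsure what a given directory supports, you can simply try. The supports_reflink helper below is a hypothetical probe, assuming GNU cp: it attempts a strict reflink of a throwaway file and cleans up afterwards. With a cp that lacks the --reflink flag (such as the one shipped with macOS) it always answers "no", regardless of the file system.

```shell
# Hypothetical helper: report whether reflinks work inside a directory.
supports_reflink() (
    dir=$1
    probe_src=$(mktemp "$dir/.reflink-probe.XXXXXX") || exit 2
    probe_dst="$probe_src.clone"
    echo "probe" > "$probe_src"
    # --reflink=always fails instead of silently falling back to a copy
    cp --reflink=always "$probe_src" "$probe_dst" 2>/dev/null
    status=$?
    rm -f "$probe_src" "$probe_dst"
    [ "$status" -eq 0 ]
)

probe_target=$(mktemp -d)
if supports_reflink "$probe_target"; then
    reflink_ok=yes
else
    reflink_ok=no
fi
echo "reflinks supported in $probe_target: $reflink_ok"
```

Run it against the directory that will hold your environments, since support can differ from one mounted volume to the next.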

Practical Considerations

File system choice impacts Pixi and Conda performance significantly. For development machines with numerous environments, reflink-capable file systems provide substantial benefits when used with Pixi. On Windows, we can definitely recommend adding a Dev Drive and placing both the environments and the cache there.

Package cache location matters for linking. Both hardlinks and reflinks require the cache and the environments to reside on the same file system. Cross-device links fail, forcing a fallback to copying.
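
A quick way to check this is to compare where two directories are mounted. The helpers below are a hypothetical sketch built on POSIX df -P; the cache and envs directories are placeholders for your real locations.

```shell
# Hypothetical helper: print the device and mount point that hold a path.
fs_of() {
    df -P "$1" | awk 'NR == 2 { print $1, $NF }'
}

# Succeeds when both paths live on the same mounted file system.
same_fs() {
    [ "$(fs_of "$1")" = "$(fs_of "$2")" ]
}

layout=$(mktemp -d)
mkdir "$layout/cache" "$layout/envs"
if same_fs "$layout/cache" "$layout/envs"; then
    echo "same file system: hardlinks and reflinks are possible"
else
    echo "different file systems: files would have to be copied"
fi
```

Replace the two placeholder directories with your actual cache location (as reported by pixi info) and the directory that holds your environments.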

To change the cache location with Pixi, set the PIXI_CACHE_DIR environment variable (or RATTLER_CACHE_DIR, which applies to all rattler-based tools). To also move all environments to the same file system, you can use Pixi's "detached environments" mode (see the Pixi documentation).
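
For example, in a shell profile this could look as follows. The directory is a made-up location; pick one on the file system that also holds your environments.

```shell
# Hypothetical cache location; adjust the path to your own setup
export PIXI_CACHE_DIR="$HOME/dev/pixi-cache"

# RATTLER_CACHE_DIR can be set instead to cover all rattler-based tools
echo "Pixi cache directory: $PIXI_CACHE_DIR"
```

Afterwards, pixi info should report the new cache location.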

Conclusion

Package linking in Conda represents a careful balance between disk efficiency, installation speed, and environment isolation. Understanding these mechanisms helps in making informed decisions about file system configuration and troubleshooting installation issues. As file systems evolve and reflink support becomes more prevalent, the efficiency gap between different platforms will narrow, benefiting all Pixi (and hopefully soon also Conda) users regardless of their operating system choice.

Written by:
Wolf Vollprecht