What linking means when installing a Conda package
Understanding Package Linking in Conda
Reflinks, Hardlinks, and Copy Operations
Package managers face a fundamental challenge: how to efficiently place files from a package cache into multiple environments without excessive disk usage or compromising isolation. Pixi (and Conda) address this through three distinct linking strategies, each with specific trade-offs and file system requirements.
The Linking Problem
When Pixi installs a package, it must transfer files from its central package cache (you can find the location with pixi info; on my system it is at /Users/wolfv/Library/Caches/rattler/cache) to the target environment. A straightforward file copy would work but becomes problematic at scale. Consider a system with 20 environments, each containing PyTorch (~2.5GB). Copying would consume 50GB for identical files.
The solution involves three linking methods, applied in order of preference: reflinking, hardlinking, and copying.
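The order of preference can be sketched as a small shell function. This is only an illustration, not Pixi's actual implementation: link_file is a hypothetical helper, and the --reflink=always flag assumes GNU coreutils cp (with other cp implementations the first step simply fails and the function moves on).

```bash
# Hypothetical helper: place one file using reflink, then hardlink, then copy.
# Prints the name of the method that succeeded.
link_file() {
    src=$1
    dst=$2
    # 1. Reflink. Fails on file systems without copy-on-write support
    #    and on cp implementations that lack the GNU --reflink flag.
    if cp --reflink=always "$src" "$dst" 2>/dev/null; then
        echo reflink
        return 0
    fi
    rm -f "$dst"   # clear anything left behind by the failed attempt
    # 2. Hardlink. Fails if src and dst are on different file systems.
    if ln "$src" "$dst" 2>/dev/null; then
        echo hardlink
        return 0
    fi
    # 3. Plain copy as the last resort.
    cp "$src" "$dst" && echo copy
}

workdir=$(mktemp -d)
echo "package data" > "$workdir/cached-file"
method=$(link_file "$workdir/cached-file" "$workdir/env-file")
echo "placed file via: $method"
```

Whichever method ends up being used, the destination has the same content as the cached file; only disk usage and isolation differ.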
Copy Operations
File copying represents the fallback mechanism when other methods aren't available. The operation duplicates data blocks from source to destination:
cp -r /path/to/conda/pkgs/numpy-1.24.3/lib/python3.11/site-packages/numpy/* \
    /path/to/env/lib/python3.11/site-packages/numpy/
This approach guarantees compatibility across all file systems and provides complete isolation between environments. However, it consumes disk space linearly with the number of environments and requires time proportional to file size.
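The linear growth is easy to observe with du. The sketch below uses a made-up 1 MiB file as a stand-in for a package and places it into five hypothetical environments, once by copying and once by hardlinking (described in the next section). The directory names are placeholders.

```bash
workdir=$(mktemp -d)
mkdir -p "$workdir/cache"
# 1 MiB of random data stands in for a package file.
dd if=/dev/urandom of="$workdir/cache/pkg.bin" bs=1024 count=1024 2>/dev/null

for i in 1 2 3 4 5; do
    mkdir -p "$workdir/copied/env$i" "$workdir/linked/env$i"
    cp "$workdir/cache/pkg.bin" "$workdir/copied/env$i/pkg.bin"
    ln "$workdir/cache/pkg.bin" "$workdir/linked/env$i/pkg.bin"
done

sync   # make sure block counts are up to date before measuring
# du counts each inode once, so the five hardlinks are charged a single time.
copied_kb=$(du -sk "$workdir/copied" | awk '{print $1}')
linked_kb=$(du -sk "$workdir/linked" | awk '{print $1}')
echo "five copies:    ${copied_kb} KiB"
echo "five hardlinks: ${linked_kb} KiB"
```

The copied tree should report roughly five times the size of the file, while the hardlinked tree reports it about once.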
Hardlinks
Hardlinks create additional directory entries pointing to the same inode. The file system maintains a reference count, deleting the underlying data only when all links are removed.
ln /path/to/conda/pkgs/numpy-1.24.3/lib/libnpymath.so /path/to/env/lib/libnpymath.so
Key characteristics of hardlinks:
Source and destination must reside on the same file system
Modifications to the file content affect all hardlinked copies
Most mainstream file systems support hardlinks (ext4, NTFS, HFS+)
Cannot hardlink directories
The main drawback of hardlinks is that modifying a file in one environment (e.g. for debugging) also changes it in every other environment that references the file, and in the original cache. For this reason, Pixi uses reflinks by default. Conda does not yet use reflinks.
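A short experiment in a temporary directory makes this behaviour visible. The file names are placeholders standing in for a cached file and its counterpart in an environment.

```bash
workdir=$(mktemp -d)
echo "original" > "$workdir/cache-file"
ln "$workdir/cache-file" "$workdir/env-file"

# Both names resolve to the same inode number.
cache_inode=$(ls -i "$workdir/cache-file" | awk '{print $1}')
env_inode=$(ls -i "$workdir/env-file" | awk '{print $1}')
echo "inodes: $cache_inode $env_inode"

# An in-place write through one name is visible through the other.
echo "patched for debugging" > "$workdir/env-file"
seen_in_cache=$(cat "$workdir/cache-file")
echo "cache now contains: $seen_in_cache"

# Removing one name leaves the data reachable through the remaining one.
rm "$workdir/cache-file"
cat "$workdir/env-file"
```

Note that this applies to in-place writes. Tools that save by writing a new file and renaming it over the old one replace the directory entry instead, which breaks the link rather than modifying the shared data.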
Reflinks (Copy-on-Write)
Reflinks leverage file system-level copy-on-write semantics. Initially, both source and destination share the same data blocks. When either file is modified, the file system automatically creates new blocks for the changed portions only.
# Creating a reflink on supported systems
cp --reflink=always /path/to/source /path/to/dest
The implementation varies by file system but generally involves:
Metadata pointing both files to shared data blocks
Block-level copy-on-write triggered by modifications
Transparent to applications - files appear independent
This fixes the main drawback of hardlinks: files that are modified in one environment are not modified in the other environments (copied before they are written to).
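The same experiment as for hardlinks shows the difference. This is a sketch with placeholder file names; it assumes GNU cp for the --reflink flag and falls back to a plain copy so that it still runs on file systems without reflink support.

```bash
workdir=$(mktemp -d)
echo "original" > "$workdir/cache-file"

# Try a real reflink first; fall back to a plain copy where the file system
# (or the cp implementation) does not support it.
if cp --reflink=always "$workdir/cache-file" "$workdir/env-file" 2>/dev/null; then
    echo "created a reflink"
else
    rm -f "$workdir/env-file"
    cp "$workdir/cache-file" "$workdir/env-file"
    echo "reflink unavailable here, used a plain copy"
fi

# Modify the environment's file; the cached file keeps its content.
echo "patched for debugging" > "$workdir/env-file"
echo "cache: $(cat "$workdir/cache-file")"
echo "env:   $(cat "$workdir/env-file")"
```

From the application's point of view a reflink and a copy behave identically here; the difference is only in how many data blocks are allocated on disk.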
In Pixi we can use the excellent reflink-copy crate.
File System Support Matrix
Reflink support depends on the underlying file system:
Btrfs: Full support since inception. Reflinks are created using cp --reflink or the FICLONE ioctl.
XFS: Requires explicit enablement at file system creation time: mkfs.xfs -m reflink=1 /dev/device
APFS: Apple's file system supports cloning through the clonefile() system call, functionally equivalent to reflinks. APFS is the default on all modern macOS devices.
ZFS: Implements similar functionality through dataset clones, though with different semantics than file-level reflinks.
ext4: No reflink support. The ext4 development team has prioritized stability over copy-on-write features.
NTFS: Windows' primary file system lacks reflink support, though ReFS provides similar functionality.
Windows Dev Drive: On Windows, it is highly recommended to set up a "Dev Drive", a high-speed ReFS volume that supports reflinks.
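Rather than relying on the file system name alone, you can probe a directory directly. The function below is a hypothetical helper, not part of Pixi or Conda, and it assumes GNU cp; on systems whose cp lacks --reflink it reports "not available" even if the file system itself could clone files.

```bash
# Hypothetical helper: returns 0 if a reflink can be created inside DIR,
# 1 if not, and 2 if DIR is not writable.
supports_reflink() {
    probe_dir=$(mktemp -d "$1/reflink-probe.XXXXXX") || return 2
    echo probe > "$probe_dir/src"
    if cp --reflink=always "$probe_dir/src" "$probe_dir/dst" 2>/dev/null; then
        rc=0
    else
        rc=1
    fi
    rm -rf "$probe_dir"   # leave no probe files behind
    return "$rc"
}

# Directory to test; defaults to the temporary directory.
target=${1:-${TMPDIR:-/tmp}}
if supports_reflink "$target"; then
    echo "$target: reflinks supported"
else
    echo "$target: reflinks not available (or cp lacks --reflink)"
fi
```

Run it against the directory that holds your package cache and against the one that holds your environments, since the two may live on different file systems.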
Practical Considerations
File system choice impacts Pixi and Conda performance significantly. For development machines with numerous environments, reflink-capable file systems provide substantial benefits when used with Pixi. On Windows, we recommend creating a Dev Drive and placing both the environments and the cache on it.
Package cache location matters for hardlinking. The cache and environments must reside on the same file system partition. Cross-device links fail, forcing fallback to copying.
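One way to check this ahead of time is to compare the mount points that contain the two directories. The helpers below are hypothetical and only a rough sketch: they parse df -P output and assume mount points without spaces, and the default paths are placeholders to be replaced with your own cache and environment locations.

```bash
# Hypothetical helper: print the mount point that contains a path.
# With df -P the mount point is the last column of the second line.
mount_point_of() {
    df -P "$1" | awk 'NR == 2 { print $NF }'
}

# Succeeds when both paths live under the same mount point.
same_filesystem() {
    [ "$(mount_point_of "$1")" = "$(mount_point_of "$2")" ]
}

# Placeholder paths; set CACHE_DIR and ENV_DIR to your real locations.
cache_dir=${CACHE_DIR:-$PWD}
env_dir=${ENV_DIR:-/tmp}
if same_filesystem "$cache_dir" "$env_dir"; then
    echo "same file system: hardlinks are possible"
else
    echo "different file systems: expect a fallback to copying"
fi
```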
To change the cache location with Pixi, set the PIXI_CACHE_DIR environment variable (or RATTLER_CACHE_DIR, which applies to all rattler-based tools). To move all environments onto the same file system as well, use Pixi's "detached environments" mode (see the Pixi documentation).
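For example, the cache could be redirected in a shell profile as follows. The DEV_ROOT variable and its path are placeholders invented for this sketch; only PIXI_CACHE_DIR, RATTLER_CACHE_DIR, and pixi info come from the text above.

```bash
# Placeholder: a directory on the drive or partition that will also hold
# the environments. Adjust to your system.
DEV_ROOT=${DEV_ROOT:-/path/to/dev-drive}

# Cache location for Pixi.
export PIXI_CACHE_DIR="$DEV_ROOT/pixi-cache"
# Alternative that applies to all rattler-based tools:
# export RATTLER_CACHE_DIR="$DEV_ROOT/rattler-cache"

# Confirm which cache directory Pixi reports, if Pixi is installed.
if command -v pixi >/dev/null 2>&1; then
    pixi info
fi
```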
Conclusion
Package linking in Conda represents a careful balance between disk efficiency, installation speed, and environment isolation. Understanding these mechanisms helps in making informed decisions about file system configuration and troubleshooting installation issues. As file systems evolve and reflink support becomes more prevalent, the efficiency gap between different platforms will narrow, benefiting all Pixi (and hopefully soon also Conda) users regardless of their operating system choice.
