Building CPU optimized packages for conda-forge

Modern CPUs have powerful features that can make your code run significantly faster. One of these features is called Single Instruction, Multiple Data (SIMD) instruction sets. A recent article released by the authors of the open-source physics library box2d showcases how using modern CPU instruction sets can make a huge difference in performance. The article achieves a 2x speedup specifically, by using something called "Single Instruction, Multiple Data" or SIMD instructions, we'll get to what this actually means in a bit.

Some examples of libraries specifically targeting these instructions and yielding a considerable improvement are¹:

OpenCV color conversion functionality, ~25x faster on ARM CPUs with Neon: opencv#19883
PyTorch softmax, min and max 3x-4x faster for bfloat16 with AVX2/AVX512 on x86-64: pytorch#55202, and up to 2x-10x with uint8 for +, >>, min: pytorch#89284
As mentioned in the introduction, ~2.5x faster speed-up of 2D collision checking.

While certain specialized libraries like NumPy and PyTorch have always made use of the full potential of your hardware by using dynamic dispatching, other libraries need to be compiled with the right flags to enable these optimizations. Conda-forge (like many other software distributions) aims to be as compatible with not-so-recent hardware as possible, so these optimizations are not enabled by default.

But recently, it became possible to target newer CPU instruction sets on conda-forge directly! Let's quickly cover what SIMD and related terms mean and go over the basics on how to build cpu-optimized packages using either conda-build or rattler-build.

What does SIMD/AVX/NEON even mean?

CPUs execute instructions to perform tasks, and modern CPUs support instruction sets allow enabling processing multiple data points in parallel—which is why the instruction set is called "Single Instruction, Multiple Data" or SIMD.

There are a number of different instruction sets that can be categorized as SIMD, the key ones are:

SSE (Streaming SIMD Extensions): An older instruction set that allows the CPU to perform the same operation on multiple data points at once.
AVX (Advanced Vector Extensions): A more advanced instruction set that extends SSE with more powerful operations for faster data processing.
Neon: An ARM-specific SIMD instruction set found in Apple Silicon, mobile and embedded devices.

Using these SIMD instructions can greatly improve code performance by reducing the number of instructions needed for data processing. Compilers can automatically leverage these SIMD instructions, but your CPU must support the specific sets. Some SIMD sets have been available for years, newer ones may not be supported on all hardware, particularly older devices.

Libraries like NumPy make extensive use of these instructions. These instructions can be enabled in the following ways:

Runtime Selection: Code is compiled for multiple hardware targets, and the best version is chosen at runtime. This approach can boost performance but requires complex engineering and increases the package size.
Just-In-Time Compilation: Libraries like Numba or Pythran can compile code at runtime in order to optimize the code for the specific hardware. This approach can be very powerful but requires additional (large!) dependencies.
Installation Time Selection: The best compiled program is selected during installation, reducing complexity and package size while optimizing for the specific hardware. This approach is supported by the conda-forge ecosystem, simplifying the process and still optimizing performance.

How can I use it today?

If a package that you are maintaining or interested in is available on conda-forge, you can enable these optimizations. By adding these sections to the meta.yaml or recipe.yaml in the conda-forge feedstock, you can start making use of the optimizations today

Recipe v0 (meta.yaml):

{% set build = 0 %}

build:
  # Prioritize builds with a higher microarch level.
  number: {{ build }}          # [not (unix and x86_64)]
  number: {{ build + 100 }}    # [unix and x86_64 and microarch_level == 1]
  number: {{ build + 300 }}    # [unix and x86_64 and microarch_level == 3]
  number: {{ build + 400 }}    # [unix and x86_64 and microarch_level == 4]

requirements:
  build:
    - x86_64-microarch-level {{ microarch_level }}  # [unix and x86_64]
    - {{ compiler('c') }}

microarch_level:  # [unix and x86_64]
  - 1  # [unix and x86_64]
  - 3  # [unix and x86_64]
  - 4  # [unix and x86_64]

Recipe v1 (recipe.yaml):

context:
  build: 0

build:
  # Prioritize builds with a higher microarch level.
  number: ${{ build|int + (microarch_level|int) * 100 }}

requirements:
  build:
    - if: microarch_level|int > 0
      then: x86_64-microarch-level ${{ microarch_level }}
    - ${{ compiler('c') }}

microarch_level:
  - if: not(unix and x86_64)
    then:
      - 0
    else:
      - 1
      - 3
      # Level 4 might not be available on the build machine
      # - 4

What does the above code adds for both recipe.yaml and meta.yaml?

The multiple build numbers allows the solver to prioritize these variants if these are available. Newer architectures get a higher build number to prioritize those over older architectures. E.g AVX gets a higher build number than SSE.
A requirement is added on the microarch package that makes sure that the required compiler flags are set and the package will only run on hardware that supports it.

Also refer to the conda-forge knowledge base on this. This is all that's needed, to enable the users of the package to make use of the compiler optimizations. For a recently merged example in a real-life recipe see the following PR

Note

As you can see in the example, only unix-like operating systems and x86_64 architectures are supported by conda-forge at the time of writing. The CI runners also do not guarantee level=4 for x86_64 so you can only use level<=3 to build.

For more information and a possible workaround see: https://github.com/conda-forge/microarch-level-feedstock/issues/5

Note

Due to a bug, (micro)mamba does not properly report the __archspec virtual package. As a consequence of this, the packages build through this method can not be installed with (micro)mamba.

This issue has already been resolved but as of writing has not yet been released.

See this issues for more information: https://github.com/conda-forge/microarch-level-feedstock/issues/10

Conclusions

As you can see it is fairly straightforward! If any of the packages you maintain benefit from SIMD operations you might want to give this a try!

To recap, to enable SIMD optimizations in your conda-forge package:

Add the x86_64-microarch-level package as a build requirement.
Set the build number based on the microarch_level in the meta.yaml or recipe.yaml.
Add the microarch_level key to the conda_build_config.yaml or variants.yaml file.

As always, feel free to ask us any questions. You can join our Discord and have a chat about building your packages, reach us on X or follow projects on our GitHub.

The Footnotes

These numbers are partially taken from: pypackaging-native

Written on September 6, 2024 by:

Bas Zalmstra