Blend2D Development Switching to Zen 5

AMD started shipping Zen 5 CPUs in August 2024, but it took me some time to finally get my hands on this micro-architecture and to switch the development of Blend2D to Zen 5 on X86_64. The primary reason for the switch is to develop code that works well with a full 512-bit implementation of AVX-512, which Zen 5 provides. I have been waiting for the right moment, and, coincidentally, all Blend2D developers switched to Zen 5 in March 2025.

Introduction

The Zen 5 micro-architecture adds a native 512-bit data-path, which means that instructions operating on 512-bit vectors are no longer split into two independent 256-bit operations. This translates to a theoretical performance increase of 2x for existing AVX-512 code. Zen 5 also adds two more integer Arithmetic Logic Units (ALUs), which means that even generic code should see some IPC improvement, provided it can take advantage of them.

Zen 5 vs Zen 4

Here is a little summary of important Zen 5 micro-architecture improvements compared to Zen 4:

  • The count of integer ALUs increased from 4 to 6, so even general-purpose code should see some improvement if there is enough parallelism. Theoretically the CPU should be able to execute 6 ALU operations per cycle, but nobody has been lucky enough to see that in practice. My own measurements show that it's easy to hit 5 operations per cycle, but almost impossible to go above that. In Blend2D's case this is okay, as most code is not able to saturate the integer ALUs anyway.
  • The AVX-512 data-path was widened from 256 bits to 512 bits; however, the minimum latency of AVX-512 operations increased from 1 to 2 cycles. This means there must be enough independent operations in flight to hide the increased average latency, otherwise the throughput gains would be countered by it.
  • AVX-512 performance is limited by 10 input data ports, so at most 10 ZMM registers can be read per cycle; thus it's not possible to execute 4 FMAs or 4 TERNLOGs per cycle (each reads 3 inputs). This is most likely fine, as even FMA-intensive code needs other instructions as well (like floating-point additions or multiplications).
  • AVX-512 code can execute 2 loads and 1 store per cycle (512-bit, aligned). Since most workloads in Blend2D are CPU bound, this is enough.
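To illustrate the latency point above: a reduction with a single accumulator is limited by the latency of each addition, while splitting it into independent accumulators keeps more operations in flight. This is a portable scalar sketch of the technique (function names are mine, not Blend2D's); a vectorizing compiler applies the same idea to ZMM registers.

```cpp
#include <cstdint>
#include <cstddef>

// Latency-bound: each addition depends on the previous one, so
// throughput is limited to one add per add-latency.
uint64_t sum_single(const uint32_t* data, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += data[i];
    return acc;
}

// Four independent dependency chains - the CPU can overlap their
// latencies, which is exactly what the wider but higher-latency
// AVX-512 pipes on Zen 5 need (more operations in flight).
uint64_t sum_unrolled(const uint32_t* data, size_t n) {
    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += data[i + 0];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }
    for (; i < n; i++)  // scalar tail
        a0 += data[i];
    return a0 + a1 + a2 + a3;
}
```

Both functions compute the same sum; only the dependency structure differs.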


AVX-512 Notes

Since Zen 5 is all about AVX-512, let's talk about its future. I often hear opinions such as "512-bit registers are overkill" or "it's better to use a GPU for the kind of workload that benefits from AVX-512". I think these are myths spread by people who don't really understand why AVX-512 exists, so let me highlight what I think is important:

  • The X86 ISA is byte-based (a single instruction can be 1 to 15 bytes long), so decoding instructions in parallel is a tricky business. The decoders employed by today's X86 hardware are complex beasts, and many of them are needed to decode at least 4 instructions per cycle (for reference, Apple CPUs can decode 8 without this complexity). In general, it's not possible to find the position of the next instruction just by reading the first byte in the instruction stream, so the CPU runs many decoders in parallel, and the work of decoders that didn't start at a correct position is simply trashed. AVX-512 helps here: if software uses 512-bit vectors, the decoders have less work to do, because one instruction using 512-bit registers can replace two instructions using 256-bit registers. The proof is actually the Zen 4 micro-architecture, which uses double pumping (so it has 256-bit ALUs), yet AVX-512 code is still faster and uses less energy than equivalent AVX2 code.
  • AVX-512 provides very important features that are useful beyond typical SIMD programming tasks. The X86 architecture is known for providing a very generous shuffling framework, and AVX-512 continues in this direction. Instructions like VPCOMPRESSB, VPERMB, VPERMI2B, VPTERNLOG[D|Q], and those provided by the GFNI extension are simply very hard to emulate once you get used to them and have no equivalents at your disposal.
  • AVX-512 masking is another important feature that reduces the work required to write lead & tail loops - simply mask out what you don't need and the CPU does what you asked for. This can of course be emulated by jumps or other techniques (and Blend2D does that on non-AVX-512 targets), but the emulation is always much longer and slower than simply using masked loads and stores.
  • AVX-512 provides instructions that were missing in AVX2, and no other CPU extension offers these operations in the AVX/VEX-encoded instruction space. There is simply nothing new coming to AVX2; it will be legacy soon.
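To show why an instruction like VPTERNLOG is hard to replace: it evaluates any three-input boolean function in a single instruction, selected by an 8-bit truth table. The scalar model below (my own helper, written for illustration) captures its per-bit semantics.

```cpp
#include <cstdint>

// Scalar model of VPTERNLOGD semantics: for every bit position, the
// bits of a, b, c form a 3-bit index into the 8-bit truth table
// `imm`, and the selected truth-table bit becomes the result bit.
uint32_t ternlog(uint32_t a, uint32_t b, uint32_t c, uint8_t imm) {
    uint32_t r = 0;
    for (int bit = 0; bit < 32; bit++) {
        unsigned idx = (((a >> bit) & 1u) << 2)
                     | (((b >> bit) & 1u) << 1)
                     |  ((c >> bit) & 1u);
        r |= ((uint32_t(imm) >> idx) & 1u) << bit;
    }
    return r;
}

// Useful truth-table constants:
//   0x96 = a ^ b ^ c             (three-way XOR in one instruction)
//   0xE8 = majority(a, b, c)
//   0xCA = (a & b) | (~a & c)    (bit-select: a chooses b or c)
```

The hardware does this for all 512 bits of a ZMM register at once; emulating an arbitrary three-input function without it typically costs several AND/OR/XOR instructions and extra registers.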

To add more practical examples: AVX-512 can be used to decode JSON, to help with compression and decompression, to read binary formats such as TrueType outlines, etc... And these are just a few cherry-picked examples relevant to the work that I do. I cannot really reveal what I do in my paid time, so these must do...
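The masking point above can be sketched in portable C++. With AVX-512, a loop's tail is handled by computing a k-mask such as `(1u << remaining) - 1` and issuing masked loads and stores; the helpers below are my own scalar stand-ins for per-byte masked stores (the real code would use `VMOVDQU8 {k}`-style instructions), shown with a 16-lane width for brevity.

```cpp
#include <cstdint>
#include <cstddef>

// Scalar stand-in for a masked 16-lane byte store: lane i is written
// only when bit i of `mask` is set, like a masked ZMM/XMM store.
void mask_store_16(uint8_t* dst, const uint8_t* src, uint16_t mask) {
    for (int i = 0; i < 16; i++)
        if ((mask >> i) & 1u)
            dst[i] = src[i];
}

// Copy `n` bytes in 16-byte chunks; the tail needs no per-element
// branching and no scalar fallback loop - just a narrower mask.
void copy_bytes(uint8_t* dst, const uint8_t* src, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        mask_store_16(dst + i, src + i, 0xFFFFu);       // full mask
    if (size_t rem = n - i)
        mask_store_16(dst + i, src + i,
                      uint16_t((1u << rem) - 1u));      // tail mask
}
```

The masked form never touches bytes past `n`, which is exactly what makes AVX-512 tails both safe and short compared to branchy emulation.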

AVX-512 vs GPU?

Regarding using GPUs instead of AVX-512 - that's probably the biggest misunderstanding of what AVX-512 was designed for. AVX-512 is a commodity: an instruction set designed to be used and mixed with your existing code without changing its whole architecture - and that's its biggest advantage. GPUs are simply much more difficult to program and to integrate into existing code-bases. And when it comes to the cloud, GPUs are much more pricey, as you are leaving the commodity space and entering premium territory. So my opinion is that if a workload can benefit from AVX-512, it's much better and cheaper to utilize AVX-512 than GPUs.

Zen 4 vs Zen 5 Performance Comparison

The following chart visualizes results obtained with the bl_bench tool on both 7950X and 9950X CPUs running on the same motherboard and using the same RAM. The version of Blend2D used for benchmarking was 0.12.0. This chart will never be updated, as I no longer have a Zen 4 CPU.

Conclusion

Based on the results I would conclude that Zen 5 delivers. Pure composition workloads (such as rectangle fills, large gradient fills, etc...) improved 2x when rendering areas of at least 32x32 pixels (fewer pixels translate to a smaller improvement). However, workloads that involve heavy vector geometry processing followed by rasterization didn't improve much; the reason is that AVX-512 code is only used by the optimized 2D pipelines (which includes composition).

Future Work

It's obvious that in order to harness the power of today's and future X86 CPUs it's important to use more AVX-512 code. I plan to look into the following areas to improve performance:

  • Stroking - a new stroking engine is on its way, and it will most likely use some SIMD to reduce the overall latency of stroking.
  • Edge building - at the moment edge building is a scalar process with many branches. The edge builder does 3 things - it clips geometry, it flattens curves, and it creates a data structure (an edge list) that is later used by the rasterizer. There is a possibility to use SIMD in the clipper itself, as it's adaptive and pretty isolated; and we have talked a lot in our Blend2D chat about using SIMD for curve flattening, which should be possible.
  • Rasterization - at the moment the rasterizer doesn't use SIMD, because that's hard to do in the current design. However, it should be possible to use SIMD to rasterize multiple edges at once, so I'm planning to do a bit of research in this area, as I think it would benefit all workloads where path/polygon rasterization is involved.
  • OpenType - the OpenType processing pipeline currently uses only scalar code, but it was designed to be switched to SIMD in the future, so why not push this further?