
It's been some time since I wrote about a High-Performance QOI Codec, which joined the other codecs offered by the Blend2D library in 2024. The development of image codecs has continued, and now I would like to announce a new high-performance PNG codec, which is much faster than other available codecs written in C, C++, and other programming languages.
PNG is not new; it's been with us for decades and it's most likely one of the most widely deployed image formats ever, which also means that many others have focused on improving the quality of PNG codecs. And indeed, there are many libraries that provide speedy PNG decoding, encoding, or both.
We can even find pretty interesting statements such as "Memory-safe PNG decoders now vastly outperform C PNG libraries", etc... However, PNG decoding performance is fortunately not about the programming language!
The biggest challenge in decoding PNG images is the DEFLATE compression algorithm, which is used to compress pixel data and is the only algorithm allowed by the PNG specification (at least in 2025). DEFLATE has aged and is not well suited for modern CPUs, which can execute multiple instructions in a single clock cycle in addition to offering various other features such as out-of-order execution, branch prediction, etc... In general, everybody who has implemented a DEFLATE decoder has found a ceiling - at some point it's not possible to improve it further, because the format is just inherently scalar and it's almost impossible to use any kind of SIMD to process the bit-stream.
As a reference, I would like to mention Iguana, which clearly shows that in order for decoders to utilize SIMD hardware, the layout of the compressed data must be designed for it. And that's something unfixable when it comes to DEFLATE, because changing the layout would make it incompatible.
In general, high-performance decoding of a DEFLATE stream comes down to the points discussed below.
Since a DEFLATE code (both litlen and offset) can be at most 15 bits long, it's not practical to use a single lookup table for all codes, which is why sub-tables are necessary. Conveniently, compression is all about statistics, so long codes are much less likely than short codes, which means that a branch misprediction is not really a problem when a sub-table is hit, as it won't be hit that often if the decode tables are sized well. However, the challenge is to find the optimal size of these tables, because big tables are slower to build, and smaller tables would mean more sub-table lookups. It seems that the sweet spot for high-performance DEFLATE decoding is 11 bits for the litlen table and 9 bits for the offset table (the 11/9 split now seems to be widely used across many decoders), but this is just the beginning when optimizing DEFLATE.
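To make this more concrete, here is a minimal sketch of such a two-level lookup in C++. The entry format, the flag, and the field placement are hypothetical and for illustration only - Blend2D's actual layout differs:

#include <stdint.h>

// Hypothetical flag marking entries that point to a sub-table.
static constexpr uint32_t kSubTableFlag = 0x80000000u;

static uint32_t lookupLitlen(const uint32_t* table, uint64_t bitBuf) {
  uint32_t entry = table[bitBuf & 0x7FFu];          // primary lookup - low 11 bits
  if (entry & kSubTableFlag) {                      // rare: code longer than 11 bits
    uint32_t subBase = entry >> 16;                 // hypothetical sub-table offset
    uint32_t subBits = (entry >> 8) & 0xFu;         // hypothetical sub-table index width
    entry = table[subBase + ((bitBuf >> 11) & ((1u << subBits) - 1u))];
  }
  return entry;
}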
So, what is the most important part of a DEFLATE decoder? It's the fast loop! A loop that does bounds checking at the beginning so it doesn't have to do it while decoding litlen and offset codes. And to run the fast loop for multiple iterations, it's better to precalculate the maximum number of safe iterations up front and only recalculate it once that number is exhausted. To make the fast loop fast, it's important to optimize for latency, which means using various tricks such as reading bits with zero refill latency.
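As an illustration, here is the well-known branchless refill from Fabian Giesen's "Reading bits in far too many ways" series, which keeps at least 56 valid bits in the buffer. It assumes a little-endian target and at least 8 readable bytes at src - something the fast loop's up-front bounds check guarantees:

#include <stdint.h>
#include <string.h>

static inline void refill(const uint8_t*& src, uint64_t& bitBuf, uint32_t& bitCount) {
  uint64_t next;
  memcpy(&next, src, 8);             // unaligned 8-byte load
  bitBuf |= next << bitCount;        // splice new bits above the unconsumed ones
  src += (63 - bitCount) >> 3;       // advance by the number of whole bytes consumed
  bitCount |= 56;                    // at least 56 valid bits are now available
}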
Larger decode tables are fine when there are enough bits to decode; however, if the bit-stream is small, building large decode tables could take much more time than the actual decoding of the bit-stream. The solution is to make the table bits dynamic and to use fewer bits than the maximum when possible. Some bigger PNG files are composed of multiple DEFLATE blocks (even dozens), and each block defines its own Huffman tables, thus table building should be pretty quick.
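To illustrate the idea, a size-based policy could look like the sketch below - the thresholds and the exact policy are hypothetical, not Blend2D's actual numbers:

#include <stddef.h>
#include <stdint.h>

// Hypothetical heuristic: shrink the primary litlen table for tiny streams
// so that building the table doesn't dominate the total decoding time.
static uint32_t chooseLitlenTableBits(size_t remainingCompressedBytes) {
  if (remainingCompressedBytes < 1024) return 8;   // tiny block - small table
  if (remainingCompressedBytes < 8192) return 9;   // medium block
  return 11;                                       // large block - full-size table
}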
Now I would like to highlight something. Most DEFLATE implementations compare performance against a corpus that doesn't represent compressed PNG data. For example, the Silesia Compression Corpus is widely used to test implementations of compression and decompression tools, but it doesn't provide data that matches how PNGs are usually compressed. This means that tuning a DEFLATE decoder to perform best when decoding the Silesia Corpus may not be best for decoding PNG files. The gap from optimal performance can of course vary depending on the input images.
For reference, here is a small visual comparison of the structure of a bit-stream as you would see it in the Silesia Corpus versus in regular PNG images:
Compressed stream of a file from the Silesia Corpus:
[Beginning of Data]
LITERAL
LITERAL
...
[After Literals]
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LITERAL
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LITERAL
LEN+OFFSET
LEN+OFFSET
LITERAL
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LITERAL
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
...
Compressed stream of a PNG file (from testing images):
[Beginning of Data]
LITERAL
LITERAL
...
[After Literals]
LEN+OFFSET
LEN+OFFSET
LITERAL
LITERAL
LITERAL
LITERAL
LEN+OFFSET
LITERAL
LITERAL
LEN+OFFSET
LITERAL
LEN+OFFSET
LEN+OFFSET
LITERAL
LITERAL
LITERAL
LEN+OFFSET
LITERAL
LEN+OFFSET
LITERAL
LITERAL
LITERAL
LEN+OFFSET
LITERAL
LEN+OFFSET
LEN+OFFSET
LITERAL
LITERAL
LEN+OFFSET
LEN+OFFSET
LITERAL
LITERAL
LEN+OFFSET
LITERAL
LEN+OFFSET
LITERAL
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LITERAL
LITERAL
LITERAL
LITERAL
LEN+OFFSET
...
The beginning of each bit-stream is usually pretty similar - a lot of literals, which are used to decode the initial sequence of bytes. However, after the initial sequence, what comes next varies between compressed files. When decoding files from the Silesia Corpus, LEN+OFFSET codes usually prevail, so tuning the decoder to quickly decode & copy matches is generally what speeds it up. However, PNG images have many more literals even after the initial sequence, often interleaved with LEN+OFFSET codes, so the decoder is challenged much more. Based on this information I started thinking differently - in order to improve performance I had to find something others are not doing, especially when it comes to decoding literals!
At the moment, the most-used DEFLATE decoders use 32-bit entries in their decode tables. Such an entry can hold a lot of data, and it's usually used to hold additional metadata such as the base offset, base length, literal value, the number of extra length/offset bits, etc... When it comes to literals, such an entry is under-utilized: a literal is 8 bits, and the other common metadata fits in 8 to 16 bits (depending on the layout), which leaves room for up to 3 literals per decode entry.
It turned out that a limit of 3 literals per decode entry was too much - 11 bits usually won't fit 3 literals, and table building suffered a significant penalty. However, a pair of literals per entry is achievable, and many pairs actually fit. So I settled on a maximum of 2 literals per entry, which seems to be the sweet spot - since the maximum table size is 11 bits, a single lookup can decode 2 literals that have at most 11 bits combined, which is likely to happen, especially in PNG images that use filtering before compression.
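To illustrate why this pays off, here is a minimal sketch of emitting such an entry - one table lookup, one store, and a branchless pointer advance cover both the single-literal and the pair case. The entry layout and the flag are hypothetical; the unconditional 2-byte store assumes the output buffer has slack, which fast loops typically guarantee:

#include <stdint.h>

// Hypothetical layout: low 16 bits hold one or two literal bytes, and a flag
// bit marks entries that carry a pair.
static constexpr uint32_t kPairFlag = 0x01000000u;

static inline uint8_t* emitLiterals(uint8_t* dst, uint32_t entry) {
  dst[0] = uint8_t(entry & 0xFFu);              // first literal
  dst[1] = uint8_t((entry >> 8) & 0xFFu);       // second literal (ignored if unused)
  return dst + 1 + ((entry & kPairFlag) != 0);  // branchless advance by 1 or 2
}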
The PNG image format uses chunks, where each chunk has a tag and a payload. There are many tags that can be stored in a PNG image, but for the purpose of decoding let's talk about IDAT. It's the chunk that contains compressed data; however, for some reason PNG allows splitting the compressed data into multiple IDAT chunks, so it's not possible to use a DEFLATE decoder that needs all input data to be contiguous (such as libdeflate). This also dictates that in order to quickly decode PNG images, the DEFLATE decoder must support streaming.
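For illustration, walking the chunks and feeding each IDAT payload into a streaming inflater could look like the sketch below (inflateChunk is a hypothetical callback standing in for the streaming DEFLATE decoder). Each chunk consists of a 4-byte big-endian length, a 4-byte tag, the payload, and a 4-byte CRC:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static inline uint32_t readU32BE(const uint8_t* p) {
  return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
         (uint32_t(p[2]) <<  8) |  uint32_t(p[3]);
}

static void forEachIDAT(const uint8_t* data, size_t size,
                        void (*inflateChunk)(const uint8_t* payload, size_t length)) {
  size_t i = 8;                                // skip the 8-byte PNG signature
  while (i + 12 <= size) {                     // length + tag + CRC
    uint32_t length = readU32BE(data + i);
    if (i + 12 + length > size)
      break;                                   // truncated file
    if (memcmp(data + i + 4, "IDAT", 4) == 0)
      inflateChunk(data + i + 8, length);      // payload is a DEFLATE fragment
    i += 12 + length;                          // advance past payload + CRC
  }
}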
A match in a DEFLATE stream can have a length greater than its offset, which means that such a match can be used to repeat an existing sequence of bytes, or even a single byte (when the match offset is 1). This is a nightmare from an implementer's perspective, because if you don't want to copy data byte-by-byte, the algorithm must be able to handle such overlapping repeats. This is actually an ideal place for utilizing SIMD (for example, VPSHUFB can be used to implement the repeat on X86 hardware, and ARM has a similar instruction that can be used in the same context).
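Here is a minimal sketch of that idea for offsets smaller than 16 on X86 (SSSE3). It assumes the destination buffer has some writable slack past the match end, which fast loops typically guarantee via their up-front bounds checks:

#include <immintrin.h>
#include <stdint.h>

static void copyRepeat(uint8_t* dst, uint32_t offset, uint32_t len) {
  // Shuffle indices so that output byte i receives pattern byte (i % offset).
  uint8_t idx[16];
  for (uint32_t i = 0; i < 16; i++)
    idx[i] = uint8_t(i % offset);

  __m128i pattern  = _mm_loadu_si128((const __m128i*)(dst - offset));
  __m128i repeated = _mm_shuffle_epi8(pattern, _mm_loadu_si128((const __m128i*)idx));

  // Step by the largest multiple of `offset` that fits into 16 bytes so the
  // pattern phase stays aligned between consecutive stores.
  uint32_t step = 16u - (16u % offset);
  for (uint32_t i = 0; i < len; i += step)
    _mm_storeu_si128((__m128i*)(dst + i), repeated);
}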
Based on all of the above, we can actually draft an ideal pipeline for decoding the DEFLATE streams that carry PNG data.
Before showing the performance of decoding PNG images, let's first see where the DEFLATE decoder stands when it comes to the Silesia Corpus. It's not the corpus I have focused on; however, it's an important corpus that is used worldwide for compression benchmarks. For this purpose I have written a tool that uses Blend2D's internal API for the moment (the DEFLATE decoder is currently internal and not part of the public API). The results are shown below:
./bl-bench-deflate
File dickens of original size 10192446 compressed to 3736325 (level 8)
[zlib ] dickens : Decompress 20.108 [ms] [177.2 compressed MiB/s]
[libdeflate] dickens : Decompress 7.521 [ms] [473.8 compressed MiB/s]
[blend2d ] dickens : Decompress 7.697 [ms] [462.9 compressed MiB/s]
File mozilla of original size 51220480 compressed to 18618185 (level 8)
[zlib ] mozilla : Decompress 98.561 [ms] [180.1 compressed MiB/s]
[libdeflate] mozilla : Decompress 42.278 [ms] [420.0 compressed MiB/s]
[blend2d ] mozilla : Decompress 45.036 [ms] [394.3 compressed MiB/s]
File mr of original size 9970564 compressed to 3562465 (level 8)
[zlib ] mr : Decompress 17.078 [ms] [198.9 compressed MiB/s]
[libdeflate] mr : Decompress 7.202 [ms] [471.7 compressed MiB/s]
[blend2d ] mr : Decompress 7.799 [ms] [435.6 compressed MiB/s]
File nci of original size 33553445 compressed to 3294801 (level 8)
[zlib ] nci : Decompress 25.200 [ms] [124.7 compressed MiB/s]
[libdeflate] nci : Decompress 8.233 [ms] [381.7 compressed MiB/s]
[blend2d ] nci : Decompress 7.916 [ms] [397.0 compressed MiB/s]
File ooffice of original size 6152192 compressed to 3033058 (level 8)
[zlib ] ooffice : Decompress 16.604 [ms] [174.2 compressed MiB/s]
[libdeflate] ooffice : Decompress 7.268 [ms] [398.0 compressed MiB/s]
[blend2d ] ooffice : Decompress 7.399 [ms] [390.9 compressed MiB/s]
File osdb of original size 10085684 compressed to 3653950 (level 8)
[zlib ] osdb : Decompress 15.924 [ms] [218.8 compressed MiB/s]
[libdeflate] osdb : Decompress 6.000 [ms] [580.8 compressed MiB/s]
[blend2d ] osdb : Decompress 6.160 [ms] [565.6 compressed MiB/s]
File reymont of original size 6627202 compressed to 1759380 (level 8)
[zlib ] reymont : Decompress 10.834 [ms] [154.9 compressed MiB/s]
[libdeflate] reymont : Decompress 3.645 [ms] [460.3 compressed MiB/s]
[blend2d ] reymont : Decompress 3.759 [ms] [446.4 compressed MiB/s]
File samba of original size 21606400 compressed to 5330470 (level 8)
[zlib ] samba : Decompress 29.965 [ms] [169.7 compressed MiB/s]
[libdeflate] samba : Decompress 11.220 [ms] [453.1 compressed MiB/s]
[blend2d ] samba : Decompress 12.131 [ms] [419.0 compressed MiB/s]
File sao of original size 7251944 compressed to 5270353 (level 8)
[zlib ] sao : Decompress 15.699 [ms] [320.2 compressed MiB/s]
[libdeflate] sao : Decompress 8.757 [ms] [574.0 compressed MiB/s]
[blend2d ] sao : Decompress 8.951 [ms] [561.5 compressed MiB/s]
File webster of original size 41458703 compressed to 11993158 (level 8)
[zlib ] webster : Decompress 73.062 [ms] [156.5 compressed MiB/s]
[libdeflate] webster : Decompress 25.807 [ms] [443.2 compressed MiB/s]
[blend2d ] webster : Decompress 26.006 [ms] [439.8 compressed MiB/s]
File x-ray of original size 8474240 compressed to 5922522 (level 8)
[zlib ] x-ray : Decompress 22.894 [ms] [246.7 compressed MiB/s]
[libdeflate] x-ray : Decompress 12.116 [ms] [466.2 compressed MiB/s]
[blend2d ] x-ray : Decompress 14.015 [ms] [403.0 compressed MiB/s]
File xml of original size 5345280 compressed to 701018 (level 8)
[zlib ] xml : Decompress 5.082 [ms] [131.5 compressed MiB/s]
[libdeflate] xml : Decompress 1.623 [ms] [412.0 compressed MiB/s]
[blend2d ] xml : Decompress 1.677 [ms] [398.6 compressed MiB/s]
As can be seen, Blend2D's DEFLATE decompressor is very close to libdeflate in terms of performance, though it's not faster when decoding compressed data from the Silesia Corpus. However, let's benchmark how it performs when used to decode PNG images:
[libpng ] png/harvesters.png : Decode 9.665 [ms]
[wuffs ] png/harvesters.png : Decode 7.984 [ms]
[spng ] png/harvesters.png : Decode 8.807 [ms]
[stbimage] png/harvesters.png : Decode 12.236 [ms]
[blend2d ] png/harvesters.png : Decode 5.169 [ms]
[libpng ] png/hibiscus.primitive.png: Decode 0.803 [ms]
[wuffs ] png/hibiscus.primitive.png: Decode 0.690 [ms]
[spng ] png/hibiscus.primitive.png: Decode 0.647 [ms]
[stbimage] png/hibiscus.primitive.png: Decode 0.684 [ms]
[blend2d ] png/hibiscus.primitive.png: Decode 0.225 [ms]
[libpng ] png/hibiscus.regular.png : Decode 1.627 [ms]
[wuffs ] png/hibiscus.regular.png : Decode 1.195 [ms]
[spng ] png/hibiscus.regular.png : Decode 1.493 [ms]
[stbimage] png/hibiscus.regular.png : Decode 1.631 [ms]
[blend2d ] png/hibiscus.regular.png : Decode 0.679 [ms]
[libpng ] png/medium_rgb8.png : Decode 14.313 [ms]
[wuffs ] png/medium_rgb8.png : Decode 11.275 [ms]
[spng ] png/medium_rgb8.png : Decode 12.811 [ms]
[stbimage] png/medium_rgb8.png : Decode 16.698 [ms]
[blend2d ] png/medium_rgb8.png : Decode 6.148 [ms]
[libpng ] png/medium_rgba8.png : Decode 19.539 [ms]
[wuffs ] png/medium_rgba8.png : Decode 19.882 [ms]
[spng ] png/medium_rgba8.png : Decode 18.067 [ms]
[stbimage] png/medium_rgba8.png : Decode 22.840 [ms]
[blend2d ] png/medium_rgba8.png : Decode 9.288 [ms]
[libpng ] png/large_rgb8.png : Decode 92.882 [ms]
[wuffs ] png/large_rgb8.png : Decode 74.817 [ms]
[spng ] png/large_rgb8.png : Decode 83.422 [ms]
[stbimage] png/large_rgb8.png : Decode 95.808 [ms]
[blend2d ] png/large_rgb8.png : Decode 41.997 [ms]
[libpng ] png/large_rgba8.png : Decode 110.803 [ms]
[wuffs ] png/large_rgba8.png : Decode 114.620 [ms]
[spng ] png/large_rgba8.png : Decode 102.576 [ms]
[stbimage] png/large_rgba8.png : Decode 122.232 [ms]
[blend2d ] png/large_rgba8.png : Decode 52.625 [ms]
[libpng ] png/large_palette.png : Decode 10.040 [ms]
[wuffs ] png/large_palette.png : Decode 4.626 [ms]
[spng ] png/large_palette.png : Decode 8.586 [ms]
[stbimage] png/large_palette.png : Decode 5.328 [ms]
[blend2d ] png/large_palette.png : Decode 2.944 [ms]
[libpng ] qoi/dice.png : Decode 3.081 [ms]
[wuffs ] qoi/dice.png : Decode 2.719 [ms]
[spng ] qoi/dice.png : Decode 2.559 [ms]
[stbimage] qoi/dice.png : Decode 3.146 [ms]
[blend2d ] qoi/dice.png : Decode 1.191 [ms]
[libpng ] qoi/edgecase.png : Decode 0.084 [ms] (20x libpng warning: iCCP: known incorrect sRGB profile)
[wuffs ] qoi/edgecase.png : Decode 0.016 [ms]
[spng ] qoi/edgecase.png : Decode 0.069 [ms]
[stbimage] qoi/edgecase.png : Decode 0.023 [ms]
[blend2d ] qoi/edgecase.png : Decode 0.030 [ms]
[libpng ] qoi/kodim10.png : Decode 4.267 [ms]
[wuffs ] qoi/kodim10.png : Decode 2.853 [ms]
[spng ] qoi/kodim10.png : Decode 3.767 [ms]
[stbimage] qoi/kodim10.png : Decode 3.995 [ms]
[blend2d ] qoi/kodim10.png : Decode 1.738 [ms]
[libpng ] qoi/kodim23.png : Decode 3.828 [ms]
[wuffs ] qoi/kodim23.png : Decode 2.650 [ms]
[spng ] qoi/kodim23.png : Decode 3.421 [ms]
[stbimage] qoi/kodim23.png : Decode 3.737 [ms]
[blend2d ] qoi/kodim23.png : Decode 1.775 [ms]
[libpng ] qoi/qoi_logo.png : Decode 0.336 [ms]
[wuffs ] qoi/qoi_logo.png : Decode 0.311 [ms]
[spng ] qoi/qoi_logo.png : Decode 0.201 [ms]
[stbimage] qoi/qoi_logo.png : Decode 0.146 [ms]
[blend2d ] qoi/qoi_logo.png : Decode 0.072 [ms]
[libpng ] qoi/testcard.png : Decode 0.197 [ms]
[wuffs ] qoi/testcard.png : Decode 0.220 [ms]
[spng ] qoi/testcard.png : Decode 0.114 [ms]
[stbimage] qoi/testcard.png : Decode 0.122 [ms]
[blend2d ] qoi/testcard.png : Decode 0.056 [ms]
[libpng ] qoi/testcard_rgba.png : Decode 0.210 [ms]
[wuffs ] qoi/testcard_rgba.png : Decode 0.225 [ms]
[spng ] qoi/testcard_rgba.png : Decode 0.148 [ms]
[stbimage] qoi/testcard_rgba.png : Decode 0.134 [ms]
[blend2d ] qoi/testcard_rgba.png : Decode 0.060 [ms]
[libpng ] qoi/wikipedia_008.png : Decode 9.633 [ms]
[wuffs ] qoi/wikipedia_008.png : Decode 7.591 [ms]
[spng ] qoi/wikipedia_008.png : Decode 8.237 [ms]
[stbimage] qoi/wikipedia_008.png : Decode 9.841 [ms]
[blend2d ] qoi/wikipedia_008.png : Decode 3.220 [ms]
And here is a comparison of the new PNG decoder against the previous one (Blend2D only):
[bl-old] png/harvesters.png : Decode 10.584 [ms]
[bl-new] png/harvesters.png : Decode 5.169 [ms]
[bl-old] png/hibiscus.primitive.png: Decode 0.517 [ms]
[bl-new] png/hibiscus.primitive.png: Decode 0.225 [ms]
[bl-old] png/hibiscus.regular.png : Decode 1.417 [ms]
[bl-new] png/hibiscus.regular.png : Decode 0.679 [ms]
[bl-old] png/medium_rgb8.png : Decode 12.716 [ms]
[bl-new] png/medium_rgb8.png : Decode 6.148 [ms]
[bl-old] png/medium_rgba8.png : Decode 17.443 [ms]
[bl-new] png/medium_rgba8.png : Decode 9.288 [ms]
[bl-old] png/large_rgb8.png : Decode 78.050 [ms]
[bl-new] png/large_rgb8.png : Decode 41.997 [ms]
[bl-old] png/large_rgba8.png : Decode 93.093 [ms]
[bl-new] png/large_rgba8.png : Decode 52.625 [ms]
[bl-old] png/large_palette.png : Decode 5.550 [ms]
[bl-new] png/large_palette.png : Decode 2.944 [ms]
[bl-old] qoi/dice.png : Decode 2.406 [ms]
[bl-new] qoi/dice.png : Decode 1.191 [ms]
[bl-old] qoi/edgecase.png : Decode 0.026 [ms]
[bl-new] qoi/edgecase.png : Decode 0.030 [ms]
[bl-old] qoi/kodim10.png : Decode 3.262 [ms]
[bl-new] qoi/kodim10.png : Decode 1.738 [ms]
[bl-old] qoi/kodim23.png : Decode 3.236 [ms]
[bl-new] qoi/kodim23.png : Decode 1.775 [ms]
[bl-old] qoi/qoi_logo.png : Decode 0.218 [ms]
[bl-new] qoi/qoi_logo.png : Decode 0.072 [ms]
[bl-old] qoi/testcard.png : Decode 0.152 [ms]
[bl-new] qoi/testcard.png : Decode 0.056 [ms]
[bl-old] qoi/testcard_rgba.png : Decode 0.160 [ms]
[bl-new] qoi/testcard_rgba.png : Decode 0.060 [ms]
[bl-old] qoi/wikipedia_008.png : Decode 9.255 [ms]
[bl-new] qoi/wikipedia_008.png : Decode 3.220 [ms]
As can be seen above, the new PNG decoder is significantly faster than the previous one, and in this case the DEFLATE decompressor takes all the credit. We can use the perf tool to compare where most cycles were spent before and where they are spent now:
Blend2D (old DEFLATE):
73.19% bl-bench-png bl-bench-png [.] bl::Compression::Deflate::DeflateDecoder::_decode()
10.42% bl-bench-png bl-bench-png [.] bl::Png::Ops::inverseFilterSimdImpl<3u>()
7.88% bl-bench-png bl-bench-png [.] bl::Png::Ops::inverseFilterSimdImpl<4u>()
2.04% bl-bench-png [kernel.kallsyms] [k] clear_page_erms
1.71% bl-bench-png bl-bench-png [.] bl::Compression::Deflate::blDeflateBuildTable()
1.16% bl-bench-png bl-bench-png [.] bl_convert_premultiply_8888_leading_alpha_shufb_avx2()
0.80% bl-bench-png bl-bench-png [.] bl_convert_rgb32_from_rgb24_shufb_avx2()
0.41% bl-bench-png bl-bench-png [.] bl_convert_any_from_indexed8()
Blend2D (new DEFLATE):
46.47% bl-bench-png bl-bench-png [.] bl::Compression::Deflate::Fast::decode_AVX2()
17.86% bl-bench-png bl-bench-png [.] bl::Png::Ops::inverseFilterSimdImpl<3u>()
13.44% bl-bench-png bl-bench-png [.] bl::Png::Ops::inverseFilterSimdImpl<4u>()
6.15% bl-bench-png bl-bench-png [.] bl::Compression::Deflate::buildFastTable()
3.79% bl-bench-png [kernel.kallsyms] [k] clear_page_erms
2.54% bl-bench-png bl-bench-png [.] bl::Compression::Deflate::Decoder::decode()
1.97% bl-bench-png bl-bench-png [.] bl::Compression::Deflate::buildDecodeTable()
1.79% bl-bench-png bl-bench-png [.] bl_convert_premultiply_8888_leading_alpha_shufb_avx2()
1.19% bl-bench-png bl-bench-png [.] bl_convert_rgb32_from_rgb24_shufb_avx2()
0.67% bl-bench-png bl-bench-png [.] bl_convert_any_from_indexed8()
DEFLATE decompression still burns around ~57% of cycles overall, but that's significantly better than the ~75% it used to burn. And since one part of the whole pipeline is now more optimized, other parts start showing up in the profile as well (for example, it could be worth further optimizing the inverse PNG filter, which now burns ~30% of cycles).
One additional thing to mention is the Deflate::buildFastTable() function, which is used to enhance the initial decode table with literal pairs where applicable. This is actually a nice example of a trade-off: building such a table means burning cycles, but it can pay off if there are enough literals to decode. However, if there are not enough literals, those cycles are simply wasted.
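Conceptually the pass is simple: because DEFLATE codes are read LSB-first, once the first literal of table slot i consumes b1 bits, the known bits of the following code are simply i >> b1. Here is a sketch under the hypothetical entry format from the earlier snippets (the helpers encode that same made-up layout, not Blend2D's actual one):

#include <stdint.h>

// Helpers matching the hypothetical layout used earlier in this post.
static inline bool     isLiteral(uint32_t e) { return (e & 0xFF000000u) == 0; }
static inline uint32_t bitLen(uint32_t e)    { return (e >> 16) & 0xFu; }
static inline uint32_t literalOf(uint32_t e) { return e & 0xFFu; }
static inline uint32_t makePair(uint32_t l1, uint32_t l2, uint32_t bits) {
  return l1 | (l2 << 8) | (bits << 16) | 0x01000000u; // kPairFlag
}

static void buildFastTable(uint32_t* table, uint32_t tableBits) {
  uint32_t size = 1u << tableBits;
  for (uint32_t i = 0; i < size; i++) {
    uint32_t e1 = table[i];
    if (!isLiteral(e1))                          // only single-literal entries qualify
      continue;
    uint32_t b1 = bitLen(e1);
    uint32_t e2 = table[i >> b1];                // entry for the code that follows
    if (isLiteral(e2) && b1 + bitLen(e2) <= tableBits) // second literal fully determined?
      table[i] = makePair(literalOf(e1), literalOf(e2), b1 + bitLen(e2));
  }
}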
The reference implementation of the QOI decoder & encoder comes with a tool called qoibench. It's easy to compile and to add more codecs to, and that's something I was interested in. So I added a Blend2D codec there and ran the benchmark, and to my surprise Blend2D's PNG decoding is in many cases even faster than the reference QOI decoder (but never faster than Blend2D's optimized QOI codec). So here are some grand totals as reported by qoibench when run on the QOI benchmark suite, which can be downloaded here:
# Grand total for qoi_benchmark_suite/images/icon_64
decode ms encode ms decode mpps encode mpps size kb rate
libpng: 0.0 0.2 137.65 17.46 3 23.6%
blend2d-png: 0.0 0.1 311.06 73.63 4 25.3%
stbi: 0.0 0.2 153.62 19.97 4 27.9%
qoi: 0.0 0.0 482.51 392.37 4 28.7%
blend2d-qoi: 0.0 0.0 868.35 652.13 4 28.7%
# Grand total for qoi_benchmark_suite/images/icon_512
decode ms encode ms decode mpps encode mpps size kb rate
libpng: 1.4 10.1 181.90 26.01 51 5.0%
blend2d-png: 0.4 2.8 693.39 93.27 49 4.9%
stbi: 0.9 7.2 296.64 36.39 69 6.8%
qoi: 0.4 0.7 666.37 398.50 85 8.4%
blend2d-qoi: 0.2 0.3 1331.57 937.25 85 8.4%
# Grand total for qoi_benchmark_suite/images/photo_kodak
decode ms encode ms decode mpps encode mpps size kb rate
libpng: 4.1 72.9 97.08 5.40 648 56.3%
blend2d-png: 1.7 10.5 233.00 37.27 697 60.6%
stbi: 4.5 34.4 86.66 11.43 873 75.8%
qoi: 1.9 2.2 206.47 179.57 671 58.3%
blend2d-qoi: 1.0 1.4 381.22 282.69 671 58.3%
# Grand total for qoi_benchmark_suite/images/textures_photo
decode ms encode ms decode mpps encode mpps size kb rate
libpng: 10.2 165.1 102.41 6.35 1836 59.8%
blend2d-png: 5.3 25.0 197.11 41.94 2168 70.6%
stbi: 12.5 82.6 84.21 12.69 2334 76.0%
qoi: 4.7 5.1 220.91 207.54 1981 64.5%
blend2d-qoi: 2.5 3.5 414.78 301.93 1981 64.5%
# Grand total for qoi_benchmark_suite/images/screenshot_web
decode ms encode ms decode mpps encode mpps size kb rate
libpng: 29.9 326.6 271.32 24.88 2402 7.6%
blend2d-png: 10.6 88.7 767.99 91.65 2619 8.3%
stbi: 23.7 267.2 343.15 30.41 3076 9.7%
qoi: 12.8 21.8 632.90 373.04 2649 8.3%
blend2d-qoi: 6.9 9.8 1169.91 828.74 2649 8.3%
I honestly think that DEFLATE is not worth optimizing much further, because it simply wasn't designed for SIMD. The decoding is a completely scalar process: only after a bunch of bits gets decoded can you decode more. And here comes the latency - table lookups are scalar and depend on each other, thus it's virtually impossible to be clever about it and implement something significantly better than what's available today without bringing completely new ideas to the table.
One idea I had was to use the AVX-512 7-bit table lookup to decode the bit-stream in parallel (speculatively). Basically, if we only use 8 bits from the decode table, we can decode these 8 bits and the 63 consecutive combinations that follow (each combination shifted by one more bit relative to the initial lookup) with a pair of VPERMI2B/VPERMT2B instructions (it would require an X86_64 CPU with the AVX512_VBMI extension, so bye bye Intel!). This could possibly scale up to more bits (9 would be achievable) at the cost of more lookups - thankfully these lookups could be done in parallel and combined later to select the matching ones. However, I didn't proceed with this idea, as I'm not sure I would want to burn my time, which is very scarce now. Still, it could definitely replace memory loads with permutations done completely within ZMM registers, so theoretically the latency could improve a lot, provided the decoding doesn't exceed the limitation of this lookup (8 or 9 bits) too often.
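For reference, the raw building block would look something like the sketch below - just the 7-bit, 128-byte parallel lookup itself, with the speculative decoding around it left as the open idea described above:

#include <immintrin.h>

// Requires AVX512_VBMI. Each of the 64 byte lanes of `indices` uses its low
// 7 bits to select one byte from the 128-byte table held in tableLo:tableHi,
// i.e. 64 independent table lookups in a single instruction, no memory access.
static __m512i lookup128(__m512i indices, __m512i tableLo, __m512i tableHi) {
  return _mm512_permutex2var_epi8(tableLo, indices, tableHi);
}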
Based on the benchmarks shown in this post, Blend2D's PNG decoder is probably one of the fastest on the market, and it's completely free and open-source! Since both zune-image and image-rs are roughly comparable to wuffs (at least according to this post), it's also much faster than decoders written in other programming languages, but of course independent verification would be great.
Memory safety doesn't play any role in terms of performance in this case, because we are talking about a completely scalar algorithm, where lower latency translates into higher throughput. The heart of every optimized DEFLATE decoder is always a "fast" loop that ensures up front that ALL memory accesses within its body are safe, which again translates into lower latency and thus higher throughput.
I would love to see my ideas further refined and perfected. I would be very interested in a working AVX512_VBMI implementation as mentioned in the Further Ideas section.