The Quite OK Image Format (QOI) is an image format announced on 24 November 2021. We weren't really paying attention back then, as the format seemed too new and we didn't know whether the existing ecosystem would adopt it. However, since many applications and libraries have added QOI support, we have decided to add a built-in QOI codec to Blend2D too!
Blend2D is all about optimizations, so when we were implementing QOI support we explored how to optimize the codec to achieve better performance than the QOI reference implementation and other variants such as magicqoi, which claims performance improvements.
The core goal of the QOI image format is to be simple to understand and implement. The format itself is quite limited and focuses only on 8-bit RGB and RGBA image data. It doesn't use any elaborate data structures, Huffman coding, etc., so its ability to compete with PNG in terms of image size is pretty remarkable. QOI uses 6 operations to encode a pixel or a run of pixels (a chunk encoding):
- QOI_OP_RGB - encodes the RGB components of an RGBA pixel; the alpha value is taken from the previous pixel (4 bytes of data)
- QOI_OP_RGBA - encodes all RGBA components of an RGBA pixel (5 bytes of data)
- QOI_OP_INDEX - encodes an RGBA pixel by referencing previously stored pixels in a 64-entry hash table (1 byte of data)
- QOI_OP_DIFF & QOI_OP_LUMA - encode a difference to the previous pixel (either 1 or 2 bytes of data)
- QOI_OP_RUN - encodes a run of pixels, repeating the previous pixel value (1 byte of data)

Each pixel the encoder or decoder sees is added to a 64-entry hash table, so that QOI_OP_INDEX can find it in case it's back-referenced. That's it! This is a simplified overview of the QOI image format.
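To make the hash-table part concrete, here is a minimal sketch (ours, not the reference implementation's code) of the index-table bookkeeping that both the encoder and decoder perform; qoiHash refers to the reference hash function shown later in this post:

// Hypothetical sketch of the 64-entry QOI index table (names are ours).
struct QoiIndexTable {
    // Previously seen pixels, keyed by the QOI hash of their RGBA value.
    // QOI requires the table to start zero-initialized.
    uint32_t pixels[64] {};

    // Called for every pixel the encoder/decoder sees.
    void put(uint32_t pixel) noexcept { pixels[qoiHash(pixel)] = pixel; }

    // Used by QOI_OP_INDEX, which stores the 6-bit table index directly.
    uint32_t get(uint32_t index) const noexcept { return pixels[index]; }
};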
Unlike PNG, QOI doesn't use the DEFLATE algorithm, so all of its compression comes from the chunk encodings described above. When studying the format we made a few observations:

- The hash function (R * 3 + G * 5 + B * 7 + A * 11) % 64 is not particularly good, as it ignores the 2 MSB bits of each component, but this is something that the community around QOI has already discussed.
- QOI_OP_DIFF specifying a difference of [0, 0, 0] is the same as QOI_OP_RUN specifying a repeat sequence of 1. This basically wastes a chunk header byte that could have been repurposed for something else, or QOI_OP_RUN should have been biased to describe a run of at least 2 pixels. In addition, referencing the previous pixel can also be done by QOI_OP_INDEX, so there are 3 chunk encodings in total that can encode a single repeat of the previous pixel.
- The chunk encoding makes it almost impossible to use SIMD with QOI, which means that optimizations should first focus on improving the math used by each chunk; statistics could then be employed to implement a SIMD decoder.

We started with optimizing the hash table:
static constexpr uint32_t kQoiHashR = 3u;
static constexpr uint32_t kQoiHashG = 5u;
static constexpr uint32_t kQoiHashB = 7u;
static constexpr uint32_t kQoiHashA = 11u;
static constexpr uint32_t kQoiHashMask = 0x3Fu;

// Reference implementation of the QOI hash function.
//
// NOTE: This one is already a little optimized as it doesn't use an AND
// operation to unmask each component. The masking is unnecessary, because
// only the 6 LSB bits of the result are kept, and the neighboring components
// that leak into each term are multiplied by 256, which is 0 modulo 64.
static uint32_t qoiHash(uint32_t pixel) noexcept {
    return ((pixel >> 24) * kQoiHashA +
            (pixel >> 16) * kQoiHashR +
            (pixel >>  8) * kQoiHashG +
            (pixel >>  0) * kQoiHashB) & kQoiHashMask;
}
C++ compilers are generally able to replace the multiplications with shifts and additions, which is great, but it's still a lot of instructions to execute. We developed an optimized version of the hash function that takes advantage of 64-bit arithmetic (on 64-bit targets) and of unpacking a single 32-bit ARGB pixel into the 0x00AA00GG00RR00BB form, an old trick used in non-SIMD alpha blending. The same form can also be used for applying pixel deltas, which we explain next.
The optimized hash function thus becomes:
// QOI hash function optimized for 64-bit targets.
//
// The input is a pixel unpacked into the 0x00AA00GG00RR00BB form. A single
// multiplication moves each component, times its hash coefficient, into the
// top 6 bits, which the final shift extracts.
static uint32_t hashPixelAGxRBx64(uint64_t ag_rb) noexcept {
    ag_rb *= (uint64_t(kQoiHashA) << ( 8 + 2)) + (uint64_t(kQoiHashG) << (24 + 2)) +
             (uint64_t(kQoiHashR) << (40 + 2)) + (uint64_t(kQoiHashB) << (56 + 2));
    return uint32_t(ag_rb >> 58);
}

// QOI hash function optimized for 32-bit targets (avoids a 64-bit multiply).
static uint32_t hashPixelAGxRBx32(uint32_t ag, uint32_t rb) noexcept {
    ag *= (kQoiHashA << ( 0 + 2)) + (kQoiHashG << (16 + 2));
    rb *= (kQoiHashR << ( 8 + 2)) + (kQoiHashB << (24 + 2));
    return (ag + rb) >> 26;
}

static uint32_t hashPixelRGBA32(uint32_t pixel) noexcept {
#if BL_TARGET_ARCH_BITS >= 64
    return hashPixelAGxRBx64(((uint64_t(pixel) << 24) | pixel) & 0x00FF00FF00FF00FFu);
#else
    return hashPixelAGxRBx32(pixel & 0xFF00FF00u, pixel & 0x00FF00FFu);
#endif
}
The hash function itself (i.e., without the unpacking step) translates to the following instructions (x86_64):
hashPixelAGxRBx64(uint64_t@rdi):
    movabs rax, 13194139544578
    imul   rax, rdi
    shr    rax, 58
    ret
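As a sanity check, the optimized function must produce exactly the same value as the reference hash, since both compute (R * 3 + G * 5 + B * 7 + A * 11) modulo 64. A small test sketch (ours) that verifies the equivalence on a few sample pixels:

#include <cassert>
#include <cstdint>

// Verifies that hashPixelRGBA32() matches the reference qoiHash() (our test).
static void testHashEquivalence() noexcept {
    const uint32_t samples[] = { 0x00000000u, 0xFFFFFFFFu, 0x12345678u, 0x80FF01C3u };
    for (uint32_t pixel : samples)
        assert(hashPixelRGBA32(pixel) == qoiHash(pixel));
}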
Blend2D decodes both QOI_OP_RGB and QOI_OP_RGBA chunks branchlessly. The idea is to blend the decoded pixel with the previous one based on a mask calculated from the chunk type:
static uint32_t decodeRGBX(uint32_t prev, uint32_t header, const uint8_t* rgbaData) noexcept {
    // Handle both QOI_OP_RGB and QOI_OP_RGBA at the same time. Note that 4
    // bytes are always read - for a QOI_OP_RGB chunk the last byte belongs to
    // the next chunk and is discarded by the mask below.
    uint32_t pixel = (uint32_t(rgbaData[0]) << 16) |
                     (uint32_t(rgbaData[1]) <<  8) |
                     (uint32_t(rgbaData[2]) <<  0) |
                     (uint32_t(rgbaData[3]) << 24) ;

    // QOI_OP_RGB  == 0xFE -> msk == 0xFF000000 (keep the previous alpha)
    // QOI_OP_RGBA == 0xFF -> msk == 0x00000000 (take all components)
    uint32_t msk = (header + 1) << 24;
    return (prev & msk) | (pixel & ~msk);
}
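A hypothetical call site (our sketch, not Blend2D's actual decoder loop) could look like the following, with pixel holding the previously decoded pixel and p pointing at the chunk's header byte. Since decodeRGBX() always reads 4 payload bytes, a real decoder must ensure those reads stay within the input buffer:

uint32_t header = p[0];
if (header >= 0xFEu) {                 // QOI_OP_RGB or QOI_OP_RGBA
    pixel = decodeRGBX(pixel, header, p + 1);
    p += 4u + (header & 1u);           // 1 header byte + 3 or 4 payload bytes
}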
Both QOI_OP_DIFF and QOI_OP_LUMA are used to store a new pixel value based on the previous pixel and a small delta of RGB components. We can decode either of them without branching, again based on masking and on selecting the right delta based on the header. But that's not all: we can reuse the same 64-bit unpacked pixel form used by hashing to apply the deltas of all 3 components in a single arithmetic operation.
// Computes the unpacked delta of either a QOI_OP_DIFF or QOI_OP_LUMA chunk.
static uint64_t calculateDiffLumaDelta64(uint32_t hbyte0, uint32_t hbyte1) noexcept {
    // QOI_OP_DIFF chunk (0b01xxxxxx) - the bias is the same for all RGB channels (-2).
    constexpr uint64_t kDiffBiasRGB = 0x100u - 2u;
    constexpr uint64_t kDiffBiasAll = kDiffBiasRGB * 0x000100010001u;

    // QOI_OP_LUMA chunk (0b10xxxxxx) - the bias is -32 for the G channel and -40
    // for the R and B channels. The XOR cancels the 0b10xxxxxx tag bit, which is
    // broadcast into each lane together with the green delta below.
    constexpr uint64_t kLumaBiasG  = (0x100u ^ 0x80u) - 32;
    constexpr uint64_t kLumaBiasRB = (0x100u ^ 0x80u) - 40;
    constexpr uint64_t kLumaBiasAll = kLumaBiasRB * 0x00010001u + (kLumaBiasG << 32);

    // Compute both DIFF and LUMA deltas and then blend between the two without branching.
    uint64_t lumaDelta = (uint32_t(hbyte1) * (1u | (1u << 12))              ) & 0x00000000000F000Fu;
    uint64_t diffDelta = (uint64_t(hbyte0) * (1u | (1u << 12) | (1u << 30))) & 0x0000000300030003u;

    lumaDelta += uint64_t(hbyte0) * 0x0000000100010001u;
    diffDelta += kDiffBiasAll;
    lumaDelta += kLumaBiasAll;

    return hbyte0 < QOI_OP_LUMA ? diffDelta : lumaDelta;
}
The function above takes two header bytes, hbyte0 and hbyte1, that specify either a QOI_OP_DIFF or a QOI_OP_LUMA chunk, and calculates both deltas; the right one is selected at the end based on the first header byte. The function returns a 64-bit value (an unpacked pixel delta), which is then added to the previous unpacked pixel and packed back to a 32-bit pixel value. The hash value of the new pixel is calculated from the intermediate 64-bit unpacked representation.
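For illustration, here is a minimal sketch (our code, not Blend2D's, assuming the unpacked 0x00AA00GG00RR00BB form described earlier; the helper name is hypothetical) of how the returned delta could be applied and the result packed back:

static uint32_t applyUnpackedDelta(uint32_t prevPixel, uint64_t delta) noexcept {
    // Unpack the previous ARGB32 pixel into the 0x00AA00GG00RR00BB form.
    uint64_t unpacked = ((uint64_t(prevPixel) << 24) | prevPixel) & 0x00FF00FF00FF00FFu;

    // A single 64-bit addition applies all component deltas; the mask makes
    // each 16-bit lane wrap modulo 256, as QOI requires.
    unpacked = (unpacked + delta) & 0x00FF00FF00FF00FFu;

    // This is also the point where hashPixelAGxRBx64(unpacked) can be called.
    // Pack back to a 32-bit ARGB pixel.
    return uint32_t(unpacked >> 24) | uint32_t(unpacked & 0x00FF00FFu);
}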
The optimized approach above is great, but we can do much better if we take the first byte and use it as an index into a lookup table that stores base deltas for both QOI_OP_DIFF and QOI_OP_LUMA chunks. Since the LUT provides all precalculated deltas for the first byte, we only need to handle the second byte, which encodes the deltas of the R and B components.
// Lookup table generator that provides delta values for QOI_OP_DIFF and for the
// first byte of QOI_OP_LUMA. Additionally, it provides a zero delta for an RLE
// run of a single pixel, for possible experimentation.
struct IndexDiffLumaTableGen {
    static constexpr uint32_t rgb(uint32_t r, uint32_t g, uint32_t b, uint32_t lumaMask) noexcept {
        return ((r & 0xFFu) << 24) |
               ((g & 0xFFu) << 16) |
               ((b & 0xFFu) <<  8) |
               ((lumaMask  ) <<  0) ;
    }

    static constexpr uint32_t diff(uint32_t b0) noexcept {
        return rgb(((b0 >> 4) & 0x3u) - 2u, ((b0 >> 2) & 0x3u) - 2u, ((b0 >> 0) & 0x3u) - 2u, 0x00u);
    }

    static constexpr uint32_t luma(uint32_t b0) noexcept {
        return rgb(b0 - 40u, b0 - 32u, b0 - 40u, 0xFFu);
    }

    static constexpr uint32_t value(size_t idx) noexcept {
        return idx <  64u ? diff(uint32_t(idx      )) :
               idx < 128u ? luma(uint32_t(idx - 64u)) : 0u;
    }
};

static constexpr LookupTable<uint32_t, 129> qoiIndexDiffLumaLUT = makeLookupTable<uint32_t, 129, IndexDiffLumaTableGen>();
With the table above it's possible to take the first byte of either a QOI_OP_DIFF or QOI_OP_LUMA chunk and use hbyte0 - 64 as an index into the LUT to get the base deltas. If the chunk is QOI_OP_LUMA, it's possible to branchlessly extract the extra deltas from the second byte and accumulate them with the base delta from the lookup table. This approach is much shorter than the previous optimization and doesn't need multiplications at all, so only low-latency instructions are required. The lumaMask byte from the table is used to either mask or unmask the second header byte, which is only consumed when the chunk is QOI_OP_LUMA.
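To make the idea concrete, here is a minimal sketch (ours, not Blend2D's exact decoder) of consuming a LUT entry, with hbyte0 and hbyte1 being the two header bytes as before:

uint32_t entry = qoiIndexDiffLumaLUT[hbyte0 - 64u];
uint32_t msk   = entry & 0xFFu;         // lumaMask: 0xFF for QOI_OP_LUMA, 0x00 for QOI_OP_DIFF
uint32_t b1    = hbyte1 & msk;          // the second byte only participates for QOI_OP_LUMA

// Per-component deltas (modulo 256) to add to the previous pixel.
uint32_t dr = ((entry >> 24) + (b1 >> 4)  ) & 0xFFu;
uint32_t dg = ((entry >> 16)              ) & 0xFFu;
uint32_t db = ((entry >>  8) + (b1 & 0xFu)) & 0xFFu;

// The chunk is 1 byte (QOI_OP_DIFF) or 2 bytes (QOI_OP_LUMA) long.
size_t chunkSize = 1u + (msk & 0x1u);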
Of course, we measured performance throughout development, and all of the optimizations mentioned in this post helped Blend2D's QOI codec achieve the best results compared with other implementations. The following decode and encode times of the QOI images below, in milliseconds, were measured on a development machine with an AMD Ryzen 7950X CPU:
$ ./bl-bench-qoi
[qoi_ref ] qoi/dice.qoi : Decode 1.000 [ms] | Encode 1.638 [ms]
[magicqoi] qoi/dice.qoi : Decode 0.785 [ms] | Encode 0.955 [ms]
[blend2d ] qoi/dice.qoi : Decode 0.716 [ms] | Encode 0.800 [ms]
[qoi_ref ] qoi/edgecase.qoi : Decode 0.022 [ms] | Encode 0.040 [ms]
[magicqoi] qoi/edgecase.qoi : Decode 0.008 [ms] | Encode 0.011 [ms]
[blend2d ] qoi/edgecase.qoi : Decode 0.003 [ms] | Encode 0.008 [ms]
[qoi_ref ] qoi/kodim10.qoi : Decode 2.562 [ms] | Encode 3.439 [ms]
[magicqoi] qoi/kodim10.qoi : Decode 2.437 [ms] | Encode 3.452 [ms]
[blend2d ] qoi/kodim10.qoi : Decode 1.380 [ms] | Encode 2.037 [ms]
[qoi_ref ] qoi/kodim23.qoi : Decode 2.521 [ms] | Encode 3.599 [ms]
[magicqoi] qoi/kodim23.qoi : Decode 2.451 [ms] | Encode 3.561 [ms]
[blend2d ] qoi/kodim23.qoi : Decode 1.357 [ms] | Encode 2.006 [ms]
[qoi_ref ] qoi/qoi_logo.qoi : Decode 0.119 [ms] | Encode 0.251 [ms]
[magicqoi] qoi/qoi_logo.qoi : Decode 0.029 [ms] | Encode 0.054 [ms]
[blend2d ] qoi/qoi_logo.qoi : Decode 0.019 [ms] | Encode 0.048 [ms]
[qoi_ref ] qoi/testcard.qoi : Decode 0.121 [ms] | Encode 0.173 [ms]
[magicqoi] qoi/testcard.qoi : Decode 0.063 [ms] | Encode 0.080 [ms]
[blend2d ] qoi/testcard.qoi : Decode 0.035 [ms] | Encode 0.060 [ms]
[qoi_ref ] qoi/wikipedia_008.qoi : Decode 6.794 [ms] | Encode 8.389 [ms]
[magicqoi] qoi/wikipedia_008.qoi : Decode 6.639 [ms] | Encode 8.593 [ms]
[blend2d ] qoi/wikipedia_008.qoi : Decode 3.469 [ms] | Encode 5.921 [ms]
As can be seen, Blend2D beats the other implementations tested. According to the Linux perf tool, Blend2D's codec also had the lowest branch misprediction rate during our testing.
The images used for benchmarking were downloaded from the QOI homepage (download link). For comparison, here are the results of a PNG benchmark that uses the same set of images, but encoded as PNGs:
$ ./bl-bench-png
[libpng ] qoi/dice.png : Decode 5.022 [ms]
[wuffs ] qoi/dice.png : Decode 3.526 [ms]
[spng ] qoi/dice.png : Decode 3.292 [ms]
[stbimage] qoi/dice.png : Decode 4.170 [ms]
[blend2d ] qoi/dice.png : Decode 3.820 [ms]
[libpng ] qoi/edgecase.png : Decode 0.109 [ms]
[wuffs ] qoi/edgecase.png : Decode 0.079 [ms]
[spng ] qoi/edgecase.png : Decode 0.088 [ms]
[stbimage] qoi/edgecase.png : Decode 0.037 [ms]
[blend2d ] qoi/edgecase.png : Decode 0.040 [ms]
[libpng ] qoi/kodim10.png : Decode 6.837 [ms]
[wuffs ] qoi/kodim10.png : Decode 4.487 [ms]
[spng ] qoi/kodim10.png : Decode 5.421 [ms]
[stbimage] qoi/kodim10.png : Decode 5.330 [ms]
[blend2d ] qoi/kodim10.png : Decode 4.855 [ms]
[libpng ] qoi/kodim23.png : Decode 5.738 [ms]
[wuffs ] qoi/kodim23.png : Decode 4.156 [ms]
[spng ] qoi/kodim23.png : Decode 4.977 [ms]
[stbimage] qoi/kodim23.png : Decode 4.939 [ms]
[blend2d ] qoi/kodim23.png : Decode 4.396 [ms]
[libpng ] qoi/qoi_logo.png : Decode 0.498 [ms]
[wuffs ] qoi/qoi_logo.png : Decode 0.493 [ms]
[spng ] qoi/qoi_logo.png : Decode 0.254 [ms]
[stbimage] qoi/qoi_logo.png : Decode 0.198 [ms]
[blend2d ] qoi/qoi_logo.png : Decode 0.294 [ms]
[libpng ] qoi/testcard.png : Decode 0.291 [ms]
[wuffs ] qoi/testcard.png : Decode 0.269 [ms]
[spng ] qoi/testcard.png : Decode 0.217 [ms]
[stbimage] qoi/testcard.png : Decode 0.168 [ms]
[blend2d ] qoi/testcard.png : Decode 0.230 [ms]
[libpng ] qoi/wikipedia_008.png : Decode 13.798 [ms]
[wuffs ] qoi/wikipedia_008.png : Decode 13.050 [ms]
[spng ] qoi/wikipedia_008.png : Decode 11.463 [ms]
[stbimage] qoi/wikipedia_008.png : Decode 13.087 [ms]
[blend2d ] qoi/wikipedia_008.png : Decode 12.330 [ms]
We have successfully used QOI as an exchange format in Blend2D Fiddle. QOI reduced the execution time of most requests by 15-20% compared to the PNG format we used previously, although compilation time still dominates.
QOI is indeed simple and very fast to decode and encode. Blend2D's implementation also seems to be the quickest, as it improves on several key areas compared to the others. We are not looking for further optimizations, as we consider the current performance satisfactory, but we would be happy if other implementations picked up ideas from Blend2D's decoder to make QOI a faster format in general.
QOI is not perfect, and it could have been a better format. The redundancy in the QOI_OP_DIFF and QOI_OP_RUN chunk encodings hurts, but that's less of a problem than having no proper support for grayscale and alpha-only formats (although supporting those would require a different chunk encoding specialized for single byte-per-pixel data). Variations of QOI exist that address these limitations, but we don't plan to support any of them at the moment.