Introduction to Multithreaded Rendering

Blend2D's rendering context can utilize multiple threads that render to the same target image. This section elaborates on various aspects of this feature and includes useful tips to avoid problems that naturally arise with asynchronous rendering, i.e. the rendering context doesn't change the content of the target image immediately. Instead, it serializes the work to be done into batches that are then executed by worker threads when ready.

Since most of the software-based 2D rendering engines are synchronous some users may expect that calling a function like fillRect() would immediately change relevant pixels of the target image and then return. This is true for many 2D libraries like AGG, Cairo, Qt, and Skia when a software accelerated rendering is used. We can describe such rendering mode as synchronous as it blocks until each operation is completed. However, this assumption is not true with Blend2D's multi-threaded rendering context implementation. Such implementations will be referred to as asynchronous, because in that case the rendering context only serializes the requested operations to be completed later in batches.

Important

The multi-threaded rendering engine implementation is still experimental. We believe that its performance can be improved in the future, especially its synchronization and resource sharing parts. We definitely need more feedback from Blend2D users to improve the implementation even further.

Initialization & Finalization

By default a single-threaded (and thus synchronous rendering context) is created unless a filled BLContextCreateInfo is passed to the rendering context's constructor or begin() member function. A threadCount member specifies how many threads to use. At the moment there are no defaults or heuristics in the library so users must carefully use a value that makes sense for their workloads. In general 2 to 4 threads should be suitable for most workloads including real-time rendering to 4K frame-buffers. If the framebuffer is relatively small (e.g. 200x200 pixels) it's recommended to either use single-threaded rendering by default or to at least benchmark whether it has any benefits before switching to multi-threaded rendering.

The initialization may look like this:

#include <blend2d.h>

int main(int argc, char* argv[]) {
  BLImage img(500, 500, BL_FORMAT_PRGB32);

  // The best is to use {} initializer so the struct content is zeroed.
  // Blend2D guarantees that any future additions will work well with
  // zeros by default.
  BLContextCreateInfo createInfo {};

  // Configure the number of threads to use.
  createInfo.threadCount = 2;

  // Create the rendering context.
  //
  // Either use constructor like `BLContext ctx(img, creteInfo)` or
  // `BLContext::begin()`, which returns an error code that can be
  // inspected. This operation shouldn't fail by default unless there
  // is not enough memory to allocate the context and some helper data
  // structures.
  //
  // In addition, if the internal thread-pool is exhausted the rendering
  // context would still be asynchronous, but would only use the user
  // thread. This behavior can be overridden by changing initialization
  // flags in 'createInfo'.
  BLContext ctx;
  BLResult result = ctx.begin(img, createInfo);

  // Verify the rendering context was created.
  if (result != BL_SUCCESS)
    return 1;

  // ... render something ...

  // Use either ctx.flush() or ctx.end() to synchronize, will be described
  // later.
  ctx.end();

  // Now you can use the content of `img`.
  img.writeToFile("bl-multithreaded-example-init.bmp");

  return 0;
}

It can be seen from the previous example that Blend2D provides the same API for both synchronous and asynchronous rendering, only the initialization phase is different. Additionally, one important detail is the significance of the end() function. For the synchronous case it's a relatively cheap call that only releases some resources and detaches the rendering context from the image. However, in the asynchronous case, end() actually synchronizes first. The rendering will start at that point and the execution will block until it's done. Only after it's done the resources can be released and the function can return.

Synchronization

There are two functions that can be used to require synchronization of the rendering context, which means flushing the render queue and waiting for completion:

  • flush(BL_CONTEXT_FLUSH_SYNC) - flushes all pending operations and waits for their completion
  • end() - this call does the same as flush(BL_CONTEXT_FLUSH_SYNC) in addition to detaching the rendering context from the target image after there are no more pending operations

Both flush() and end() calls are blocking, which means that they would only return after all threads finish their work. The importance of flush can be illustrated on the next example:

#include <blend2d.h>

int main(int argc, char* argv[]) {
  BLContextCreateInfo createInfo {};
  createInfo.threadCount = 2;

  BLImage img(500, 500, BL_FORMAT_PRGB32);
  BLContext ctx(img, createInfo);

  ctx.fillAll(BLRgba32(0xFF000000));
  ctx.fillRoundRect(BLRoundRect(25, 25, 450, 450, 50), BLRgba32(0xFFFFFFFF));

  // This won't work! The content of the pixels is undefined at the moment.
  img.writeToFile("bl-multithreaded-example-sync-wrong.bmp");

  // This actually flushes all render calls and synchronizes.
  ctx.flush(BL_CONTEXT_FLUSH_SYNC);

  // Now the image contains everything we have asked for.
  img.writeToFile("bl-multithreaded-example-sync-right.bmp");

  // This would synchronize if there were any pending operations, but in
  // our case it would just detach the rendering context from the image
  // and return all threads to the thread pool (which is a cheap operation).
  ctx.end();

  return 0;
}

Lifetime of Passed Objects

All objects that are passed to the rendering context (such as gradients, patterns, images, and paths) must live through asynchronous processing. To simplify this Blend2D automatically manages the lifetime of such objects by using reference counting. When an object is passed to the rendering context, and the context decides to keep that object for later use, it would temporarily increase the reference count of such object. In other words, it creates a weak copy of it so that it can be used later by a worker thread later. Blend2D uses atomic reference counting, which enables the sharing of objects between multiple threads.

There is one important detail regarding reference counting in Blend2D, called copy-on-write. When two BLPath instances share the same data and one instance is being changed, the data has to be copied first so that the other instance's data doesn't change as well. It is called copy-on-write, because the copy will only be made if such object is changed.

Let's demonstrate how it works:

#include <blend2d.h>

int main(int argc, char* argv[]) {
  BLContextCreateInfo createInfo {};
  createInfo.threadCount = 2;

  BLImage img(500, 500, BL_FORMAT_PRGB32);
  BLContext ctx(img, createInfo);

  ctx.fillAll(BLRgba32(0xFF000000));

  {
    // Let's construct a path - now it will have a reference count 1.
    BLPath path;
    path.addRoundRect(BLRoundRect(25, 25, 450, 450, 25));

    // The reference count of path may change depending on the rendering type
    // and path complexity:
    //
    //   - Synchronous rendering - path is consumed directly so there is no
    //     need to keep it. The rendering context uses the path to build edges
    //     and then such edges will be used by the rasterizer.
    //
    //   - Asynchronous rendering - depending on path complexity the rendering
    //     context may construct edges for the rasterizer directly or it may
    //     decide to keep the path and construct edges later. If it decides to
    //     keep the path then its reference count will be increased by 1.
    ctx.fillPath(path, BLRgba32(0xFFFFFFFF));

    // Now the path destructor will be called. If the reference count is greater
    // than one then the path data will outlive this scope and the rendering context
    // will destroy it after it's been processed.
  }

  // Even if the path data was still alive at this point, it won't outlive neither
  // flush() nor end() calls.
  ctx.end();

  return 0;
}

The same scenario applies to images. E.g. the in asynchronous case it's not possible to blit an image, and then immediately change its content and then blit it again. To be more precise it woudle be possible, but when changing the image a deep copy of the image will be made before attempting to change it.

#include <blend2d.h>

int main(int argc, char* argv[]) {
  BLContextCreateInfo createInfo {};
  createInfo.threadCount = 2;

  BLImage img(500, 500, BL_FORMAT_PRGB32);
  BLContext ctx(img, createInfo);

  BLImage sprite;
  sprite.readFromFile("some_sprite.png");

  ctx.fillAll(BLRgba32(0xFF000000));

  // If the rendering context is asynchronous the sprite's reference count will
  // be increased by one by the rendering context as it won't blit it now.
  ctx.blitImage(BLPointI(0, 0), sprite);

  // Now the sprite's reference count may be greater than zero, which means that
  // a copy will have to be made if it has to be changed. For example attaching
  // a different rendering context to that image would make a copy of it.
  {
    BLContext ctx2(sprite);
    ctx2.fillRect(BLRectI(10, 10, 100, 100), BLRgba32(0xFFFFFFFF));
  }

  // If the rendering context is asynchronous the sprite's reference count will
  // be increased by one again, but in this case the sprite's internal data will
  // already be different compared to the first blitted sprite. This means that
  // after this call the rendering context will have weak copies of two independent
  // images.
  ctx.blitImage(BLPointI(200, 200), sprite);

  // As usual - end() will drop all references to images passed to blitImage().
  ctx.end();

  return 0;
}

In the example above the user would actually get the same result in both the synchronous and asynchronous case. However, in the asynchronous case the image data will be copied by the secondary rendering context as it cannot start rendering into a shared image that is still in use by the primary rendering context.

Geometries Are Lightweight

The rendering context can render paths and simpler geometries that are passed as C/C++ structs. Since the multi-threaded rendering context has to retain all paths passed to it, it's sometimes cheaper to use geometries instead. For example to fill a rounded rectangle, use fillRound() instead of constructing a path and then filling it via fillPath(). To retain a geometry the rendering context just copies its data to a buffer, which would be then used by worker threads. Making a copy of a small C/C++ structure is much cheaper than managing a shared resource.

For example, consider the following example:

#include <blend2d.h>

int main(int argc, char* argv[]) {
  BLContextCreateInfo createInfo {};
  createInfo.threadCount = 2;

  BLImage img(500, 500, BL_FORMAT_PRGB32);
  BLContext ctx(img, createInfo);

  ctx.fillAll(BLRgba32(0xFF000000));

  // This rounded rectangle will be serialized as is, it won't use BLPath.
  ctx.fillRound(BLRoundRect(100, 100, 400, 100, 50), BLRgba32(0xFFFF0000));

  // This rounded rectangle is in fact a BLPath, and the rendering context
  // will have to retain it.
  BLPath p;
  p.addRoundRect(BLRoundRect(100, 300, 400, 100, 50));
  ctx.fillPath(p, BLRgba32(0xFF0000FF));

  ctx.end();

  return 0;
}

Conclusion

This section has illustrated how the multi-threaded rendering context changes the lifetime of passed objects. It should have also been clarified that the user may expect those objects released by either calling flush() with particular flags or end(). Finally, it has described the functionality of copy-on-write which results in deep copy of objects that are mutated and still in use by the rendering context.

It's also worth noting that calling flush() with BL_CONTEXT_FLUSH_SYNC flag would start worker threads and wait for all of them to complete. These calls should be considered exceptional and not be overused. It isn't cheap to wake up all worker threads and wait for them to complete their work, which is the main reason that the rendering context uses batching. The result is a minimal synchronization and maximum throughput.