This page discusses the most important work that makes Blend2D innovative. Many engines that provide 2D rendering capabilities are very similar in design, unlike Blend2D, which was designed from scratch to deliver maximum performance. Dynamic pipeline construction is probably the most important feature that makes Blend2D unique; however, there are many more innovations in the library that are worth discussing.
This page is a work-in-progress and will be completed in the future.
Blend2D uses a JIT compiler to generate 2D pipelines at runtime. The pipeline generator, called PipeGen, uses the AsmJit library, which provides everything needed to write JIT compilers. Dynamic pipeline construction allows Blend2D to combine multiple stages of the 2D rendering pipeline into a single stage, called simply a pipeline in Blend2D terminology. This design eliminates the need to use intermediate buffers for exchanging data between rendering stages.
PipeGen is modular by design and composes a pipeline from parts. Each part is responsible for a different task and can reference other parts that it calls during code generation. Parts can be split into the following categories:
FillPart - describes where to fill (a rectangle or cells from the rasterizer)
FetchPart - describes the fill style (solid color, gradient, or texture)
CompOpPart - describes the composition operator used to combine source pixels with backdrop (destination) pixels
Each pipeline can be uniquely identified by a pipeline signature, a 32-bit number that describes what the pipeline does (which parts must be used to build it) and some other properties such as pixel formats. The pipeline signature is then used to manage already compiled pipelines so the rendering context can find them instead of compiling a new one each time a pipeline is needed. Each pipeline is compiled only once, on demand, and then cached.
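As an illustration of the idea, a signature can be built by packing the properties that select pipeline parts into bit fields of a single 32-bit integer. The field names and widths below are hypothetical, not Blend2D's actual encoding:

```cpp
#include <cstdint>

// Hedged sketch of a pipeline signature: the field layout is illustrative,
// not Blend2D's real one. The resulting key can be used in a hash map to
// look up an already compiled pipeline before compiling a new one.
static uint32_t makeSignature(uint32_t dstFormat, uint32_t srcFormat,
                              uint32_t compOp, uint32_t fillType,
                              uint32_t fetchType) {
  return (dstFormat <<  0) |  // 4 bits for destination pixel format
         (srcFormat <<  4) |  // 4 bits for source pixel format
         (compOp    <<  8) |  // 6 bits for compositing operator
         (fillType  << 14) |  // 2 bits for fill type
         (fetchType << 16);   // remaining bits for fetch/style type
}
```

Two pipelines that differ in any property get different keys, so the cache never confuses them.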
A lot of research went into the development of PipeGen with the goal of minimizing the number of pipelines that have to be generated at runtime for the most common operations. Merging all operations into a single pipeline means that every possible combination requires its own pipeline. The problematic parts are compositing operators and source styles, as these have many additional combinations such as pixel formats and extend modes. The following optimizations were implemented to minimize the theoretical maximum number of pipelines:
FillRule parametrization. Only a single value is needed to describe whether the pipeline should fill using the NonZero or EvenOdd fill rule. This is possible because the analytic rasterizer doesn't need to know anything about the fill rule in advance. The approach is actually very simple and is vectorized; it's common for pipelines to calculate 4 alpha masks at once.
ExtendMode parametrization. Gradients and textures use extend modes to describe how source pixels are fetched outside their natural bounds. PipeGen is able to compile a single pipeline that can be parameterized to either pad, repeat, or reflect the source. As a result of this parametrization, Blend2D allows extend modes of textures to be specified independently for X and Y.
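The extend-mode parametrization above can be sketched in scalar code. This is a hedged illustration, not Blend2D's implementation; a generated pipeline evaluates the equivalent with SIMD, and can apply a different mode per axis:

```cpp
// Illustrative sketch: one coordinate-mapping function parameterized by a
// runtime extend mode, instead of three separately compiled pipelines.
enum class ExtendMode { Pad, Repeat, Reflect };

static int applyExtend(int x, int size, ExtendMode mode) {
  switch (mode) {
    case ExtendMode::Pad:                  // clamp to the edge pixels
      return x < 0 ? 0 : (x >= size ? size - 1 : x);
    case ExtendMode::Repeat: {             // wrap around (tile)
      int m = x % size;
      return m < 0 ? m + size : m;
    }
    case ExtendMode::Reflect: {            // mirror with period 2*size
      int period = 2 * size;
      int m = x % period;
      if (m < 0) m += period;
      return m < size ? m : period - 1 - m;
    }
  }
  return 0;
}
```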
PipeGen takes advantage of SIMD instructions to accelerate gradient and texture steps during pixel fetching.
Blend2D implements an analytic rasterizer written from scratch that has the same output quality as the rasterizers used by FreeType and AntiGrain. The approach to calculating alpha coverage is similar to FreeType and AntiGrain with some additional tweaks and optimizations; however, Blend2D uses a completely different approach to managing the data that the rasterizer outputs and works with.
The primary goal of writing a new rasterizer was to avoid the sparse data structures used to manage cells that are altered during rasterization and then used by the pipeline to calculate final alpha masks. It was observed that a sparse cell buffer only has benefits when rendering something very simple. When rendering complicated art it becomes a liability, because the rasterizer must first find the cell it wants to alter, and if the sparse cell buffer is implemented as a linked list (FreeType, Qt), each new cell span makes the search more time consuming, as more list items have to be iterated over. There are possible solutions to this problem, such as binary trees; however, those would make cell management even more complicated, and this idea was not explored, because the goal was to simplify it instead.
The Blend2D Performance page can actually serve as evidence that a sparse cell buffer is not a good idea for rendering complex vector art. All FillPoly and StrokePoly tests show that Qt scales badly in these benchmarks, and the primary cause is linked-list traversal. The problem is that Qt uses the FreeType rasterizer, which was developed for rasterizing glyphs stored in font files (these are usually not very complex), and such tests stress the rasterizer beyond its limits.
The answer to the question "how to avoid a sparse cell buffer?" was always "use a dense cell buffer". In fact, this was never the question; the question was always how to manage a dense cell buffer efficiently. The initial problem with a dense cell buffer was the allocation of cells for the whole rendering area. If such an area is 4096x2160, then the space requirement for a dense cell buffer would be 4096*2160*sizeof(uint32_t), which is equal to 33.75MB. This is unacceptable, because such data would be needed even when rendering only a single pixel. In addition, there is another problem: the pipeline would have to iterate over the whole buffer to actually composite such a pixel. This approach would only be beneficial if the rendering operation is very small (like 8x8 pixels), and even that would not guarantee it would be faster than iterating over sparse cells.
To solve this problem Blend2D introduces the concept of banding. Banding means that instead of allocating a dense cell buffer for the whole destination area, only a single band is allocated. The height of such a band must be a power of 2 for cheap indexing, but the final value can be adjusted based on its width so the whole area never gets larger than what fits into a processor cache, or some fraction of it. With banding, the initial memory requirements for 4K resolution drop to 128kB, 256kB, and 512kB for bands with a height of 8, 16, and 32 pixels, respectively. Such memory requirements are acceptable and should be considered a worst-case scenario, as not all rendering happens on a 4K framebuffer.
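The band sizes quoted above can be verified with a one-line calculation, assuming one 32-bit cell per pixel of the band:

```cpp
#include <cstddef>
#include <cstdint>

// Per-band cell storage for a dense cell buffer. 'bandHeight' is a power
// of 2 so the band index of a scanline is just a shift of its y coordinate.
static size_t bandStorageBytes(uint32_t width, uint32_t bandHeight) {
  return size_t(width) * bandHeight * sizeof(uint32_t);
}
```

For a 4096-pixel-wide target this yields 128kB, 256kB, and 512kB for band heights 8, 16, and 32, matching the figures in the text.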
More ideas were initially considered, including tiling; however, banding seemed like a much better idea, and most importantly, banding keeps the data for each scanline contiguous, which makes pipelines simpler.
A dense cell buffer alone is not enough to improve the performance of the rasterizer. The problem is that without knowing where the altered cells are, the pipeline would have to check all cells, which would become very slow with an increasing destination size. To solve this problem a shadow bit buffer is used, in which each bit represents N cells in the cell buffer. If a bit is set, it means that those cells were altered by the rasterizer and must be processed by the pipeline. If a bit is zero, it means that all cells the bit refers to are also zero.
It was measured that the ideal number of cells represented by a single bit in the shadow bit buffer is 4 or 8, depending on the complexity of the compositing operation and the source style. At the moment Blend2D uses 4 cells per bit, which means that a single 64-bit machine word can be used to mark 256 cells in the cell buffer. Since every modern CPU provides an instruction for counting trailing zero bits in a machine word, the pipeline can use it to find exactly where the non-zero cells are. Zero bits describe empty cells, which are skipped, and set bits describe dirty cells, which are processed.
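A sketch of this bit-scanning loop, using the GCC/Clang __builtin_ctzll intrinsic as a stand-in for the CPU's trailing-zero-count instruction; the 4-cells-per-bit ratio matches the text, everything else is illustrative:

```cpp
#include <cstdint>
#include <vector>

// Returns the group indices of dirty cells marked in one bit-buffer word.
// A set bit at index i means cells [i*4 .. i*4+3] were altered.
static std::vector<uint32_t> dirtyGroups(uint64_t bits) {
  std::vector<uint32_t> groups;
  while (bits) {
    uint32_t i = (uint32_t)__builtin_ctzll(bits); // index of lowest set bit
    groups.push_back(i);
    bits &= bits - 1;                             // clear the lowest set bit
  }
  return groups;
}
```

Runs of zero bits, and therefore runs of empty cells, are skipped without ever being touched.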
The cell buffer and bit buffer must be initially empty, otherwise the rasterizer would have to clear them before each rasterization. Blend2D provides a specialized allocator (BLZeroAllocator) that is used to allocate zeroed memory (like calloc() does), but the memory released back to it must also be zeroed. This allows the allocator to hand out such memory quickly, but requires its users to clear the memory before releasing it. On its own this is not yet an improvement, as it doesn't really matter whether the memory is zeroed at the beginning or at the end of rasterization.
What if the memory is zeroed by the pipeline itself? It processes the buffers and knows exactly where the data is, so it can also clear it quickly after processing. This is exactly what pipelines do: they scan the bit buffer to find where the altered cells are, then zero those bits and cells after using them. The operation has almost no overhead, because the CPU has to read that memory anyway, so one additional write to it is practically free while the memory is still in cache.
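The consume-and-clear pattern can be sketched as follows; the names, the returned accumulator (standing in for real compositing work), and the 4-cells-per-bit ratio are illustrative, but the structure mirrors what the text describes:

```cpp
#include <cstdint>

// Walks one band's bit-buffer words, consumes the dirty cells each set bit
// points at, and zeroes both bits and cells in the same pass, so the
// buffers go back to BLZeroAllocator-style reuse already cleared.
// Uses the GCC/Clang __builtin_ctzll intrinsic.
static uint64_t consumeBand(uint64_t* bitWords, uint32_t wordCount,
                            uint32_t* cells) {
  uint64_t accumulated = 0;               // stands in for real pixel work
  for (uint32_t w = 0; w < wordCount; w++) {
    uint64_t bits = bitWords[w];
    bitWords[w] = 0;                      // clear bits as we consume them
    while (bits) {
      uint32_t i = (uint32_t)__builtin_ctzll(bits);
      bits &= bits - 1;
      uint32_t base = (w * 64 + i) * 4;   // 4 cells per bit
      for (uint32_t c = 0; c < 4; c++) {
        accumulated += cells[base + c];   // "use" the cell
        cells[base + c] = 0;              // zero the cell after use
      }
    }
  }
  return accumulated;
}
```

Because the cells were just read, the zeroing writes hit cache lines that are already resident, which is why the clearing is nearly free.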