BLEND 2D

next generation 2d vector graphics engine

WHY NEXT GENERATION?

Dynamic pipelines! Blend2D generates its rendering pipelines just-in-time, and this section explains why dynamic pipeline generation rocks and how it can outperform static pipelines. In addition, we try to explain why dynamic pipelines can change the way 2D graphics is rendered. The revolution that happened in the 3D segment with the move to programmable pipelines and shaders can also happen in 2D, and Blend2D is the first engine that implements these new and challenging ideas.

Blend2D is not just about dynamic pipelines. The library has been engineered from the ground up for high-performance rendering, and every module has been carefully designed to deliver maximum performance while still offering a nice and intuitive C++ API. Memory management, deferred rendering, and multithreaded rendering were also challenges that deserve attention, although they are probably not as significant as dynamic pipeline construction.

RENDERING PIPELINE

A rendering pipeline in 2D graphics is the set of steps required to render a given command into the destination buffer. A software-based 2D rendering engine typically contains a set of fixed pipelines that execute these steps one after another. The last step is usually composition, which combines the result of the previous step with the backdrop (the destination buffer).

Static vs. Dynamic

There are fundamentally two distinct types of rendering pipelines:

  • Static pipeline - generated at compile time and never changed during the application's lifetime.
  • Dynamic pipeline - generated by the application at runtime; it can be destroyed when it's no longer needed.

Static pipelines are usually used by software-based rendering engines (2D or 3D), while dynamic pipelines are usually used by rendering engines that run on a GPU - these engines are often designed for rendering 3D graphics, not 2D. The reason why software-based rendering typically uses static pipelines is complexity - it's fairly easy to implement a static pipeline, but very complex to implement a dynamic one. Dynamic pipelines make sense on a GPU because the compilation is usually performed behind the scenes by graphics drivers and tools, but no such concept exists in software-based rendering (except when using tools like OpenCL).
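The difference between the two pipeline types can be sketched in plain C++. This is only an illustration under stated assumptions: the names (Pipeline, PipelineCache, staticFillWhite) are hypothetical, a closure stands in for JIT-compiled machine code, and the "signature" is simply the fill color.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical stand-in for a compiled pipeline: fills `count` pixels.
using Pipeline = std::function<void(uint32_t* dst, int count)>;

// Static pipeline: part of the binary from compile time onward.
inline void staticFillWhite(uint32_t* dst, int count) {
  assert(dst != nullptr || count == 0);
  for (int i = 0; i < count; i++) dst[i] = 0xFFFFFFFFu;
}

// Dynamic pipelines: created on first use, keyed by a signature, and
// destroyable at any time. A closure stands in for generated code here.
class PipelineCache {
public:
  const Pipeline& get(uint32_t signature) {
    auto it = cache_.find(signature);
    if (it == cache_.end()) {
      // "Compile" a pipeline specialized for this signature. In this toy
      // model the signature is just the solid color to fill with.
      it = cache_.emplace(signature, [signature](uint32_t* dst, int count) {
        for (int i = 0; i < count; i++) dst[i] = signature;
      }).first;
    }
    return it->second;
  }

  // Destroys a pipeline that is not needed anymore.
  void release(uint32_t signature) { cache_.erase(signature); }

  std::size_t size() const { return cache_.size(); }

private:
  std::unordered_map<uint32_t, Pipeline> cache_;
};
```

The second request for the same signature returns the cached pipeline, so the construction cost is paid only once per signature.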

Figure 1: Overview of Static Rendering Pipeline (2D)

Most software-based 2D rendering engines use static pipelines that contain at least fill, fetch, and composition steps (Figure 1). It's probably the most portable solution and it has been well tested by the industry. Unfortunately, such a pipeline design is not optimal in terms of CPU utilization, as will be explained later.

Rendering Steps

A static pipeline usually executes rendering steps sequentially. Here is a list of the most common steps in 2D rendering:

  • Fill & Clip - fill operation that defines the shape to be filled. The most common fill operations are fills of special shapes (like axis-aligned rectangles), strokes of special shapes (like hairline stroking), and generic fills that consume data produced by a polygon rasterizer. Clip defines how the content is clipped, if clipping is enabled. The simplest clip operation is clipping to an axis-aligned rectangle, but several other clip operations exist and are implemented by 2D engines. Clipping is often performed together with the Fill step for performance reasons.
  • Fetch - fetches source pixels at the given coordinates - solid color, gradient, or pattern.
  • Composite - performs a composition of source pixels and destination (backdrop) pixels defined by composition and blending operators.

The steps described above are not independent of each other. Each step requires data from the previous step(s), which is typically solved by using one or more temporary buffers, as illustrated in Figure 1. It's also obvious that not all of these steps have to be executed for every rendering command. Avoiding temporary buffers and bypassing some steps are the most common optimizations that make 2D rendering faster.
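The buffer hand-off between steps can be sketched in a few lines of C++. This is a minimal illustration, not code from any real engine: the function names (fillStep, fetchStep, compositeStep, renderSpanStatic), the single 8-bit gray channel, and the rounding formula are assumptions chosen for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Step 1 (Fill): writes one coverage value (0..255) per pixel of the span
// into a temporary mask buffer.
void fillStep(std::vector<uint8_t>& mask, std::size_t count, uint8_t coverage) {
  mask.assign(count, coverage);
}

// Step 2 (Fetch): writes source pixels into a second temporary buffer.
// Only a solid gray source is modeled here.
void fetchStep(std::vector<uint8_t>& src, std::size_t count, uint8_t gray) {
  src.assign(count, gray);
}

// Step 3 (Composite): reads both temporary buffers and blends into dst.
// `dst` must point to at least src.size() pixels.
void compositeStep(uint8_t* dst, const std::vector<uint8_t>& src,
                   const std::vector<uint8_t>& mask) {
  assert(src.size() == mask.size());
  for (std::size_t i = 0; i < src.size(); i++) {
    unsigned s = src[i], d = dst[i], m = mask[i];
    // Linear interpolation between dst and src by coverage, rounded.
    dst[i] = static_cast<uint8_t>((s * m + d * (255u - m) + 127u) / 255u);
  }
}

// The static pipeline: each step runs to completion before the next one
// starts, and every pixel travels through memory twice on its way to dst.
void renderSpanStatic(uint8_t* dst, std::size_t count, uint8_t gray,
                      uint8_t coverage) {
  std::vector<uint8_t> mask, src;
  fillStep(mask, count, coverage);
  fetchStep(src, count, gray);
  compositeStep(dst, src, mask);
}
```

For a solid fill with full coverage the two temporary buffers carry no information at all, which is exactly the kind of work that bypassing steps is meant to avoid.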

Fast-Path Optimizations

2D engines use several techniques to address some of the issues mentioned above. One of the most successful is called specialization. It's basically about implementing the most common cases as fast-paths (i.e. special cases that are used instead of the generic one). The best-known fast-paths are axis-aligned rectangle fills, solid color fills, and inlined source-over composition.

Fast-paths can increase performance dramatically, but they also have limitations - not everything can be a fast-path. This may sound like a minor issue as long as the most common cases are covered, but it's generally difficult to explain the concept of fast-paths to library consumers who don't know the library's internals. Different 2D engines will most probably implement different fast-paths, and what is a fast-path in one engine is not necessarily a fast-path in another. In other words, fast-paths are implementation-dependent optimizations and library consumers shouldn't have to worry about them too much.

Fast-paths can also increase the compiled size of the library and complicate render dispatching. The questions 'is this a fast-path?' and 'which function should I call for this combination?' have to be answered, and answering them requires the runtime to iterate over all available fast-paths or to use some kind of lookup table. If implemented the wrong way, the dispatching itself can waste a significant number of CPU cycles.
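A lookup-table dispatcher of the kind mentioned above might look like the following sketch. All names (FillType, CompOp, the three render functions, dispatch) are hypothetical, and a real engine would have far more dimensions than the two shown here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical command properties used to select a rendering function.
enum class FillType : unsigned { Rect = 0, Analytic = 1 };
enum class CompOp : unsigned { SrcCopy = 0, SrcOver = 1 };
constexpr std::size_t kFillTypeCount = 2;
constexpr std::size_t kCompOpCount = 2;

using RenderFunc = const char* (*)();

// Placeholders for real rendering functions; each returns its own name.
const char* genericPath() { return "generic"; }
const char* rectSrcCopyFast() { return "rect + src-copy fast-path"; }
const char* rectSrcOverFast() { return "rect + src-over fast-path"; }

// One slot per (fill, compOp) combination. Combinations without a
// dedicated fast-path fall back to the generic function.
const std::array<RenderFunc, kFillTypeCount * kCompOpCount> kDispatchTable = {
  rectSrcCopyFast,  // Rect     + SrcCopy
  rectSrcOverFast,  // Rect     + SrcOver
  genericPath,      // Analytic + SrcCopy
  genericPath       // Analytic + SrcOver
};

// Constant-time dispatch: a single index computation and one table load.
RenderFunc dispatch(FillType fill, CompOp op) {
  std::size_t f = static_cast<std::size_t>(fill);
  std::size_t o = static_cast<std::size_t>(op);
  assert(f < kFillTypeCount && o < kCompOpCount);
  return kDispatchTable[f * kCompOpCount + o];
}
```

The table grows multiplicatively with every property that participates in dispatching, which is one way fast-paths inflate the library.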

SIMD Optimizations

Another way of improving performance of rendering is taking advantage of SIMD instruction sets, if available. The most common SIMD instruction sets used by 2D engines today are SSE2 up to SSE4.1 (X86/X64), and NEON (ARM). SIMD is a general concept that is used to optimize a single step in the pipeline. Typically SIMD optimizations operate on 2 or more pixels at a time (depends on SIMD register width), which executes much faster than a scalar pixel-based manipulation.

The most complicated CPU architecture for utilizing SIMD is probably X86. Every new CPU generation brings new instructions that may be useful in 2D rendering, especially the extensions that came after SSE2 - SSE3, SSSE3, and SSE4.1. However, libraries must maintain backward compatibility, and SIMD optimizations are complementary to the C/C++ implementation. This means that library authors have to decide very carefully which SIMD instruction set(s) to use and maintain, weighing backward compatibility against the theoretical performance gain. The most commonly considered instruction set on X86 is SSE2. Instruction sets like SSE3, SSSE3, and SSE4.1 don't offer as much additional gain as SSE2 does (compare the gain of SSE2 over the C baseline with the gain of SSE4.1 over SSE2), so they are often left out to keep the resulting binaries smaller (basically the same issue as discussed for fast-paths).

SIMD optimized code also often requires data to be aligned to 8, 16, or 32 bytes. Small loops need to be executed before and after the main loop to process the pixels that the main loop cannot handle. This leading and trailing code usually increases the library size dramatically, as it has to be present before and after each SIMD optimized loop. There are techniques to minimize this code; for example, Blend2D merges the leading and trailing loops into a single loop.
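How a span gets divided between the leading, main, and trailing loops can be shown with a small helper. This is a sketch under stated assumptions: the names (SpanSplit, splitSpan) are hypothetical, pixels are 32 bits wide, and the defaults model 16-byte aligned stores that process 4 pixels per iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of pixels handled by each of the three loops.
struct SpanSplit {
  std::size_t lead;  // Pixels processed one by one until dst is aligned.
  std::size_t body;  // Pixels processed by the aligned SIMD main loop.
  std::size_t tail;  // Remaining pixels, too few for one SIMD iteration.
};

// Splits `count` 32-bit pixels starting at address `addr`.
SpanSplit splitSpan(std::uintptr_t addr, std::size_t count,
                    std::size_t alignment = 16, std::size_t pixelsPerIter = 4) {
  assert(addr % 4 == 0);                // Pixels are at least 4-byte aligned.
  assert(alignment % 4 == 0 && alignment != 0);
  assert(pixelsPerIter != 0);

  std::size_t misalign = addr % alignment;
  std::size_t lead = misalign ? (alignment - misalign) / 4 : 0;
  if (lead > count) lead = count;       // Span ends before alignment is reached.

  std::size_t rest = count - lead;
  std::size_t body = rest - rest % pixelsPerIter;
  return SpanSplit{lead, body, rest - body};
}
```

The leading and trailing parts run the same one-pixel-at-a-time code, which is what makes merging them into a single loop possible.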

BLEND 2D PIPELINE

Blend2D itself doesn't have any pipelines built in. It compiles them just-in-time (JIT) using a code generator based on the asmjit library. This approach has several advantages over more conventional static pipelines:

  • Temporary buffers are completely eliminated as the generated pipeline has everything inlined. CPU registers are used to exchange payload between two or more inlined rendering steps.
  • Everything is a fast-path. If any specialization is applicable, it applies to all pipelines. For example, if a solid fill is a special case in Blend2D's pipeline generator, it is applied to all generated pipelines that perform solid fills.
  • The best possible SIMD optimizations are always used by default. If the host CPU supports SSE4.1, it will be used in all generated pipelines where it leads to any improvement. This means that the code generator can replace, for example, 2 or 3 SSE2 instructions with a single SSE4.1 instruction to make the resulting code run faster.
  • The code generator can reorder instructions to maximize instruction throughput. This means that some steps of the pipeline can execute in parallel if there are few or no dependencies between them (mostly SIMD) and there are enough CPU registers.
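The SIMD point in the list above can be illustrated with a toy instruction selector. It is a sketch only: CpuFeatures and emitLoadUnpack are hypothetical names, the output is plain text rather than machine code, and the SSE2 sequence shown is one common way to widen bytes to words, not necessarily the one Blend2D emits.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of the host CPU, detected once at startup.
struct CpuFeatures {
  bool sse4_1;
};

// Emits text for "load 8 bytes from memory and zero-extend them into
// eight 16-bit lanes of `dst`". `zero` names a scratch register that is
// only needed by the SSE2 fallback.
std::vector<std::string> emitLoadUnpack(const CpuFeatures& cpu,
                                        const std::string& dst,
                                        const std::string& zero,
                                        const std::string& mem) {
  assert(!dst.empty() && !zero.empty() && !mem.empty());
  std::vector<std::string> out;
  if (cpu.sse4_1) {
    // SSE4.1: one instruction loads and widens.
    out.push_back("pmovzxbw " + dst + ", qword ptr " + mem);
  } else {
    // SSE2: load, clear a scratch register, interleave with zeros.
    out.push_back("movq " + dst + ", qword ptr " + mem);
    out.push_back("pxor " + zero + ", " + zero);
    out.push_back("punpcklbw " + dst + ", " + zero);
  }
  return out;
}
```

Because the choice is made while the pipeline is being generated, the emitted code contains no runtime feature checks at all.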

The pipeline generator is called SWPipe and it's fairly simple. It's based on asmjit's Compiler infrastructure, which is used to emit machine instructions directly without any kind of intermediate language. This is a disadvantage in terms of porting to other architectures (only x86/x64 is targeted at the moment). However, using architecture-dependent instructions directly makes it possible to create fully optimized building blocks, and it also helps to construct pipelines faster as there is no abstract representation to translate to machine code. The only abstraction used to make the code generation simpler is the register allocator provided by asmjit.

Pipeline Parts

The SWPipe engine contains building blocks that are called SWParts. Each part implements a basic concept, which can emit a block of code based on its type and configuration. A part can be connected with another part and can invoke its code generation more than once. The generated pipeline is basically a composition of multiple parts that is compatible with the requested pipeline signature.

The reasons for using parts were abstraction and reusability. For example, at the moment Blend2D doesn't need special pipelines, but in the future it will be possible to construct specialized pipelines that can, for example, blend two images together and then blend the result with the backdrop. Windowing systems could generate pipelines for cross-fading multiple widgets (aka buffers) together, and so on. Use cases are basically limited only by the number of CPU registers and the theoretical gains.

The following parts of the pipeline exist:

  • Fill - produces spans of fixed (cMask) or variant (vMask) alpha.
  • Fetch - implements pixel fetching; it's responsible for generating pixels that are consumed by the compositor. The Fetch part is always invoked by the Composition part.
  • Composition - implements pixel composition and blending; it's responsible for combining two pixels by taking into account the current alpha coverage (produced by the Fill part). The Composition part invokes the Fetch part and is always invoked by the Fill part.

All parts are first constructed by the pipeline generator (SWPipeGen) based on the given signature, which is always unique. Afterwards, these parts are connected together and the code generator invokes the Fill part, which then calls all connected parts at least once. After all parts have finished, the code generator calls asmjit to finalize the function. Finalizing means allocating variables to registers and inserting the prolog and epilog (called PEI in LLVM terminology).
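The invocation order described above (Fill invokes Composition, which invokes Fetch) can be modeled with three small structs that append text instead of machine code. Everything here is hypothetical: the type names, the emit() interface, and the emitted strings are illustrations and do not reflect the real SWPart classes or asmjit's API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for a stream of machine instructions.
using Code = std::vector<std::string>;

// Fetch part: produces source pixels for the compositor.
struct FetchPart {
  void emit(Code& out) const { out.push_back("fetch: load solid source pixel"); }
};

// Composition part: invokes Fetch, then blends with the destination.
struct CompPart {
  const FetchPart& fetch;
  void emit(Code& out) const {
    fetch.emit(out);
    out.push_back("comp: blend source with destination");
  }
};

// Fill part: owns the span loop and invokes Composition inside it.
struct FillPart {
  const CompPart& comp;
  void emit(Code& out) const {
    out.push_back("fill: begin span loop");
    comp.emit(out);
    out.push_back("fill: end span loop");
  }
};

// Constructs the parts, connects them, invokes Fill, and "finalizes" the
// function by wrapping the body in a prolog and an epilog.
Code generatePipeline() {
  FetchPart fetch;
  CompPart comp{fetch};
  FillPart fill{comp};

  Code body;
  fill.emit(body);
  assert(!body.empty());

  Code code;
  code.push_back("prolog");
  code.insert(code.end(), body.begin(), body.end());
  code.push_back("epilog");
  return code;
}
```

Since the parts emit into one stream, the result is a single function with the fetch and composition code inlined into the fill loop.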

Design Gotchas

So what's the catch? We have a pipeline generator that can construct arbitrary pipelines on-the-fly. However, there is an issue related to the removal of temporary buffers - only a limited number of registers can be used to exchange the payload between the various parts of the pipeline. If the number of registers required by all parts is greater than the number of registers available, the code generator has to compensate for that. It usually spills some registers (spilling means moving a variable from a register to memory) in order to map different variables to the same register. However, a high number of spills in the generated code, especially in the hottest paths, can cancel out the advantage gained by removing temporary buffers.

The situation gets worse when generating pipelines for the 32-bit X86 architecture, which has only 7 usable 32-bit general purpose registers. These registers are very easily used up by parts that need them to hold pointers to their core data structures.

From the code generation perspective we differentiate between two main categories of registers:

  • GP registers - general purpose registers that can be used to perform arithmetic and logical operations. In addition, general purpose registers are used to hold pointers and they can be used to dereference them (aka read from memory or write into it).
  • SIMD registers - these registers are designed to work with vectors.

Each architecture usually contains both register sets, as shown in the table below:

Arch | Usable GP Regs | Usable SIMD Regs
X86  | 7 GP (32-bit)  | 8 MMX, 8 XMM/YMM
X64  | 15 GP (64-bit) | 8 MMX, 16 XMM/YMM

MMX, XMM, YMM, and ZMM registers require MMX+, SSE/SSE2+, AVX/AVX2+, and AVX-512+ instruction sets, respectively.

As can be seen above, the 32-bit x86 architecture doesn't offer much when it comes to registers. Blend2D deals with this problem by using MMX and SSE2 registers at the same time, so it requires at least an SSE2-capable CPU to operate. This requirement is actually very conservative in terms of software design and still allows Blend2D to run on CPUs that are more than 15 years old.

In addition to the aggressive use of MMX and SSE2 registers, the pipeline generator limits the maximum number of pixels that can be processed at the same time to 4 if running on a 32-bit X86 CPU.

Registers Used by Individual Parts

Various parts of the pipeline need to perform various operations that are often better done by specific registers (GP/MMX/SSE2). Blend2D applies the following rules for each defined part of the pipeline:

  • Fill - uses mostly GP registers to hold pointers to the destination buffer and data structures used by the rasterizer. Fills can also take advantage of XMM registers to calculate pixel coverages, but only the analytic filler does it.
  • Fetch - uses mostly MMX and SSE2 registers to hold the current state and to iterate over it. GP registers should only be used to hold pointers and for temporary indexes that are needed to read pixels from a gradient table or image.
  • Composition - uses SSE2 registers exclusively for pixel blending. It doesn't use MMX registers, so they remain available to the Fill and Fetch parts; instead it makes very aggressive use of SSE2 registers.

... TO BE CONTINUED - this article is a work in progress ...

JIT DUMPS

This section is for people who are curious about the assembly generated by Blend2D. The number of pipelines shown has been reduced to make them easier to look at.

The dots combined with letters at the end of each comment represent the liveness analysis that comes from the asmjit library:

  • . - Variable is alive, but not used by the instruction.
  • r - Variable is read.
  • w - Variable is overwritten.
  • x - Variable is read and written.
  • R - Variable is read and becomes dead afterwards.

FillRect (SourceOver/Solid)


    ; Signature 0x00000809 Dst{Fmt=1} Src{Fmt=1} Op{Fill=0 CompOp=1} Fetch{Type=0 Extra=0}
    L0:                                     ;                                      [....                ]
    push rbx                                ;                                      [                    ]
    push rbp                                ;                                      [                    ]
    push rsi                                ;                                      [                    ]
    push r12                                ;                                      [                    ]
    movaps oword ptr [rsp-40], xmm6         ;                                      [                    ]
    movaps oword ptr [rsp-24], xmm7         ;                                      [                    ]
    mov r12, 8791440519392                  ; mov swConstPool, 8791440519392       [...w                ]
    movaps xmm0, [r12-80]                   ; movaps xmm.u16_128, [swConstPool-80] [...rw               ]
    movaps xmm1, [r12-32]                   ; movaps xmm.u16_257, [swConstPool-32] [...r.w              ]
    mov rax, qword ptr [rcx+56]             ; mov dPtr, [worker+56]                [r.....w             ]
    mov ebx, dword ptr [rdx+60]             ; mov y, [ras+60]                      [.r.....w            ]
    mov rbp, rax                            ; mov dStride, dPtr                    [......r.w           ]
    mov esi, dword ptr [rdx+56]             ; mov w, [ras+56]                      [.r.......w          ]
    imul rax, ebx                           ; imul dPtr, y                         [......xr..          ]
    movd xmm2, dword ptr [r8]               ; movd pixel.pc, [fetchData]           [..R.......w         ]
    pshufd xmm2, xmm2, 0                    ; pshufd pixel.pc, pixel.pc, 0         [.. .......x         ]
    pshuflw xmm3, xmm2, 85                  ; pshuflw pixel.uia, pixel.pc, 85      [.. .......rw        ]
    pshufd xmm3, xmm3, 68                   ; pshufd pixel.uia, pixel.uia, 68      [.. ........x        ]
    psrlw xmm3, 8                           ; psrlw pixel.uia, 8                   [.. ........x        ]
    pxor xmm3, [r12-64]                     ; pxor pixel.uia, [swConstPool-64]     [.. r.......x        ]
    neg ebx                                 ; neg y                                [.. ....x....        ]
    lea rax, qword ptr [rax+rsi*4]          ; lea dPtr, [dPtr+w*4]                 [.. ...x..r..        ]
    neg esi                                 ; neg w                                [.. ......x..        ]
    add rax, qword ptr [rcx+40]             ; add dPtr, [worker+40]                [R. ...x.....        ]
    add esi, dword ptr [rdx+64]             ; add w, [ras+64]                      [ r ......x..        ]
    prefetch [rax], 1                       ; prefetch [dPtr], 1                   [ . ...r.....        ]
    lea ecx, [0+esi*4]                      ; lea x, [0+w*4]                       [ . ......r..w       ]
    add ebx, dword ptr [rdx+68]             ; add y, [ras+68]                      [ R ....x.....       ]
    sub rbp, ecx                            ; sub dStride, x                       [   .....x...R       ]
    L3:                                     ;                                      [   .........        ]
    prefetch [rax], 1                       ; prefetch [dPtr], 1                   [   ...r.....        ]
    mov ecx, esi                            ; mov x, w                             [   ......r..w       ]
    test al, 15                             ; test dPtr, 15                        [   ...r......       ]
    jz L8                                   ; jz L8                                [   ..........       ]
    L4:                                     ;                                      [   ..........       ]
    movd xmm4, [rax]                        ; movd c0, [dPtr]                      [   ...r......w      ]
    pmovzxbw xmm4, xmm4                     ; pmovzxbw c0, c0                      [   ..........x      ]
    pmullw xmm4, xmm3                       ; pmullw c0, pixel.uia                 [   ........r.x      ]
    paddw xmm4, xmm0                        ; paddw c0, xmm.u16_128                [   .r........x      ]
    pmulhuw xmm4, xmm1                      ; pmulhuw c0, xmm.u16_257              [   ..r.......x      ]
    packuswb xmm4, xmm4                     ; packuswb c0, c0                      [   ..........x      ]
    paddd xmm4, xmm2                        ; paddd c0, pixel.pc                   [   .......r..x      ]
    movd dword ptr [rax], xmm4              ; movd [dPtr], c0                      [   ...r......R      ]
    sub ecx, 1                              ; sub x, 1                             [   .........x       ]
    lea rax, [rax+4]                        ; lea dPtr, [dPtr+4]                   [   ...x......       ]
    jz L9                                   ; jz L9                                [   ..........       ]
    test al, 15                             ; test dPtr, 15                        [   ...r......       ]
    short jnz L4                            ; jnz L4                               [   ..........       ]
    L8:                                     ;                                      [   ..........       ]
    cmp ecx, 4                              ; cmp x, 4                             [   .........r       ]
    short jb L4                             ; jb L4                                [   ..........       ]
    sub ecx, 8                              ; sub x, 8                             [   .........x       ]
    js L6                                   ; js L6                                [   ..........       ]
    L5:                                     ;                                      [   ..........       ]
    pmovzxbw xmm4, qword ptr [rax]          ; pmovzxbw c0, [dPtr]                  [   ...r...... w     ]
    pmovzxbw xmm5, qword ptr [rax+8]        ; pmovzxbw c1, [dPtr+8]                [   ...r...... .w    ]
    pmovzxbw xmm6, qword ptr [rax+16]       ; pmovzxbw c2, [dPtr+16]               [   ...r...... ..w   ]
    pmovzxbw xmm7, qword ptr [rax+24]       ; pmovzxbw c3, [dPtr+24]               [   ...r...... ...w  ]
    pmullw xmm4, xmm3                       ; pmullw c0, pixel.uia                 [   ........r. x...  ]
    pmullw xmm5, xmm3                       ; pmullw c1, pixel.uia                 [   ........r. .x..  ]
    pmullw xmm6, xmm3                       ; pmullw c2, pixel.uia                 [   ........r. ..x.  ]
    pmullw xmm7, xmm3                       ; pmullw c3, pixel.uia                 [   ........r. ...x  ]
    paddw xmm4, xmm0                        ; paddw c0, xmm.u16_128                [   .r........ x...  ]
    paddw xmm5, xmm0                        ; paddw c1, xmm.u16_128                [   .r........ .x..  ]
    pmulhuw xmm4, xmm1                      ; pmulhuw c0, xmm.u16_257              [   ..r....... x...  ]
    paddw xmm6, xmm0                        ; paddw c2, xmm.u16_128                [   .r........ ..x.  ]
    pmulhuw xmm5, xmm1                      ; pmulhuw c1, xmm.u16_257              [   ..r....... .x..  ]
    paddw xmm7, xmm0                        ; paddw c3, xmm.u16_128                [   .r........ ...x  ]
    pmulhuw xmm6, xmm1                      ; pmulhuw c2, xmm.u16_257              [   ..r....... ..x.  ]
    pmulhuw xmm7, xmm1                      ; pmulhuw c3, xmm.u16_257              [   ..r....... ...x  ]
    packuswb xmm4, xmm5                     ; packuswb c0, c1                      [   .......... xR..  ]
    packuswb xmm6, xmm7                     ; packuswb c2, c3                      [   .......... . xR  ]
    paddd xmm4, xmm2                        ; paddd c0, pixel.pc                   [   .......r.. x .   ]
    paddd xmm6, xmm2                        ; paddd c2, pixel.pc                   [   .......r.. . x   ]
    movaps [rax], xmm4                      ; movaps [dPtr], c0                    [   ...r...... R .   ]
    movaps [rax+16], xmm6                   ; movaps [dPtr+16], c2                 [   ...r......   R   ]
    add rax, 32                             ; add dPtr, 32                         [   ...x......       ]
    sub ecx, 8                              ; sub x, 8                             [   .........x       ]
    short jns L5                            ; jns L5                               [   ..........       ]
    L6:                                     ;                                      [   ..........       ]
    add ecx, 4                              ; add x, 4                             [   .........x       ]
    js L7                                   ; js L7                                [   ..........       ]
    pmovzxbw xmm4, qword ptr [rax]          ; pmovzxbw c0, [dPtr]                  [   ...r......     w ]
    pmovzxbw xmm5, qword ptr [rax+8]        ; pmovzxbw c1, [dPtr+8]                [   ...r......     .w]
    pmullw xmm4, xmm3                       ; pmullw c0, pixel.uia                 [   ........r.     x.]
    pmullw xmm5, xmm3                       ; pmullw c1, pixel.uia                 [   ........r.     .x]
    paddw xmm4, xmm0                        ; paddw c0, xmm.u16_128                [   .r........     x.]
    paddw xmm5, xmm0                        ; paddw c1, xmm.u16_128                [   .r........     .x]
    pmulhuw xmm4, xmm1                      ; pmulhuw c0, xmm.u16_257              [   ..r.......     x.]
    pmulhuw xmm5, xmm1                      ; pmulhuw c1, xmm.u16_257              [   ..r.......     .x]
    packuswb xmm4, xmm5                     ; packuswb c0, c1                      [   ..........     xR]
    paddd xmm4, xmm2                        ; paddd c0, pixel.pc                   [   .......r..     x ]
    movaps [rax], xmm4                      ; movaps [dPtr], c0                    [   ...r......     R ]
    add rax, 16                             ; add dPtr, 16                         [   ...x......       ]
    sub ecx, 4                              ; sub x, 4                             [   .........x       ]
    L7:                                     ;                                      [   ..........       ]
    add ecx, 4                              ; add x, 4                             [   .........x       ]
    jnz L4                                  ; jnz L4                               [   ..........       ]
    L9:                                     ;                                      [   .........        ]
    add rax, rbp                            ; add dPtr, dStride                    [   ...x.r...        ]
    sub ebx, 1                              ; sub y, 1                             [   ....x....        ]
    jnz L3                                  ; jnz L3                               [   .........        ]
    movdqa [rsp-56], xmm3                   ; [Spill] pixel.uia                    [                    ]
    L1:                                     ;                                      [                    ]
    movaps xmm6, oword ptr [rsp-40]         ;                                      [                    ]
    movaps xmm7, oword ptr [rsp-24]         ;                                      [                    ]
    pop r12                                 ;                                      [                    ]
    pop rsi                                 ;                                      [                    ]
    pop rbp                                 ;                                      [                    ]
    pop rbx                                 ;                                      [                    ]
    ret                                     ;                                      [                    ]
    ; Function size: 403 (bytes)
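
For readers who prefer C++ to assembly, the per-pixel arithmetic of the loop at L4 above can be written as scalar code. This is my reading of the dump, not Blend2D source: the names div255 and srcOverSolid are hypothetical, the pixel layout is assumed to be premultiplied 32-bit with alpha in the top byte, and the constant at [swConstPool-64] is assumed to hold 0x00FF in every 16-bit lane so that the pxor yields 255 - alpha.

```cpp
#include <cassert>
#include <cstdint>

// Rounded division by 255 for x in [0, 255 * 255]. It mirrors the
// paddw (add 128) and pmulhuw (multiply by 257, keep the high 16 bits)
// pair in the dump.
inline uint32_t div255(uint32_t x) {
  assert(x <= 255u * 255u);
  return ((x + 128u) * 257u) >> 16;
}

// Source-over of a premultiplied solid color `src` onto `dst`:
//   out = src + dst * (255 - srcAlpha) / 255, computed per 8-bit channel.
inline uint32_t srcOverSolid(uint32_t dst, uint32_t src) {
  uint32_t invAlpha = 255u - (src >> 24);   // pixel.uia in the dump
  uint32_t out = 0;
  for (int shift = 0; shift < 32; shift += 8) {
    uint32_t d = (dst >> shift) & 0xFFu;
    uint32_t s = (src >> shift) & 0xFFu;
    uint32_t v = s + div255(d * invAlpha);  // pmullw, paddw, pmulhuw, paddd
    assert(v <= 255u);                      // Holds when src is premultiplied.
    out |= v << shift;
  }
  return out;
}
```

The generated code performs the same arithmetic on 16-bit lanes, one pixel per iteration at L4 and eight pixels per iteration in the main loop at L5.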
    

AnalyticFill (SrcOver/Solid)


    ; Signature 0x000008C9 Dst{Fmt=1} Src{Fmt=1} Op{Fill=3 CompOp=1} Fetch{Type=0 Extra=0}
    L0:                                     ;                                      [....          .                              ]
    push rbx                                ;                                      [                                             ]
    push rbp                                ;                                      [                                             ]
    push rsi                                ;                                      [                                             ]
    push rdi                                ;                                      [                                             ]
    push r12                                ;                                      [                                             ]
    movaps oword ptr [rsp-64], xmm6         ;                                      [                                             ]
    movaps oword ptr [rsp-48], xmm7         ;                                      [                                             ]
    movaps oword ptr [rsp-32], xmm8         ;                                      [                                             ]
    movaps oword ptr [rsp-16], xmm9         ;                                      [                                             ]
    mov r12, 8791440519392                  ; mov swConstPool, 8791440519392       [...w          .                              ]
    movaps xmm0, [r12-80]                   ; movaps xmm.u16_128, [swConstPool-80] [...rw         .                              ]
    movaps xmm1, [r12-32]                   ; movaps xmm.u16_257, [swConstPool-32] [...r.w        .                              ]
    mov eax, dword ptr [rdx+60]             ; mov y, [ras+60]                      [.r....w       .                              ]
    mov rbx, qword ptr [rdx+152]            ; mov rows, [ras+152]                  [.r.....w      .                              ]
    lea rbx, qword ptr [rbx+rax*8]          ; lea rows, [rows+y*8]                 [......rx      .                              ]
    mov rbp, eax                            ; mov dPtr, y                          [......r.w     .                              ]
    mov rsi, qword ptr [rcx+56]             ; mov dStride, [worker+56]             [r........w    .                              ]
    imul rbp, rsi                           ; imul dPtr, dStride                   [........xr    .                              ]
    add rbp, qword ptr [rcx+40]             ; add dPtr, [worker+40]                [R.......x.    .                              ]
    mov [rsp-80], rsi                       ; [Spill] dStride                      [                                             ]
    mov rcx, qword ptr [rbx]                ; mov chunk, [rows]                    [ ......r..w   .                              ]
    movd xmm2, dword ptr [r8]               ; movd pixel.pc, [fetchData]           [ .R........w  .                              ]
    pshufd xmm2, xmm2, 0                    ; pshufd pixel.pc, pixel.pc, 0         [ . ........x  .                              ]
    pmovzxbw xmm3, xmm2                     ; pmovzxbw pixel.uc, pixel.pc          [ . ........Rw .                              ]
    neg eax                                 ; neg y                                [ . ...x.... . .                              ]
    mov esi, dword ptr [rdx+48]             ; mov xEnd, [ras+48]                   [ r ........ .w.                              ]
    add eax, dword ptr [rdx+68]             ; add y, [ras+68]                      [ R ...x.... ...                              ]
    mov [rsp-72], esi                       ; [Spill] xEnd                         [                                             ]
    test rcx, rcx                           ; test chunk, chunk                    [   .......r ...                              ]
    jz L5                                   ; jz L5                                [   ........ ...                              ]
    L2:                                     ;                                      [   ........ ..                               ]
    mov edx, dword ptr [rcx+16]             ; mov x, [chunk+16]                    [   .......r ..w                              ]
    xor esi, esi                            ; xor cover, cover                     [   ........ ...w                             ]
    lea rbp, qword ptr [rbp+rdx*4]          ; lea dPtr, [dPtr+x*4]                 [   .....x.. ..r.                             ]
    prefetch [rbp], 1                       ; prefetch [dPtr], 1                   [   .....r.. ....                             ]
    L3:                                     ;                                      [   ........ ....                             ]
    mov edi, dword ptr [rcx+20]             ; mov i, [chunk+20]                    [   .......r ....w                            ]
    lea r8, [rcx+24]                        ; lea cell, [chunk+24]                 [   .......r .....w                           ]
    cmp edi, dword ptr [rsp-72]             ; cmp i, [xEnd]                        [   ........ .r..r.                           ]
    mov rcx, qword ptr [rcx+8]              ; mov chunk, [chunk+8]                 [   .......x ......                           ]
    cmova edi, dword ptr [rsp-72]           ; cmova i, [xEnd]                      [   ........ .r..x.                           ]
    sub edi, edx                            ; sub i, x                             [   ........ ..r.x.                           ]
    prefetch [rcx], 1                       ; prefetch [chunk], 1                  [   .......r ......                           ]
    jz L4                                   ; jz L4                                [   ........ ......                           ]
    add edx, edi                            ; add x, i                             [   ........ ..x.r.                           ]
    sub edi, 4                              ; sub i, 4                             [   ........ ....x.                           ]
    js L10                                  ; js L10                               [   ........ ......                           ]
    movd xmm2, esi                          ; movd coverXmm, cover                 [   ........ ...R..w                          ]
    pshufd xmm2, xmm2, 0                    ; pshufd coverXmm, coverXmm, 0         [   ........ ... ..x                          ]
    L9:                                     ;                                      [   ........ ... ...                          ]
    movdqu xmm4, [r8]                       ; movdqu m0, [cell]                    [   ........ ... .r.w                         ]
    movdqu xmm5, [r8+16]                    ; movdqu tmp, [cell+16]                [   ........ ... .r..w                        ]
    pshufd xmm4, xmm4, 216                  ; pshufd m0, m0, 216                   [   ........ ... ...x.                        ]
    pshufd xmm5, xmm5, 216                  ; pshufd tmp, tmp, 216                 [   ........ ... ....x                        ]
    movdqa xmm6, xmm4                       ; movdqa m1, m0                        [   ........ ... ...r.w                       ]
    punpcklqdq xmm4, xmm5                   ; punpcklqdq m0, tmp                   [   ........ ... ...xr.                       ]
    punpckhqdq xmm6, xmm5                   ; punpckhqdq m1, tmp                   [   ........ ... ....rx                       ]
    movdqa xmm5, xmm4                       ; movdqa tmp, m0                       [   ........ ... ...rw.                       ]
    pslldq xmm5, 4                          ; pslldq tmp, 4                        [   ........ ... ....x.                       ]
    paddd xmm4, xmm5                        ; paddd m0, tmp                        [   ........ ... ...xr.                       ]
    pxor xmm5, xmm5                         ; pxor tmp, tmp                        [   ........ ... ....w.                       ]
    punpcklqdq xmm5, xmm4                   ; punpcklqdq tmp, m0                   [   ........ ... ...rx.                       ]
    psrad xmm6, 9                           ; psrad m1, 9                          [   ........ ... .....x                       ]
    paddd xmm4, xmm5                        ; paddd m0, tmp                        [   ........ ... ...xR.                       ]
    psubd xmm2, xmm4                        ; psubd coverXmm, m0                   [   ........ ... ..xr .                       ]
    add r8, 32                              ; add cell, 32                         [   ........ ... .x.. .                       ]
    paddd xmm6, xmm2                        ; paddd m1, coverXmm                   [   ........ ... ..r. x                       ]
    pabsd xmm4, xmm6                        ; pabsd m0, m1                         [   ........ ... ...x R                       ]
    packssdw xmm4, xmm4                     ; packssdw m0, m0                      [   ........ ... ...x                         ]
    pminsw xmm4, [r12-48]                   ; pminsw m0, [swConstPool-48]          [   r....... ... ...x                         ]
    punpcklwd xmm4, xmm4                    ; punpcklwd m0, m0                     [   ........ ... ...x                         ]
    pshufd xmm2, xmm2, 255                  ; pshufd coverXmm, coverXmm, 255       [   ........ ... ..x.                         ]
    pshufd xmm6, xmm4, 250                  ; pshufd m1, m0, 250                   [   ........ ... ...r w                       ]
    pshufd xmm4, xmm4, 80                   ; pshufd m0, m0, 80                    [   ........ ... ...x .                       ]
    pmovzxbw xmm5, qword ptr [rbp]          ; pmovzxbw c0, [dPtr]                  [   .....r.. ... .... .w                      ]
    pmovzxbw xmm7, qword ptr [rbp+8]        ; pmovzxbw c1, [dPtr+8]                [   .....r.. ... .... ..w                     ]
    movaps xmm8, xmm3                       ; movaps p.uc0, pixel.uc               [   ........ r.. .... ...w                    ]
    movaps xmm9, xmm3                       ; movaps p.uc1, pixel.uc               [   ........ r.. .... ....w                   ]
    pmullw xmm8, xmm4                       ; pmullw p.uc0, m0                     [   ........ ... ...R ...x.                   ]
    pmullw xmm9, xmm6                       ; pmullw p.uc1, m1                     [   ........ ... ...  R...x                   ]
    psrlw xmm8, 8                           ; psrlw p.uc0, 8                       [   ........ ... ...   ..x.                   ]
    psrlw xmm9, 8                           ; psrlw p.uc1, 8                       [   ........ ... ...   ...x                   ]
    pshuflw xmm4, xmm8, 255                 ; pshuflw t0, p.uc0, 255               [   ........ ... ...   ..r.w                  ]
    pshuflw xmm6, xmm9, 255                 ; pshuflw t1, p.uc1, 255               [   ........ ... ...   ...r.w                 ]
    pshufhw xmm4, xmm4, 255                 ; pshufhw t0, t0, 255                  [   ........ ... ...   ....x.                 ]
    pshufhw xmm6, xmm6, 255                 ; pshufhw t1, t1, 255                  [   ........ ... ...   .....x                 ]
    pxor xmm4, [r12-64]                     ; pxor t0, [swConstPool-64]            [   r....... ... ...   ....x.                 ]
    pxor xmm6, [r12-64]                     ; pxor t1, [swConstPool-64]            [   r....... ... ...   .....x                 ]
    pmullw xmm5, xmm4                       ; pmullw c0, t0                        [   ........ ... ...   x...R.                 ]
    pmullw xmm7, xmm6                       ; pmullw c1, t1                        [   ........ ... ...   .x.. R                 ]
    paddw xmm5, xmm0                        ; paddw c0, xmm.u16_128                [   .r...... ... ...   x...                   ]
    paddw xmm7, xmm0                        ; paddw c1, xmm.u16_128                [   .r...... ... ...   .x..                   ]
    pmulhuw xmm5, xmm1                      ; pmulhuw c0, xmm.u16_257              [   ..r..... ... ...   x...                   ]
    pmulhuw xmm7, xmm1                      ; pmulhuw c1, xmm.u16_257              [   ..r..... ... ...   .x..                   ]
    paddw xmm5, xmm8                        ; paddw c0, p.uc0                      [   ........ ... ...   x.R.                   ]
    paddw xmm7, xmm9                        ; paddw c1, p.uc1                      [   ........ ... ...   .x R                   ]
    packuswb xmm5, xmm7                     ; packuswb c0, c1                      [   ........ ... ...   xR                     ]
    movups [rbp], xmm5                      ; movups [dPtr], c0                    [   .....r.. ... ...   R                      ]
    add rbp, 16                             ; add dPtr, 16                         [   .....x.. ... ...                          ]
    sub edi, 4                              ; sub i, 4                             [   ........ ... x..                          ]
    jns L9                                  ; jns L9                               [   ........ ... ...                          ]
    movd esi, xmm2                          ; movd cover, coverXmm                 [   ........ ...w..R                          ]
    L10:                                    ;                                      [   ........ ......                           ]
    add edi, 4                              ; add i, 4                             [   ........ ....x.                           ]
    jz L8                                   ; jz L8                                [   ........ ......                           ]
    L7:                                     ;                                      [   ........ ......                           ]
    mov r9d, dword ptr [r8+4]               ; mov msk, [cell+4]                    [   ........ .....r          w                ]
    sub esi, dword ptr [r8]                 ; sub cover, [cell]                    [   ........ ...x.r          .                ]
    sar r9d, 9                              ; sar msk, 9                           [   ........ ......          x                ]
    xor r10d, r10d                          ; xor tmp, tmp                         [   ........ ......          .w               ]
    add r9d, esi                            ; add msk, cover                       [   ........ ...r..          x.               ]
    add r8, 8                               ; add cell, 8                          [   ........ .....x          ..               ]
    sub r10d, r9d                           ; sub tmp, msk                         [   ........ ......          rx               ]
    cmovns r9d, r10d                        ; cmovns msk, tmp                      [   ........ ......          xr               ]
    mov r10d, 256                           ; mov tmp, 256                         [   ........ ......          .w               ]
    cmp r9d, 256                            ; cmp msk, 256                         [   ........ ......          r.               ]
    cmova r9d, r10d                         ; cmova msk, tmp                       [   ........ ......          xR               ]
    movd xmm2, r9d                          ; movd m.xmm, msk                      [   ........ ......          R w              ]
    pshuflw xmm2, xmm2, 0                   ; pshuflw m.xmm, m.xmm, 0              [   ........ ......            x              ]
    movd xmm4, [rbp]                        ; movd c0, [dPtr]                      [   .....r.. ......            .w             ]
    pmovzxbw xmm4, xmm4                     ; pmovzxbw c0, c0                      [   ........ ......            .x             ]
    movaps xmm5, xmm3                       ; movaps p.uc0, pixel.uc               [   ........ r.....            ..w            ]
    pmullw xmm5, xmm2                       ; pmullw p.uc0, m.xmm                  [   ........ ......            R.x            ]
    psrlw xmm5, 8                           ; psrlw p.uc0, 8                       [   ........ ......             .x            ]
    pshuflw xmm2, xmm5, 255                 ; pshuflw a0, p.uc0, 255               [   ........ ......             .rw           ]
    pxor xmm2, [r12-64]                     ; pxor a0, [swConstPool-64]            [   r....... ......             ..x           ]
    pmullw xmm4, xmm2                       ; pmullw c0, a0                        [   ........ ......             x.R           ]
    paddw xmm4, xmm0                        ; paddw c0, xmm.u16_128                [   .r...... ......             x.            ]
    pmulhuw xmm4, xmm1                      ; pmulhuw c0, xmm.u16_257              [   ..r..... ......             x.            ]
    paddw xmm4, xmm5                        ; paddw c0, p.uc0                      [   ........ ......             xR            ]
    packuswb xmm4, xmm4                     ; packuswb c0, c0                      [   ........ ......             x             ]
    movd dword ptr [rbp], xmm4              ; movd [dPtr], c0                      [   .....r.. ......             R             ]
    sub edi, 1                              ; sub i, 1                             [   ........ ....x.                           ]
    lea rbp, [rbp+4]                        ; lea dPtr, [dPtr+4]                   [   .....x.. ......                           ]
    short jnz L7                            ; jnz L7                               [   ........ ......                           ]
    L8:                                     ;                                      [   ........ ....                             ]
    mov edi, dword ptr [rcx+16]             ; mov i, [chunk+16]                    [   .......r ....w                            ]
    cmp edx, edi                            ; cmp x, i                             [   ........ ..r.r                            ]
    je L3                                   ; je L3                                [   ........ .....                            ]
    ja L4                                   ; ja L4                                [   ........ .....                            ]
    sub edi, edx                            ; sub i, x                             [   ........ ..r.x                            ]
    xor r9d, r9d                            ; xor msk, msk                         [   ........ .....                 w          ]
    add edx, edi                            ; add x, i                             [   ........ ..x.r                 .          ]
    sub r9d, esi                            ; sub msk, cover                       [   ........ ...r.                 x          ]
    mov r10d, 256                           ; mov tmp, 256                         [   ........ .....                 .w         ]
    jz L11                                  ; jz L11                               [   ........ .....                 ..         ]
    cmovs r9d, esi                          ; cmovs msk, cover                     [   ........ ...r.                 x.         ]
    cmp r9d, r10d                           ; cmp msk, tmp                         [   ........ .....                 rr         ]
    prefetch [rbp], 1                       ; prefetch [dPtr], 1                   [   .....r.. .....                 ..         ]
    cmova r9d, r10d                         ; cmova msk, tmp                       [   ........ .....                 xR         ]
    movd xmm2, r9d                          ; movd m.xmm, msk                      [   ........ .....                 R w        ]
    pshuflw xmm2, xmm2, 0                   ; pshuflw m.xmm, m.xmm, 0              [   ........ .....                   x        ]
    pshufd xmm2, xmm2, 68                   ; pshufd m.xmm, m.xmm, 68              [   ........ .....                   x        ]
    pshufd xmm2, xmm2, 68                   ; pshufd m.xmm, m.xmm, 68              [   ........ .....                   x        ]
    movaps xmm4, xmm3                       ; movaps p.px, pixel.uc                [   ........ r....                   .w       ]
    pmullw xmm4, xmm2                       ; pmullw p.px, m.xmm                   [   ........ .....                   Rx       ]
    psrlw xmm4, 8                           ; psrlw p.px, 8                        [   ........ .....                    x       ]
    pshuflw xmm2, xmm4, 255                 ; pshuflw m.xmm, p.px, 255             [   ........ .....                   wr       ]
    packuswb xmm4, xmm4                     ; packuswb p.px, p.px                  [   ........ .....                   .x       ]
    pshufd xmm2, xmm2, 0                    ; pshufd m.xmm, m.xmm, 0               [   ........ .....                   x.       ]
    pxor xmm2, [r12-64]                     ; pxor m.xmm, [swConstPool-64]         [   r....... .....                   x.       ]
    rex test bpl, 15                        ; test dPtr, 15                        [   .....r.. .....                   ..       ]
    jz L16                                  ; jz L16                               [   ........ .....                   ..       ]
    L12:                                    ;                                      [   ........ .....                   ..       ]
    movd xmm5, [rbp]                        ; movd c0, [dPtr]                      [   .....r.. .....                   ..w      ]
    pmovzxbw xmm5, xmm5                     ; pmovzxbw c0, c0                      [   ........ .....                   ..x      ]
    pmullw xmm5, xmm2                       ; pmullw c0, m.xmm                     [   ........ .....                   r.x      ]
    paddw xmm5, xmm0                        ; paddw c0, xmm.u16_128                [   .r...... .....                   ..x      ]
    pmulhuw xmm5, xmm1                      ; pmulhuw c0, xmm.u16_257              [   ..r..... .....                   ..x      ]
    packuswb xmm5, xmm5                     ; packuswb c0, c0                      [   ........ .....                   ..x      ]
    paddd xmm5, xmm4                        ; paddd c0, p.px                       [   ........ .....                   .rx      ]
    movd dword ptr [rbp], xmm5              ; movd [dPtr], c0                      [   .....r.. .....                   ..R      ]
    sub edi, 1                              ; sub i, 1                             [   ........ ....x                   ..       ]
    lea rbp, [rbp+4]                        ; lea dPtr, [dPtr+4]                   [   .....x.. .....                   ..       ]
    jz L17                                  ; jz L17                               [   ........ .....                   ..       ]
    rex test bpl, 15                        ; test dPtr, 15                        [   .....r.. .....                   ..       ]
    short jnz L12                           ; jnz L12                              [   ........ .....                   ..       ]
    L16:                                    ;                                      [   ........ .....                   ..       ]
    cmp edi, 4                              ; cmp i, 4                             [   ........ ....r                   ..       ]
    short jb L12                            ; jb L12                               [   ........ .....                   ..       ]
    sub edi, 8                              ; sub i, 8                             [   ........ ....x                   ..       ]
    js L14                                  ; js L14                               [   ........ .....                   ..       ]
    L13:                                    ;                                      [   ........ .....                   ..       ]
    pmovzxbw xmm5, qword ptr [rbp]          ; pmovzxbw c0, [dPtr]                  [   .....r.. .....                   .. w     ]
    pmovzxbw xmm6, qword ptr [rbp+8]        ; pmovzxbw c1, [dPtr+8]                [   .....r.. .....                   .. .w    ]
    pmovzxbw xmm7, qword ptr [rbp+16]       ; pmovzxbw c2, [dPtr+16]               [   .....r.. .....                   .. ..w   ]
    pmovzxbw xmm8, qword ptr [rbp+24]       ; pmovzxbw c3, [dPtr+24]               [   .....r.. .....                   .. ...w  ]
    pmullw xmm5, xmm2                       ; pmullw c0, m.xmm                     [   ........ .....                   r. x...  ]
    pmullw xmm6, xmm2                       ; pmullw c1, m.xmm                     [   ........ .....                   r. .x..  ]
    pmullw xmm7, xmm2                       ; pmullw c2, m.xmm                     [   ........ .....                   r. ..x.  ]
    pmullw xmm8, xmm2                       ; pmullw c3, m.xmm                     [   ........ .....                   r. ...x  ]
    paddw xmm5, xmm0                        ; paddw c0, xmm.u16_128                [   .r...... .....                   .. x...  ]
    paddw xmm6, xmm0                        ; paddw c1, xmm.u16_128                [   .r...... .....                   .. .x..  ]
    pmulhuw xmm5, xmm1                      ; pmulhuw c0, xmm.u16_257              [   ..r..... .....                   .. x...  ]
    paddw xmm7, xmm0                        ; paddw c2, xmm.u16_128                [   .r...... .....                   .. ..x.  ]
    pmulhuw xmm6, xmm1                      ; pmulhuw c1, xmm.u16_257              [   ..r..... .....                   .. .x..  ]
    paddw xmm8, xmm0                        ; paddw c3, xmm.u16_128                [   .r...... .....                   .. ...x  ]
    pmulhuw xmm7, xmm1                      ; pmulhuw c2, xmm.u16_257              [   ..r..... .....                   .. ..x.  ]
    pmulhuw xmm8, xmm1                      ; pmulhuw c3, xmm.u16_257              [   ..r..... .....                   .. ...x  ]
    packuswb xmm5, xmm6                     ; packuswb c0, c1                      [   ........ .....                   .. xR..  ]
    packuswb xmm7, xmm8                     ; packuswb c2, c3                      [   ........ .....                   .. . xR  ]
    paddd xmm5, xmm4                        ; paddd c0, p.px                       [   ........ .....                   .r x .   ]
    paddd xmm7, xmm4                        ; paddd c2, p.px                       [   ........ .....                   .r . x   ]
    movaps [rbp], xmm5                      ; movaps [dPtr], c0                    [   .....r.. .....                   .. R .   ]
    movaps [rbp+16], xmm7                   ; movaps [dPtr+16], c2                 [   .....r.. .....                   ..   R   ]
    add rbp, 32                             ; add dPtr, 32                         [   .....x.. .....                   ..       ]
    sub edi, 8                              ; sub i, 8                             [   ........ ....x                   ..       ]
    short jns L13                           ; jns L13                              [   ........ .....                   ..       ]
    L14:                                    ;                                      [   ........ .....                   ..       ]
    add edi, 4                              ; add i, 4                             [   ........ ....x                   ..       ]
    js L15                                  ; js L15                               [   ........ .....                   ..       ]
    pmovzxbw xmm5, qword ptr [rbp]          ; pmovzxbw c0, [dPtr]                  [   .....r.. .....                   ..     w ]
    pmovzxbw xmm6, qword ptr [rbp+8]        ; pmovzxbw c1, [dPtr+8]                [   .....r.. .....                   ..     .w]
    pmullw xmm5, xmm2                       ; pmullw c0, m.xmm                     [   ........ .....                   r.     x.]
    pmullw xmm6, xmm2                       ; pmullw c1, m.xmm                     [   ........ .....                   r.     .x]
    paddw xmm5, xmm0                        ; paddw c0, xmm.u16_128                [   .r...... .....                   ..     x.]
    paddw xmm6, xmm0                        ; paddw c1, xmm.u16_128                [   .r...... .....                   ..     .x]
    pmulhuw xmm5, xmm1                      ; pmulhuw c0, xmm.u16_257              [   ..r..... .....                   ..     x.]
    pmulhuw xmm6, xmm1                      ; pmulhuw c1, xmm.u16_257              [   ..r..... .....                   ..     .x]
    packuswb xmm5, xmm6                     ; packuswb c0, c1                      [   ........ .....                   ..     xR]
    paddd xmm5, xmm4                        ; paddd c0, p.px                       [   ........ .....                   .r     x ]
    movaps [rbp], xmm5                      ; movaps [dPtr], c0                    [   .....r.. .....                   ..     R ]
    add rbp, 16                             ; add dPtr, 16                         [   .....x.. .....                   ..       ]
    sub edi, 4                              ; sub i, 4                             [   ........ ....x                   ..       ]
    L15:                                    ;                                      [   ........ .....                   ..       ]
    add edi, 4                              ; add i, 4                             [   ........ ....x                   ..       ]
    jnz L12                                 ; jnz L12                              [   ........ .....                   ..       ]
    L17:                                    ;                                      [   ........ ....                             ]
    jmp L3                                  ; jmp L3                               [   ........ ....                             ]
    L11:                                    ;                                      [   ........ .....                            ]
    lea rbp, qword ptr [rbp+rdi*4]          ; lea dPtr, [dPtr+i*4]                 [   .....x.. ....r                            ]
    jmp L3                                  ; jmp L3                               [   ........ .....                            ]
    L4:                                     ;                                      [   .......  ...                              ]
    add rbx, 8                              ; add rows, 8                          [   ....x..  ...                              ]
    sub eax, 1                              ; sub y, 1                             [   ...x...  ...                              ]
    jz L6                                   ; jz L6                                [   .......  ...                              ]
    shl edx, 2                              ; shl x, 2                             [   .......  ..x                              ]
    mov rcx, qword ptr [rbx]                ; mov chunk, [rows]                    [   ....r..w ...                              ]
    add rbp, qword ptr [rsp-80]             ; add dPtr, [dStride]                  [   .....xr. ...                              ]
    sub rbp, edx                            ; sub dPtr, x                          [   .....x.. ..r                              ]
    test rcx, rcx                           ; test chunk, chunk                    [   .......r ...                              ]
    jnz L2                                  ; jnz L2                               [   ........ ...                              ]
    L5:                                     ;                                      [   .......  ...                              ]
    add rbx, 8                              ; add rows, 8                          [   ....x..  ...                              ]
    sub eax, 1                              ; sub y, 1                             [   ...x...  ...                              ]
    jz L6                                   ; jz L6                                [   .......  ...                              ]
    mov rcx, qword ptr [rbx]                ; mov chunk, [rows]                    [   ....r..w ...                              ]
    add rbp, qword ptr [rsp-80]             ; add dPtr, [dStride]                  [   .....xr. ...                              ]
    test rcx, rcx                           ; test chunk, chunk                    [   .......r ...                              ]
    short jz L5                             ; jz L5                                [   ........ ...                              ]
    jmp L2                                  ; jmp L2                               [   ........ ...                              ]
    L6:                                     ;                                      [                                             ]
    L1:                                     ;                                      [                                             ]
    movaps xmm6, oword ptr [rsp-64]         ;                                      [                                             ]
    movaps xmm7, oword ptr [rsp-48]         ;                                      [                                             ]
    movaps xmm8, oword ptr [rsp-32]         ;                                      [                                             ]
    movaps xmm9, oword ptr [rsp-16]         ;                                      [                                             ]
    pop r12                                 ;                                      [                                             ]
    pop rdi                                 ;                                      [                                             ]
    pop rsi                                 ;                                      [                                             ]
    pop rbp                                 ;                                      [                                             ]
    pop rbx                                 ;                                      [                                             ]
    ret                                     ;                                      [                                             ]
    ; Function size: 1046 (bytes)
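
To make the listing easier to follow, here is a minimal scalar C++ sketch of what the one-pixel tail loop at label L7 appears to compute: a coverage mask is derived from a cell, the solid pixel is scaled by that mask, and the result is combined with the destination. Several things are assumptions read off the assembly, not confirmed by the text: the `Cell` field names, the premultiplied 32-bit pixel layout with alpha in the top byte, and the constant at `[swConstPool-64]` being 0x00FF in every 16-bit lane. The function names are hypothetical and are not part of the Blend2D API.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical cell layout: the listing reads two 32-bit values per pixel,
// at [cell] and [cell+4]. The field names are assumptions.
struct Cell {
  int32_t cover;
  int32_t area;
};

// Rounded x / 255 for x in [0, 255 * 255]; mirrors the `paddw` (+128) and
// `pmulhuw` (*257, keep the high 16 bits) pair in the listing.
inline uint32_t div255(uint32_t x) {
  return ((x + 128u) * 257u) >> 16;
}

// Updates the running cover and returns a mask clamped to [0, 256].
inline uint32_t maskFromCell(int32_t& cover, const Cell& cell) {
  cover -= cell.cover;                    // sub cover, [cell]
  // Right shift of a negative value is arithmetic on mainstream compilers
  // (and guaranteed since C++20), matching `sar`.
  int32_t m = (cell.area >> 9) + cover;   // sar msk, 9 / add msk, cover
  m = std::abs(m);                        // sub tmp, msk / cmovns msk, tmp
  return m > 256 ? 256u : uint32_t(m);    // cmp msk, 256 / cmova msk, tmp
}

// Combines one destination pixel with a solid source pixel under a mask.
// Both pixels are assumed to be premultiplied, alpha in bits 24..31.
inline uint32_t srcOverSolid(uint32_t dst, uint32_t src, uint32_t mask) {
  uint32_t p[4];
  for (int i = 0; i < 4; i++)
    p[i] = (((src >> (i * 8)) & 0xFFu) * mask) >> 8;   // pmullw / psrlw 8

  uint32_t inv = 255u - p[3];             // pshuflw 255 / pxor with 0x00FF
  uint32_t out = 0;
  for (int i = 0; i < 4; i++) {
    uint32_t d = (dst >> (i * 8)) & 0xFFu;
    uint32_t v = div255(d * inv) + p[i];  // pmullw / paddw / pmulhuw / paddw
    if (v > 255u) v = 255u;               // packuswb saturates to a byte
    out |= v << (i * 8);
  }
  return out;
}
```

With a mask of 256 and an opaque source the destination is replaced by the source, and with a mask of 0 it is left unchanged. The vectorized loop at L9 seems to perform the same arithmetic on four pixels per iteration.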
    

All pipelines were generated on an SSE4.1-capable machine. They would look different on a machine that supports only SSE2 or SSE3.
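
As one illustration of how the output could differ, `pmovzxbw` (used in the listing to widen bytes to 16-bit lanes) is an SSE4.1 instruction. A common SSE2 substitute is to interleave the bytes with zero using `punpcklbw`. The sketch below shows that idiom with compiler intrinsics; whether Blend2D emits exactly this sequence on SSE2-only machines is an assumption, and `widenU8ToU16` is a hypothetical helper name. A plain scalar loop is used when SSE2 is not available at compile time.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
  #include <emmintrin.h>
  #define SKETCH_HAS_SSE2 1
#else
  #define SKETCH_HAS_SSE2 0
#endif

// Widens 8 packed bytes to 8 unsigned 16-bit values, which is the job
// `pmovzxbw xmm, qword ptr [mem]` does in the SSE4.1 listing.
inline void widenU8ToU16(const uint8_t* src, uint16_t* dst) {
#if SKETCH_HAS_SSE2
  // movq: load 8 bytes into the low half of the register.
  __m128i v = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));
  // punpcklbw with zero: each byte becomes the low half of a 16-bit lane.
  v = _mm_unpacklo_epi8(v, _mm_setzero_si128());
  _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
#else
  // Portable fallback for targets without SSE2.
  for (int i = 0; i < 8; i++)
    dst[i] = src[i];
#endif
}
```

The SSE2 path needs one extra zeroed register and one extra instruction per load, which is the kind of difference a pipeline generator can account for once it knows the CPU features of the machine it runs on.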