A compiler and DSL which uses MLIR to compile stateful signal processing kernels to vectorized assembly

foltik/SignalScript


SignalScript

SignalScript uses MLIR to compile stateful signal processing kernels to vectorized assembly. The compiler parses source code into an AST, lowers it to standard MLIR dialects (func, arith, memref, affine), and then uses LLVM to produce optimized assembly.

From a single pointwise kernel, SignalScript automatically generates variants that are batched across instruments, time, or both. In each variant, state is automatically laid out for vectorization and data locality. With modest effort, the same approach could be extended to generate CUDA or SPIR-V GPU kernels.

Usage

nix develop
make
./build/ssc --help

Example

fn ema(curr, alpha) {
    state prev;

    next = (alpha * curr) + ((1.0 - alpha) * prev);
    prev = next;
    return next;
}

fn rate_of_change(curr) {
    state prev;

    delta = curr - prev;
    prev = curr;
    return delta;
}

fn pipeline(x) {
    y = rate_of_change(x);
    z = ema(y, 0.9);
    return z;
}

Scalar Mode (default)

By default, SignalScript generates a function that runs the kernel on one input value, returning one output value. It has the signature (float input, float[S] state) -> float, where S is the total size of the state declared within the kernel.
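As a reading aid, the scalar signature can be modeled in plain Python. This is a sketch of the semantics only, not the compiler's output. The name `pipeline_scalar` is hypothetical, the zero-initialized state is an assumption, and the state order (index 0 for rate_of_change.prev, index 1 for ema.prev) follows the offsets listed with the assembly in this section. Python floats are 64-bit, so results match the generated 32-bit code only approximately.

```python
def pipeline_scalar(x, state):
    """Model of (float input, float[S] state) -> float with S = 2."""
    # rate_of_change: state[0] holds rate_of_change.prev
    delta = x - state[0]
    state[0] = x
    # ema with alpha = 0.9: state[1] holds ema.prev
    alpha = 0.9
    nxt = alpha * delta + (1.0 - alpha) * state[1]
    state[1] = nxt
    return nxt


state = [0.0, 0.0]  # assumption: the caller zero-initializes the state
outputs = [pipeline_scalar(x, state) for x in (1.0, 3.0)]
print(outputs, state)
```

The caller owns the state buffer and passes it on every call, which is what lets one compiled function serve many independent streams.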

./build/ssc test/kernels.ss --fn=pipeline --avx512

This results in the following assembly, where:

  • input: xmm0
  • rate_of_change.prev: [rdi]
  • ema.prev: [rdi + 4]
  • output: xmm0

pipeline:
        vsubss  xmm1, xmm0, dword ptr [rdi]
        vmovss  dword ptr [rdi], xmm0
        vmulss  xmm0, xmm1, dword ptr [rip + .LCPI0_0]
        vmovss  xmm1, dword ptr [rdi + 4]
        vmulss  xmm1, xmm1, dword ptr [rip + .LCPI0_1]
        vaddss  xmm0, xmm0, xmm1
        vmovss  dword ptr [rdi + 4], xmm0
        ret
.LCPI0_0:
        .long   0x3f666666 # float 0.899999976
.LCPI0_1:
        .long   0x3dccccd0 # float 0.100000024
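The two constants in the listing can be checked by hand. The sketch below uses only Python's standard `struct` module; the helper names `f32_bits` and `bits_f32` are hypothetical. It suggests, as an inference from the bit patterns rather than something the source states, that `1.0 - alpha` was folded in 32-bit arithmetic: the result is 0x3dccccd0, not 0x3dcccccd, which is the float nearest to 0.1.

```python
import struct


def f32_bits(value):
    """Round a Python float to float32 and return its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_f32(bits):
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


alpha = bits_f32(f32_bits(0.9))   # 0.9 rounded to float32
one_minus_alpha = 1.0 - alpha     # exact: the difference is representable in float32

print(hex(f32_bits(alpha)))            # 0x3f666666
print(hex(f32_bits(one_minus_alpha)))  # 0x3dccccd0
print(hex(f32_bits(0.1)))              # 0x3dcccccd
```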

Batched Mode

In batched mode (--batch=B), SignalScript generates a function that runs B instances of the original kernel in parallel, each on its own input value, writing the outputs to a buffer. It has the signature (float[B] input, float[S,B] state, float[B] output) -> ().

State variables are automatically converted to Structure-of-Arrays form (float[S,B] rather than float[B,S]) so that each individual state variable is contiguous in memory across all batches, allowing vectorization.
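To make the layout difference concrete, here is a small sketch of the byte offsets under each layout. It assumes row-major storage and 4-byte floats, which is consistent with the [rsi + 64] offset in the listing for this mode; the function names are hypothetical.

```python
FLOAT_BYTES = 4  # assumption: 32-bit floats


def soa_offset(s, b, S, B):
    """Byte offset of state variable s, instance b, in a row-major float[S,B]."""
    return (s * B + b) * FLOAT_BYTES


def aos_offset(s, b, S, B):
    """Byte offset of the same element in a row-major float[B,S]."""
    return (b * S + s) * FLOAT_BYTES


S, B = 2, 16  # two state variables (rate_of_change.prev, ema.prev), 16 instances
ema_prev_soa = [soa_offset(1, b, S, B) for b in range(B)]
ema_prev_aos = [aos_offset(1, b, S, B) for b in range(B)]

print(ema_prev_soa)  # consecutive 4-byte slots starting at 64
print(ema_prev_aos)  # every other slot, starting at 4
```

With S = 2 and B = 16, all 16 copies of ema.prev occupy one contiguous 64-byte run in the float[S,B] layout, so a single 512-bit load covers them. In the float[B,S] layout they are 8 bytes apart and would need a gather or shuffle.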

./build/ssc test/kernels.ss --fn=pipeline --batch=16 --avx512

This results in the following assembly, where:

  • input[16]: [rdi] (loaded into zmm0)
  • rate_of_change.prev[16]: [rsi]
  • ema.prev[16]: [rsi + 64]
  • output[16]: [rdx]

Notice that all 16 instances of the kernel are computed in parallel in the 512-bit zmm registers.

pipeline_batched:
        vmovups zmm0, zmmword ptr [rdi]
        vsubps  zmm1, zmm0, zmmword ptr [rsi]
        vmovups zmmword ptr [rsi], zmm0
        vmulps  zmm0, zmm1, dword ptr [rip + .LCPI0_0]{1to16}
        vmovups zmm1, zmmword ptr [rsi + 64]
        vmulps  zmm1, zmm1, dword ptr [rip + .LCPI0_1]{1to16}
        vaddps  zmm0, zmm0, zmm1
        vmovups zmmword ptr [rsi + 64], zmm0
        vmovups zmmword ptr [rdx], zmm0
        vzeroupper
        ret
.LCPI0_0:
        .long   0x3f666666 # float 0.899999976
.LCPI0_1:
        .long   0x3dccccd0 # float 0.100000024

SignalScript can also generate Arm Neon instructions.

./build/ssc test/kernels.ss --fn=pipeline --batch=4 --neon

pipeline_batched:
        mov     w8, #26214
        movk    w8, #16230, lsl #16
        dup     v0.4s, w8
        mov     w8, #52432
        movk    w8, #15820, lsl #16
        dup     v1.4s, w8
        ldp     q2, q3, [x1]
        fmul    v1.4s, v3.4s, v1.4s
        ldr     q3, [x0]
        fsub    v2.4s, v3.4s, v2.4s
        fmul    v0.4s, v2.4s, v0.4s
        fadd    v0.4s, v0.4s, v1.4s
        stp     q3, q0, [x1]
        str     q0, [x2]
        ret
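In the Neon listing the constants are not loaded from memory but built from two 16-bit immediates with mov and movk, then splatted with dup. The sketch below reassembles them to show they are the same bit patterns as .LCPI0_0 and .LCPI0_1 in the x86 listings. The helper names are hypothetical.

```python
import struct


def mov_movk(low16, high16):
    """Value left in w8 after `mov w8, #low16` then `movk w8, #high16, lsl #16`."""
    return (high16 << 16) | low16


def bits_f32(bits):
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


c0 = mov_movk(26214, 16230)  # splatted into v0.4s
c1 = mov_movk(52432, 15820)  # splatted into v1.4s

print(hex(c0), bits_f32(c0))  # bit pattern 0x3f666666
print(hex(c1), bits_f32(c1))  # bit pattern 0x3dccccd0
```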

Temporal Mode

In temporal mode (--timesteps=T), SignalScript generates a function that runs a single instance of the original kernel over T sequential inputs, writing the output at each step to a buffer. It has the signature (float[T] input, float[S] state, float[T] output) -> ().

./build/ssc test/kernels.ss --fn=pipeline --timesteps=2 --avx512

This results in the following assembly, where:

  • input[2]: [rdi]
  • rate_of_change.prev: [rsi]
  • ema.prev: [rsi + 4]
  • output[2]: [rdx]

Note the loop unrolling.

pipeline_temporal:
        vmovss  xmm0, dword ptr [rdi]
        vsubss  xmm1, xmm0, dword ptr [rsi]
        vmovss  dword ptr [rsi], xmm0
        vmovss  xmm0, dword ptr [rip + .LCPI0_0]
        vmulss  xmm1, xmm1, xmm0
        vmovss  xmm2, dword ptr [rip + .LCPI0_1]
        vmulss  xmm3, xmm2, dword ptr [rsi + 4]
        vaddss  xmm1, xmm1, xmm3
        vmovss  dword ptr [rsi + 4], xmm1
        vmovss  dword ptr [rdx], xmm1
        vmovss  xmm1, dword ptr [rdi + 4]
        vsubss  xmm3, xmm1, dword ptr [rsi]
        vmovss  dword ptr [rsi], xmm1
        vmulss  xmm1, xmm2, dword ptr [rsi + 4]
        vmulss  xmm0, xmm3, xmm0
        vaddss  xmm0, xmm0, xmm1
        vmovss  dword ptr [rsi + 4], xmm0
        vmovss  dword ptr [rdx + 4], xmm0
        ret
.LCPI0_0:
        .long   0x3f666666 # float 0.899999976
.LCPI0_1:
        .long   0x3dccccd0 # float 0.100000024
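The temporal variant can be modeled as a loop around the scalar kernel, as in the sketch below. This shows the semantics only: the function names and the zero-initialized state are assumptions, and Python floats are 64-bit rather than 32-bit.

```python
def pipeline_step(x, state):
    """One scalar step: state[0] is rate_of_change.prev, state[1] is ema.prev."""
    delta = x - state[0]
    state[0] = x
    nxt = 0.9 * delta + (1.0 - 0.9) * state[1]
    state[1] = nxt
    return nxt


def pipeline_temporal(inputs, state, outputs):
    """Model of (float[T] input, float[S] state, float[T] output) -> ()."""
    for t in range(len(inputs)):
        # Step t reads the state written by step t - 1.
        outputs[t] = pipeline_step(inputs[t], state)


state = [0.0, 0.0]  # assumption: zero-initialized by the caller
outputs = [0.0, 0.0]
pipeline_temporal([1.0, 3.0], state, outputs)
print(outputs, state)
```

Because each step depends on the state left by the previous one, the steps cannot run in parallel across time. That matches the listing, which unrolls the loop but still uses scalar (ss) instructions.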

Temporal Batched Mode

In temporal batched mode (--timesteps=T --batch=B), SignalScript generates a function that runs B instances of the original kernel in parallel over T sequential inputs, writing the output at each step to a buffer. It has the signature (float[T] input, float[S,B] state, float[T,B] output) -> ().

State variables are again automatically converted to Structure-of-Arrays form (float[S,B] rather than float[B,S]) such that each individual state variable is contiguous in memory across all batches, allowing vectorization.

./build/ssc test/kernels.ss --fn=pipeline --batch=16 --timesteps=2 --avx512

This results in the following assembly, where:

  • input[2]: [rdi]
  • rate_of_change.prev[16]: [rsi]
  • ema.prev[16]: [rsi + 64]
  • output[2][16]: [rdx]

Notice the broadcasting of the scalar input, the loop unrolling, and the computation of all 16 instances of the kernel in parallel in the 512-bit zmm registers.

pipeline_temporal_batched:
        vbroadcastss    zmm0, dword ptr [rdi]
        vsubps  zmm1, zmm0, zmmword ptr [rsi]
        vmovups zmmword ptr [rsi], zmm0
        vbroadcastss    zmm0, dword ptr [rip + .LCPI0_0]
        vmulps  zmm1, zmm1, zmm0
        vbroadcastss    zmm2, dword ptr [rip + .LCPI0_1]
        vmulps  zmm3, zmm2, zmmword ptr [rsi + 64]
        vaddps  zmm1, zmm1, zmm3
        vmovups zmmword ptr [rsi + 64], zmm1
        vmovups zmmword ptr [rdx], zmm1
        vbroadcastss    zmm1, dword ptr [rdi + 4]
        vsubps  zmm3, zmm1, zmmword ptr [rsi]
        vmovups zmmword ptr [rsi], zmm1
        vmulps  zmm1, zmm2, zmmword ptr [rsi + 64]
        vmulps  zmm0, zmm3, zmm0
        vaddps  zmm0, zmm0, zmm1
        vmovups zmmword ptr [rsi + 64], zmm0
        vmovups zmmword ptr [rdx + 64], zmm0
        vzeroupper
        ret
.LCPI0_0:
        .long   0x3f666666 # float 0.899999976
.LCPI0_1:
        .long   0x3dccccd0 # float 0.100000024
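A plain-Python model of this variant, for reference. It is a sketch under assumptions: the function name is hypothetical, arrays are flat row-major lists, and the initial state values are chosen only to make the instances distinguishable. The inner loop over b corresponds to the vector lanes in the listing, and the outer loop over t to the unrolled timesteps.

```python
def pipeline_temporal_batched(inputs, state, outputs, B):
    """Model of (float[T] input, float[S,B] state, float[T,B] output) -> ()."""
    for t in range(len(inputs)):
        x = inputs[t]  # one scalar per timestep, broadcast to every instance
        for b in range(B):
            # state[0 * B + b] is rate_of_change.prev for instance b
            delta = x - state[0 * B + b]
            state[0 * B + b] = x
            # state[1 * B + b] is ema.prev for instance b
            nxt = 0.9 * delta + (1.0 - 0.9) * state[1 * B + b]
            state[1 * B + b] = nxt
            outputs[t * B + b] = nxt


B, T = 16, 2
# Illustrative initial state: rate_of_change.prev = 0, ema.prev = b for instance b.
state = [0.0] * B + [float(b) for b in range(B)]
outputs = [0.0] * (T * B)
pipeline_temporal_batched([1.0, 3.0], state, outputs, B)
print(outputs[:B])
```

All instances see the same input stream in this mode, so their outputs differ only through their per-instance state.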
