SignalScript uses MLIR to compile stateful signal processing kernels to vectorized assembly. The compiler parses source code into an AST, lowers it to standard MLIR dialects (func, arith, memref, affine), and then uses LLVM to produce optimized assembly.
From a single pointwise kernel, SignalScript automatically generates variants that are batched across instruments, across time, or both. In each variant, state is automatically laid out for optimal vectorization and data locality. With modest effort, the same approach could be extended to generate CUDA or SPIR-V GPU kernels.
nix develop
make
./build/ssc --help

fn ema(curr, alpha) {
    state prev;
    next = (alpha * curr) + ((1.0 - alpha) * prev);
    prev = next;
    return next;
}

fn rate_of_change(curr) {
    state prev;
    delta = curr - prev;
    prev = curr;
    return delta;
}

fn pipeline(x) {
    y = rate_of_change(x);
    z = ema(y, 0.9);
    return z;
}

By default, SignalScript generates a function that runs the kernel on one input
value, returning one output value. It has the signature (float input, float[S] state) -> float,
where S is the total size of the state declared within the kernel.
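
As a rough reference for this calling convention, the pointwise variant of pipeline behaves like the Python model below. This is only a sketch: the name pipeline_step, the use of a Python list for the state buffer, the zero-initialized state, and the slot order (rate_of_change.prev first, then ema.prev) are illustrative assumptions, not part of SignalScript.

```python
def pipeline_step(x, state):
    """Run one step of `pipeline`; `state` is a mutable list of S = 2 floats."""
    # rate_of_change: state[0] holds its `prev` (assumed slot order)
    delta = x - state[0]
    state[0] = x
    # ema with alpha = 0.9: state[1] holds its `prev`
    nxt = 0.9 * delta + (1.0 - 0.9) * state[1]
    state[1] = nxt
    return nxt


state = [0.0, 0.0]  # assumed to be zero-initialized by the caller
outputs = [pipeline_step(x, state) for x in (1.0, 1.0, 2.0)]
```

The caller owns the state buffer and passes the same buffer to every call, so successive calls see the values written by the previous one.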
./build/ssc test/kernels.ss --fn=pipeline --avx512

This results in the following assembly, where:
- input: xmm0
- rate_of_change.prev: [rdi]
- ema.prev: [rdi + 4]
- output: xmm0

pipeline:
vsubss xmm1, xmm0, dword ptr [rdi]
vmovss dword ptr [rdi], xmm0
vmulss xmm0, xmm1, dword ptr [rip + .LCPI0_0]
vmovss xmm1, dword ptr [rdi + 4]
vmulss xmm1, xmm1, dword ptr [rip + .LCPI0_1]
vaddss xmm0, xmm0, xmm1
vmovss dword ptr [rdi + 4], xmm0
ret
.LCPI0_0:
.long 0x3f666666 # float 0.899999976
.LCPI0_1:
.long 0x3dccccd0 # float 0.100000024

In batched mode (--batch=B), SignalScript generates a function that runs B
instances of the original kernel in parallel, writing the outputs to a buffer.
It has the signature (float[B] input, float[S,B] state, float[B] output) -> ().
State variables are automatically converted to Structure-of-Arrays form
(float[S, B] rather than float[B, S]) such that each individual state
variable is contiguous in memory across all batches, allowing vectorization.
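
The layout difference can be sketched as index arithmetic over a flat buffer. The names soa_index, aos_index, and pipeline_batched below are illustrative, and the sketch assumes that each instance receives its own input value and that state starts at zero.

```python
S, B = 2, 4  # two state variables (rate_of_change.prev, ema.prev), four instances


def soa_index(s, b):
    # float[S, B]: variable s of all B instances forms one contiguous run
    return s * B + b


def aos_index(s, b):
    # float[B, S]: the variables of a single instance are adjacent instead
    return b * S + s


def pipeline_batched(inputs, state, output):
    """Reference model: one step of B independent instances over SoA state."""
    for b in range(B):  # the loop over b corresponds to the vector lanes
        delta = inputs[b] - state[soa_index(0, b)]
        state[soa_index(0, b)] = inputs[b]
        nxt = 0.9 * delta + (1.0 - 0.9) * state[soa_index(1, b)]
        state[soa_index(1, b)] = nxt
        output[b] = nxt


state = [0.0] * (S * B)
output = [0.0] * B
pipeline_batched([1.0, 2.0, 3.0, 4.0], state, output)
```

With the SoA layout, the B values of a given state variable sit at consecutive indices, so a single vector load or store covers all instances. With the AoS layout the same values are S elements apart and would have to be gathered.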
./build/ssc test/kernels.ss --fn=pipeline --batch=16 --avx512

This results in the following assembly, where:
- input[16]: zmm0 (loaded from [rdi])
- rate_of_change.prev[16]: [rsi]
- ema.prev[16]: [rsi + 64]
- output[16]: [rdx]

Notice that all 16 instances of the kernel are computed in parallel via the 512-bit zmm registers.

pipeline_batched:
vmovups zmm0, zmmword ptr [rdi]
vsubps zmm1, zmm0, zmmword ptr [rsi]
vmovups zmmword ptr [rsi], zmm0
vmulps zmm0, zmm1, dword ptr [rip + .LCPI0_0]{1to16}
vmovups zmm1, zmmword ptr [rsi + 64]
vmulps zmm1, zmm1, dword ptr [rip + .LCPI0_1]{1to16}
vaddps zmm0, zmm0, zmm1
vmovups zmmword ptr [rsi + 64], zmm0
vmovups zmmword ptr [rdx], zmm0
vzeroupper
ret
.LCPI0_0:
.long 0x3f666666 # float 0.899999976
.LCPI0_1:
.long 0x3dccccd0 # float 0.100000024

SignalScript can also generate Arm Neon instructions.

./build/ssc test/kernels.ss --fn=pipeline --batch=4 --neon

pipeline_batched:
mov w8, #26214
movk w8, #16230, lsl #16
dup v0.4s, w8
mov w8, #52432
movk w8, #15820, lsl #16
dup v1.4s, w8
ldp q2, q3, [x1]
fmul v1.4s, v3.4s, v1.4s
ldr q3, [x0]
fsub v2.4s, v3.4s, v2.4s
fmul v0.4s, v2.4s, v0.4s
fadd v0.4s, v0.4s, v1.4s
stp q3, q0, [x1]
str q0, [x2]
ret

In temporal mode (--timesteps=T), SignalScript generates a function that runs a
single instance of the original kernel over T sequential inputs, writing the output at each
step to a buffer. It has the signature (float[T] input, float[S] state, float[T] output) -> ().
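
A minimal Python model of the temporal variant, assuming the same state slot order as the kernels are called in (rate_of_change.prev, then ema.prev) and a zero-initialized state buffer; the name pipeline_temporal is used here only for illustration.

```python
def pipeline_temporal(inputs, state, output):
    """Reference model: run one instance over len(inputs) sequential samples."""
    for t, x in enumerate(inputs):
        delta = x - state[0]            # rate_of_change
        state[0] = x
        nxt = 0.9 * delta + (1.0 - 0.9) * state[1]  # ema, alpha = 0.9
        state[1] = nxt
        output[t] = nxt


# Two calls with T = 2 on the same state buffer ...
state = [0.0, 0.0]
first, second = [0.0, 0.0], [0.0, 0.0]
pipeline_temporal([1.0, 1.0], state, first)
pipeline_temporal([2.0, 2.0], state, second)

# ... produce the same outputs as one call with T = 4.
state_once = [0.0, 0.0]
once = [0.0] * 4
pipeline_temporal([1.0, 1.0, 2.0, 2.0], state_once, once)
```

Because each step depends on the state written by the previous step, the time loop is inherently sequential. In this model, state carries over between calls, so a long stream can be processed in chunks of T samples.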
./build/ssc test/kernels.ss --fn=pipeline --timesteps=2 --avx512

This results in the following assembly, where:
- input[2]: [rdi]
- rate_of_change.prev: [rsi]
- ema.prev: [rsi + 4]
- output[2]: [rdx]

Note the loop unrolling.

pipeline_temporal:
vmovss xmm0, dword ptr [rdi]
vsubss xmm1, xmm0, dword ptr [rsi]
vmovss dword ptr [rsi], xmm0
vmovss xmm0, dword ptr [rip + .LCPI0_0]
vmulss xmm1, xmm1, xmm0
vmovss xmm2, dword ptr [rip + .LCPI0_1]
vmulss xmm3, xmm2, dword ptr [rsi + 4]
vaddss xmm1, xmm1, xmm3
vmovss dword ptr [rsi + 4], xmm1
vmovss dword ptr [rdx], xmm1
vmovss xmm1, dword ptr [rdi + 4]
vsubss xmm3, xmm1, dword ptr [rsi]
vmovss dword ptr [rsi], xmm1
vmulss xmm1, xmm2, dword ptr [rsi + 4]
vmulss xmm0, xmm3, xmm0
vaddss xmm0, xmm0, xmm1
vmovss dword ptr [rsi + 4], xmm0
vmovss dword ptr [rdx + 4], xmm0
ret
.LCPI0_0:
.long 0x3f666666 # float 0.899999976
.LCPI0_1:
.long 0x3dccccd0 # float 0.100000024

In temporal batched mode (--timesteps=T --batch=B), SignalScript generates a function that runs B
instances of the original kernel in parallel over T sequential inputs, writing the output at each
step to a buffer. It has the signature (float[T] input, float[S,B] state, float[T,B] output) -> ().
State variables are again automatically converted to Structure-of-Arrays form
(float[S,B] rather than float[B,S]) such that each individual state
variable is contiguous in memory across all batches, allowing vectorization.
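
The combined variant can be modeled as a sequential loop over time wrapped around a parallel loop over instances. The sketch below follows the signature above literally: each time step supplies one scalar input that is shared by all B instances, and the output is indexed as a row-major float[T,B]. The function name, the flat-list buffers, and the initial state values are assumptions made for illustration.

```python
S, B, T = 2, 4, 2


def pipeline_temporal_batched(inputs, state, output):
    """Reference model: B instances over T steps, SoA state, row-major output."""
    for t in range(T):                  # sequential: each step needs the last state
        x = inputs[t]                   # one scalar per step, shared by all instances
        for b in range(B):              # independent across instances
            delta = x - state[0 * B + b]
            state[0 * B + b] = x
            nxt = 0.9 * delta + (1.0 - 0.9) * state[1 * B + b]
            state[1 * B + b] = nxt
            output[t * B + b] = nxt


# rate_of_change.prev starts at 0 for every instance; ema.prev differs per instance.
state = [0.0, 0.0, 0.0, 0.0,
         0.0, 1.0, 2.0, 3.0]
output = [0.0] * (T * B)
pipeline_temporal_batched([1.0, 1.0], state, output)
```

Although every instance sees the same input stream here, the instances can still diverge because each one carries its own state.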
./build/ssc test/kernels.ss --fn=pipeline --batch=16 --timesteps=2 --avx512

This results in the following assembly, where:
- input[2]: [rdi]
- rate_of_change.prev[16]: [rsi]
- ema.prev[16]: [rsi + 64]
- output[2][16]: [rdx]

Notice the broadcasting of the scalar input, the loop unrolling, and the computation of all 16 instances of the kernel in parallel via the 512-bit zmm registers.

pipeline_temporal_batched:
vbroadcastss zmm0, dword ptr [rdi]
vsubps zmm1, zmm0, zmmword ptr [rsi]
vmovups zmmword ptr [rsi], zmm0
vbroadcastss zmm0, dword ptr [rip + .LCPI0_0]
vmulps zmm1, zmm1, zmm0
vbroadcastss zmm2, dword ptr [rip + .LCPI0_1]
vmulps zmm3, zmm2, zmmword ptr [rsi + 64]
vaddps zmm1, zmm1, zmm3
vmovups zmmword ptr [rsi + 64], zmm1
vmovups zmmword ptr [rdx], zmm1
vbroadcastss zmm1, dword ptr [rdi + 4]
vsubps zmm3, zmm1, zmmword ptr [rsi]
vmovups zmmword ptr [rsi], zmm1
vmulps zmm1, zmm2, zmmword ptr [rsi + 64]
vmulps zmm0, zmm3, zmm0
vaddps zmm0, zmm0, zmm1
vmovups zmmword ptr [rsi + 64], zmm0
vmovups zmmword ptr [rdx + 64], zmm0
vzeroupper
ret
.LCPI0_0:
.long 0x3f666666 # float 0.899999976
.LCPI0_1:
.long 0x3dccccd0 # float 0.100000024
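
The second constant in the listings is 0.100000024 rather than the nearest single-precision value to 0.1. This is consistent with 1.0 - alpha being evaluated in single precision after alpha = 0.9 has been rounded to a float, which can be checked with a short Python sketch (the helper names f32 and f32_bits are illustrative):

```python
import struct


def f32(x):
    """Round x to the nearest IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f32_bits(x):
    """Return the single-precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


alpha = f32(0.9)
print(hex(f32_bits(alpha)))        # 0x3f666666, matches .LCPI0_0
print(hex(f32_bits(1.0 - alpha)))  # 0x3dccccd0, matches .LCPI0_1
print(hex(f32_bits(0.1)))          # 0x3dcccccd, the nearest float to 0.1 itself
```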