Improving the lowering and compilation of unrolled `lax.scan` loops #25336

carlosgmartin · 2024-12-08T18:05:02Z

lax.scan being slow is a common issue. Here are a few examples:

Jax scans are slower than expected #2491 (comment)
LU decomposition runtime is dominated by lu_pivots_to_permutation on GPU #5880 (comment)
inner jit functions are re-traced (and re-compiled) #7155 (comment)
Question regarding performance of jax.lax.scan #16106 (comment)
Fully unroll the scan in jnp.searchsorted, when method 'scan_unrolled' is specified. On GPU, XLA's 'scan' (fori_loop) implementation launches multiple calls to the body_fun GPU kernel, whereas a fully unrolled scan can be fused into a single kernel launch. #17509
Conditional array update on GPU using jnp.where vs fori_loop #19972 (comment)
JAX code is extremely slow on GPUs #24411 (comment)
lax.scan on map_coordinates slower on GPU than on CPU? #10794 (comment)

Often, the reason for this slowness is that lax.scan causes multiple kernel launches on GPU, as discussed here.

One solution to this problem is to unroll the loop, using the unroll argument of lax.scan. The disadvantage is that unrolling increases lowering and compilation times, sometimes dramatically.
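To make the trade-off concrete, here is a minimal sketch that lowers the same scan with and without full unrolling and compares the size of the lowered text. The `cumulative_sum` function and the length of 64 are illustrative choices, not taken from any of the linked issues, and text length is only a rough proxy for lowering cost.

```python
import jax
import jax.numpy as jnp
from jax import lax


def cumulative_sum(xs, unroll=1):
    # Toy scan body (illustrative): running sum, returned both as the carry
    # and as the per-step output.
    def step(carry, x):
        new_carry = carry + x
        return new_carry, new_carry

    _, ys = lax.scan(step, jnp.zeros((), xs.dtype), xs, unroll=unroll)
    return ys


xs = jnp.arange(64, dtype=jnp.float32)

# Lower (but do not compile) the rolled and the fully unrolled versions.
rolled_text = jax.jit(lambda v: cumulative_sum(v, unroll=1)).lower(xs).as_text()
unrolled_text = jax.jit(lambda v: cumulative_sum(v, unroll=True)).lower(xs).as_text()

# The unrolled module repeats the per-step ops, so its text is longer.
print(len(rolled_text), len(unrolled_text))
```

The gap between the two lengths grows with the number of steps, which is the effect this post is about.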

If JAX improved the lowering and compilation of such unrolled loops, users could get the best of both worlds: fast execution and fast lowering/compilation.

Before we get to compilation, let's see how much we can optimize lowering.

Let's start with a simple example, which computes discounted returns for reinforcement learning:

import functools

import jax
from jax import lax, random


def get_returns(rewards, discounts, unroll=1):
    def f(carry, reward_discount):
        reward, discount = reward_discount
        new_carry = reward + discount * carry
        return new_carry, new_carry

    xs = rewards, discounts
    _, returns = lax.scan(f, 0.0, xs, unroll=unroll, reverse=True)
    return returns


def main():
    steps = 10

    key = random.key(0)
    keys = random.split(key)
    rewards = random.normal(keys[0], [steps])
    discounts = random.uniform(keys[1], [steps])

    for unroll in [True]:

        f = functools.partial(get_returns, unroll=unroll)

        # jaxpr = jax.make_jaxpr(f)(rewards, discounts)
        # print(jaxpr)

        lowered = jax.jit(f).lower(rewards, discounts)
        print(lowered.as_text())

        # compiled = lowered.compile()
        # print(compiled.as_text())


if __name__ == "__main__":
    main()

Here is the output:

$ py temp.py
module @jit__unnamed_wrapped_function_ attributes {mhlo.num_partitions = 1 : i32, mhlo.num_replicas = 1 : i32} {
  func.func public @main(%arg0: tensor<10xf32>, %arg1: tensor<10xf32>) -> (tensor<10xf32> {jax.result_info = ""}) {
    %cst = stablehlo.constant dense<0.000000e+00> : tensor<f32>
    %0 = stablehlo.slice %arg0 [0:10] : (tensor<10xf32>) -> tensor<10xf32>
    %1 = stablehlo.slice %arg0 [10:10] : (tensor<10xf32>) -> tensor<0xf32>
    %2 = stablehlo.slice %arg1 [0:10] : (tensor<10xf32>) -> tensor<10xf32>
    %3 = stablehlo.slice %arg1 [10:10] : (tensor<10xf32>) -> tensor<0xf32>
    %4 = stablehlo.reshape %1 : (tensor<0xf32>) -> tensor<0x10xf32>
    %5 = stablehlo.reshape %3 : (tensor<0xf32>) -> tensor<0x10xf32>
    %cst_0 = stablehlo.constant dense<0.000000e+00> : tensor<f32>
    %6 = stablehlo.broadcast_in_dim %cst_0, dims = [] : (tensor<f32>) -> tensor<0x10xf32>
    %7 = stablehlo.reshape %6 : (tensor<0x10xf32>) -> tensor<0xf32>
    %8 = stablehlo.slice %0 [9:10] : (tensor<10xf32>) -> tensor<1xf32>
    %9 = stablehlo.reshape %8 : (tensor<1xf32>) -> tensor<f32>
    %10 = stablehlo.slice %2 [9:10] : (tensor<10xf32>) -> tensor<1xf32>
    %11 = stablehlo.reshape %10 : (tensor<1xf32>) -> tensor<f32>
    %12:2 = call @None(%cst, %9, %11) : (tensor<f32>, tensor<f32>, tensor<f32>) -> (tensor<f32>, tensor<f32>)
    %13 = stablehlo.slice %0 [8:9] : (tensor<10xf32>) -> tensor<1xf32>
    %14 = stablehlo.reshape %13 : (tensor<1xf32>) -> tensor<f32>
    %15 = stablehlo.slice %2 [8:9] : (tensor<10xf32>) -> tensor<1xf32>
    %16 = stablehlo.reshape %15 : (tensor<1xf32>) -> tensor<f32>
    %17:2 = call @None(%12#0, %14, %16) : (tensor<f32>, tensor<f32>, tensor<f32>) -> (tensor<f32>, tensor<f32>)
    %18 = stablehlo.slice %0 [7:8] : (tensor<10xf32>) -> tensor<1xf32>
    %19 = stablehlo.reshape %18 : (tensor<1xf32>) -> tensor<f32>
    %20 = stablehlo.slice %2 [7:8] : (tensor<10xf32>) -> tensor<1xf32>
    %21 = stablehlo.reshape %20 : (tensor<1xf32>) -> tensor<f32>
    %22:2 = call @None(%17#0, %19, %21) : (tensor<f32>, tensor<f32>, tensor<f32>) -> (tensor<f32>, tensor<f32>)
    %23 = stablehlo.slice %0 [6:7] : (tensor<10xf32>) -> tensor<1xf32>
    %24 = stablehlo.reshape %23 : (tensor<1xf32>) -> tensor<f32>
    %25 = stablehlo.slice %2 [6:7] : (tensor<10xf32>) -> tensor<1xf32>
    %26 = stablehlo.reshape %25 : (tensor<1xf32>) -> tensor<f32>
    %27:2 = call @None(%22#0, %24, %26) : (tensor<f32>, tensor<f32>, tensor<f32>) -> (tensor<f32>, tensor<f32>)
    %28 = stablehlo.slice %0 [5:6] : (tensor<10xf32>) -> tensor<1xf32>
    %29 = stablehlo.reshape %28 : (tensor<1xf32>) -> tensor<f32>
    %30 = stablehlo.slice %2 [5:6] : (tensor<10xf32>) -> tensor<1xf32>
    %31 = stablehlo.reshape %30 : (tensor<1xf32>) -> tensor<f32>
    %32:2 = call @None(%27#0, %29, %31) : (tensor<f32>, tensor<f32>, tensor<f32>) -> (tensor<f32>, tensor<f32>)
    %33 = stablehlo.slice %0 [4:5] : (tensor<10xf32>) -> tensor<1xf32>
    %34 = stablehlo.reshape %33 : (tensor<1xf32>) -> tensor<f32>
    %35 = stablehlo.slice %2 [4:5] : (tensor<10xf32>) -> tensor<1xf32>
    %36 = stablehlo.reshape %35 : (tensor<1xf32>) -> tensor<f32>
    %37:2 = call @None(%32#0, %34, %36) : (tensor<f32>, tensor<f32>, tensor<f32>) -> (tensor<f32>, tensor<f32>)
    %38 = stablehlo.slice %0 [3:4] : (tensor<10xf32>) -> tensor<1xf32>
    %39 = stablehlo.reshape %38 : (tensor<1xf32>) -> tensor<f32>
    %40 = stablehlo.slice %2 [3:4] : (tensor<10xf32>) -> tensor<1xf32>
    %41 = stablehlo.reshape %40 : (tensor<1xf32>) -> tensor<f32>
    %42:2 = call @None(%37#0, %39, %41) : (tensor<f32>, tensor<f32>, tensor<f32>) -> (tensor<f32>, tensor<f32>)
    %43 = stablehlo.slice %0 [2:3] : (tensor<10xf32>) -> tensor<1xf32>
    %44 = stablehlo.reshape %43 : (tensor<1xf32>) -> tensor<f32>
    %45 = stablehlo.slice %2 [2:3] : (tensor<10xf32>) -> tensor<1xf32>
    %46 = stablehlo.reshape %45 : (tensor<1xf32>) -> tensor<f32>
    %47:2 = call @None(%42#0, %44, %46) : (tensor<f32>, tensor<f32>, tensor<f32>) -> (tensor<f32>, tensor<f32>)
    %48 = stablehlo.slice %0 [1:2] : (tensor<10xf32>) -> tensor<1xf32>
    %49 = stablehlo.reshape %48 : (tensor<1xf32>) -> tensor<f32>
    %50 = stablehlo.slice %2 [1:2] : (tensor<10xf32>) -> tensor<1xf32>
    %51 = stablehlo.reshape %50 : (tensor<1xf32>) -> tensor<f32>
    %52:2 = call @None(%47#0, %49, %51) : (tensor<f32>, tensor<f32>, tensor<f32>) -> (tensor<f32>, tensor<f32>)
    %53 = stablehlo.slice %0 [0:1] : (tensor<10xf32>) -> tensor<1xf32>
    %54 = stablehlo.reshape %53 : (tensor<1xf32>) -> tensor<f32>
    %55 = stablehlo.slice %2 [0:1] : (tensor<10xf32>) -> tensor<1xf32>
    %56 = stablehlo.reshape %55 : (tensor<1xf32>) -> tensor<f32>
    %57:2 = call @None(%52#0, %54, %56) : (tensor<f32>, tensor<f32>, tensor<f32>) -> (tensor<f32>, tensor<f32>)
    %58 = stablehlo.broadcast_in_dim %57#1, dims = [] : (tensor<f32>) -> tensor<1xf32>
    %59 = stablehlo.broadcast_in_dim %52#1, dims = [] : (tensor<f32>) -> tensor<1xf32>
    %60 = stablehlo.broadcast_in_dim %47#1, dims = [] : (tensor<f32>) -> tensor<1xf32>
    %61 = stablehlo.broadcast_in_dim %42#1, dims = [] : (tensor<f32>) -> tensor<1xf32>
    %62 = stablehlo.broadcast_in_dim %37#1, dims = [] : (tensor<f32>) -> tensor<1xf32>
    %63 = stablehlo.broadcast_in_dim %32#1, dims = [] : (tensor<f32>) -> tensor<1xf32>
    %64 = stablehlo.broadcast_in_dim %27#1, dims = [] : (tensor<f32>) -> tensor<1xf32>
    %65 = stablehlo.broadcast_in_dim %22#1, dims = [] : (tensor<f32>) -> tensor<1xf32>
    %66 = stablehlo.broadcast_in_dim %17#1, dims = [] : (tensor<f32>) -> tensor<1xf32>
    %67 = stablehlo.broadcast_in_dim %12#1, dims = [] : (tensor<f32>) -> tensor<1xf32>
    %68 = stablehlo.concatenate %58, %59, %60, %61, %62, %63, %64, %65, %66, %67, dim = 0 : (tensor<1xf32>, tensor<1xf32>, tensor<1xf32>, tensor<1xf32>, tensor<1xf32>, tensor<1xf32>, tensor<1xf32>, tensor<1xf32>, tensor<1xf32>, tensor<1xf32>) -> tensor<10xf32>
    %69 = stablehlo.concatenate %68, %7, dim = 0 : (tensor<10xf32>, tensor<0xf32>) -> tensor<10xf32>
    return %69 : tensor<10xf32>
  }
  func.func private @None(%arg0: tensor<f32>, %arg1: tensor<f32>, %arg2: tensor<f32>) -> (tensor<f32>, tensor<f32>) {
    %0 = stablehlo.convert %arg0 : tensor<f32>
    %1 = stablehlo.multiply %arg2, %0 : tensor<f32>
    %2 = stablehlo.add %arg1, %1 : tensor<f32>
    return %2, %2 : tensor<f32>, tensor<f32>
  }
}

As you can see, there are multiple patterns that repeat 10 times (which is the number of steps), as well as a single StableHLO function at the end, which is called once per step.

As a first step toward reducing lowering and compile time (which could also enable further optimizations downstream), it seems to me that it should be possible to fold all of the recurring patterns into a single StableHLO function like the one at the end, so that the only things repeated 10 times are the call instructions, one per step. In particular, we should be able to fold the following repeating pattern into a single function:

    %13 = stablehlo.slice %0 [8:9] : (tensor<10xf32>) -> tensor<1xf32>
    %14 = stablehlo.reshape %13 : (tensor<1xf32>) -> tensor<f32>
    %15 = stablehlo.slice %2 [8:9] : (tensor<10xf32>) -> tensor<1xf32>
    %16 = stablehlo.reshape %15 : (tensor<1xf32>) -> tensor<f32>
    %17:2 = call @None(%12#0, %14, %16) : (tensor<f32>, tensor<f32>, tensor<f32>) -> (tensor<f32>, tensor<f32>)
    ...
    %66 = stablehlo.broadcast_in_dim %17#1, dims = [] : (tensor<f32>) -> tensor<1xf32>

Would it be possible to do that, as a start?
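For illustration, here is a rough user-level sketch of that folding for the discounted-returns example. It is not how lax.scan is implemented. The names `folded_step` and `get_returns_folded` are hypothetical, and the sketch relies on an assumption: that repeated calls to an inner jitted function with identical argument shapes are lowered to one shared private function. That should be checked by inspecting `lowered.as_text()`. The step index is passed as a traced value, so the slice becomes a dynamic index inside the function rather than a distinct static slice per step.

```python
import jax
import jax.numpy as jnp
from jax import lax


@jax.jit
def folded_step(carry, rewards, discounts, i):
    # Hypothetical helper: the indexing, the scan body, and the reshape of
    # the output all live inside one jitted function; `i` is a traced index.
    reward = lax.dynamic_index_in_dim(rewards, i, keepdims=False)
    discount = lax.dynamic_index_in_dim(discounts, i, keepdims=False)
    new_carry = reward + discount * carry
    return new_carry, new_carry[None]


def get_returns_folded(rewards, discounts):
    carry = jnp.zeros((), rewards.dtype)
    outputs = []
    # Python loop, so it is unrolled at trace time. Iterating in reverse
    # mirrors reverse=True in the original lax.scan call.
    for i in reversed(range(rewards.shape[0])):
        carry, out = folded_step(carry, rewards, discounts, i)
        outputs.append(out)
    # Outputs were collected last-to-first; restore the original order.
    return jnp.concatenate(outputs[::-1])


rewards = jnp.array([1.0, 2.0, 3.0])
discounts = jnp.array([0.5, 0.5, 0.5])
print(jax.jit(get_returns_folded)(rewards, discounts))
```

The final concatenation still grows with the number of steps, so this only addresses the per-step slice, reshape, and broadcast pattern.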

carlosgmartin · 2024-12-08T20:06:28Z

I created the following example: https://gist.github.com/carlosgmartin/a3055c7605157a54d48d108226a48b97.

Output:

venv $ time py better_scan.py --scan lax
len(lowered.as_text())=56412787
python3 better_scan.py --scan lax  23.26s user 3.64s system 118% cpu 22.758 total
venv $ time py better_scan.py --scan v1 
len(lowered.as_text())=20159635
python3 better_scan.py --scan v1  11.65s user 4.01s system 136% cpu 11.499 total
venv $ time py better_scan.py --scan v2
len(lowered.as_text())=20559933
python3 better_scan.py --scan v2  11.23s user 3.80s system 137% cpu 10.901 total

As you can see, with my (partial) re-implementations of scan, lowering is faster (total run time drops from about 23 s to about 11 s) and the lowered text is smaller (about 20 million characters instead of about 56 million).

It would be nice if there were a StableHLO primitive equivalent to the repeat(n, f, x) or repeat_with_outputs(n, f, x) functions in my gist. All such a primitive does is apply the function f a constant (concrete, fixed) number of times n, starting from the input x. scan could be expressed in terms of such a primitive. The bounded-loop structure could then be preserved all the way through compilation, making compilation very fast even for very long sequences.

In other words, we could optimize the loop body itself, but otherwise treat it as a "unit", and then "copy-paste" or "tile" it repeatedly at the very end of compilation.
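To pin down the intended semantics, here is a minimal Python-level sketch of the two functions. It is written from the description above and is not the code in the gist. The signatures, the requirement that n >= 1 for repeat_with_outputs, and the `get_returns_via_repeat` example are all assumptions made for illustration. A real primitive would keep the loop as a single unit instead of unrolling it in Python.

```python
import jax
import jax.numpy as jnp


def repeat(n, f, x):
    # Apply f to x exactly n times; n must be a concrete Python int.
    f = jax.jit(f)
    for _ in range(n):
        x = f(x)
    return x


def repeat_with_outputs(n, f, x):
    # f maps state -> (state, output). Outputs are assumed to be single
    # arrays and are stacked along a new leading axis. Assumes n >= 1.
    f = jax.jit(f)
    outputs = []
    for _ in range(n):
        x, y = f(x)
        outputs.append(y)
    return x, jnp.stack(outputs)


def get_returns_via_repeat(rewards, discounts):
    # Discounted returns expressed with repeat_with_outputs: the state holds
    # a step counter and the carry, and the inputs are closed over.
    n = rewards.shape[0]

    def step(state):
        i, carry = state
        new_carry = rewards[i] + discounts[i] * carry
        return (i - 1, new_carry), new_carry

    init = (jnp.asarray(n - 1), jnp.zeros((), rewards.dtype))
    _, returns = repeat_with_outputs(n, step, init)
    # Steps ran from the last index to the first; restore original order.
    return returns[::-1]


rewards = jnp.array([1.0, 2.0, 3.0])
discounts = jnp.array([0.5, 0.5, 0.5])
print(get_returns_via_repeat(rewards, discounts))
```

Here the loop counter lives in the state, which is one way a scan over inputs could be reduced to plain repetition of a single function.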

carlosgmartin · 2024-12-19T07:32:17Z

I've also opened an issue about this at openxla/stablehlo#2664.
