Program Dispatch Changes to Support MeshWorkloads

Host-side program dispatch can be broken down into six steps:
1. Kernel Binary Compile (`program.compile`)
2. L1 offset generation (`program.finalize`)
3. Lowering: Dispatch Command Generation/Assembly (this is done through functions in `EnqueueProgramCommand`)
4. Lowering: Reserve space for program in the kernel config ring buffer
5. Lowering: Dispatch Command Updates (done through `update_program_dispatch_commands` in `EnqueueProgramCommand`)
6. Writing Dispatch Commands to the Issue Queue

A `MeshWorkload` can contain multiple programs mapped to different devices. The intent is to compile, finalize, and lower each program once, and eventually to multithread or broadcast step 6 across devices.

Today, dispatch commands are generated or updated each time a program is enqueued to a device, which:
1. Leads to repeated work on the host when devices maintain shared state (which is the case for a `MeshWorkload`)
2. Makes parallel dispatch impossible, since each thread modifies a shared data structure (the program command sequence) each time it enqueues a program
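
To illustrate the first point, here is a minimal sketch of why per-device updates become redundant when devices share state. It assumes a simplified ring-buffer allocator; `ToyConfigRingBuffer` and `last_offset_after` are hypothetical names for illustration and do not correspond to the real kernel config ring buffer implementation, which also has to track programs that are still in flight.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical toy model of a kernel config ring buffer (not the real implementation).
// It ignores in-flight programs and only models offset assignment with wrap-around.
struct ToyConfigRingBuffer {
    uint32_t capacity;   // total bytes available in the ring
    uint32_t write_ptr;  // next free byte offset

    // Reserve `size` bytes and return the start offset. Wraps to offset 0 when
    // the remaining tail is too small. Returns nullopt if `size` can never fit.
    std::optional<uint32_t> reserve(uint32_t size) {
        if (size > capacity) return std::nullopt;
        if (write_ptr + size > capacity) write_ptr = 0;  // wrap around
        uint32_t start = write_ptr;
        write_ptr += size;
        return start;
    }
};

// Replay a sequence of program sizes on a fresh ring and return the offset
// assigned to the last program. Assumes every size fits in the ring.
inline uint32_t last_offset_after(const uint32_t* sizes, int count, uint32_t capacity) {
    ToyConfigRingBuffer ring{capacity, 0};
    uint32_t last = 0;
    for (int i = 0; i < count; ++i) last = ring.reserve(sizes[i]).value();
    return last;
}
```

In this model, two devices that start from the same ring state and see the same program sequence are assigned identical offsets, so recomputing the offset (and patching the commands with it) once per device repeats the same work.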

Enabling `MeshWorkload` requires the ability to reuse dispatch commands across devices.

This requires:
1. Program Lowering to be decoupled from `EnqueueProgramCommand`, i.e. fast dispatch command generation and updates should be treated the same way as program compilation (this is possible with Virtual Coordinates)
2. `enqueue_program_dispatch_core` to match across all devices since the dispatch commands contain the dispatch core (in the go signal)
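
The second requirement can be expressed as a simple precondition check. The sketch below is illustrative only: `ToyCoreCoord`, `ToyDevice`, and `dispatch_cores_match` are hypothetical stand-ins, and the real device and core-coordinate types may look different.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a core coordinate.
struct ToyCoreCoord {
    uint32_t x;
    uint32_t y;
};

inline bool operator==(const ToyCoreCoord& a, const ToyCoreCoord& b) {
    return a.x == b.x && a.y == b.y;
}

// Hypothetical stand-in for a device that exposes its program dispatch core.
struct ToyDevice {
    int id;
    ToyCoreCoord enqueue_program_dispatch_core;
};

// Returns true when every device reports the same dispatch core, which is the
// condition under which one set of dispatch commands (with the dispatch core
// embedded in the go signal) could be reused across all devices.
inline bool dispatch_cores_match(const std::vector<ToyDevice>& devices) {
    for (std::size_t i = 1; i < devices.size(); ++i) {
        if (!(devices[i].enqueue_program_dispatch_core ==
              devices[0].enqueue_program_dispatch_core)) {
            return false;
        }
    }
    return true;  // an empty or single-device list trivially matches
}
```

A check along these lines could run once when a workload is created, so that a mismatch is caught before any commands are shared.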

Decoupling the lowering step from the dispatch step enables the following:

```cpp
auto program = CreateRandomProgram();
// Program State Modifications: This can be done when creating/compiling a `MeshWorkload` containing this program
program.compile(device); // Compile Kernel Binaries
program.finalize(device); // Compute relative offsets in L1
program.lower(device); // Lower: Assemble Dispatch Commands
reserve_space_in_kernel_config_buffer(program, device); // Reserve space for the program in the kernel config ring buffer
update_program_dispatch_commands(program); // Update dispatch commands for the program based on current state of the device

// Potentially Multi-Threaded Program Dispatch. Can also be broadcasted.
for (auto device : devices) {
    EnqueueProgramCommandSequence(program.dispatch_commands);
}
```

The scope of this work is to move the lowering step out of `EnqueueProgramCommand` and into the program itself (inside the `program.lower(device)` API). Programs can then be lowered ahead of time.
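
As a self-contained illustration of the lower-once, dispatch-many idea, the sketch below uses toy types rather than the real API. `ToyLoweredProgram`, `ToyIssueQueue`, `lower_once`, and `dispatch_to_all` are hypothetical names, and the command words are placeholder values. The point it tries to show is that dispatch only reads the pre-built commands through a `const` reference.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a program whose dispatch commands were assembled ahead of time.
struct ToyLoweredProgram {
    std::vector<uint32_t> dispatch_commands;  // built once, read-only afterwards
};

// Hypothetical stand-in for a per-device issue queue.
struct ToyIssueQueue {
    std::vector<uint32_t> words;
};

// "Lowering" in this toy model: assemble placeholder command words exactly once.
inline ToyLoweredProgram lower_once(uint32_t num_commands) {
    ToyLoweredProgram program;
    for (uint32_t i = 0; i < num_commands; ++i) {
        program.dispatch_commands.push_back(0xC0DE0000u + i);
    }
    return program;
}

// Dispatch: copy the pre-built commands into every device queue. The program is
// taken by const reference, so no iteration mutates shared state.
inline void dispatch_to_all(const ToyLoweredProgram& program,
                            std::vector<ToyIssueQueue>& queues) {
    for (ToyIssueQueue& queue : queues) {
        queue.words.insert(queue.words.end(),
                           program.dispatch_commands.begin(),
                           program.dispatch_commands.end());
    }
}
```

Because each loop iteration touches only its own queue and reads the shared commands without modifying them, the iterations are independent. Under that assumption the loop could be split across threads or replaced by a broadcast without extra synchronization on the program.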

Additionally, to keep a shared codebase between programs and `MeshWorkload`s, the utility functions responsible for program finalization and lowering will be placed in `program_dispatch_utils.cpp` so that both can reuse them.

Program Dispatch Changes to Support MeshWorkloads #16356
