Program Dispatch Changes to Support MeshWorkloads #16356

Closed
@tt-asaigal

Description

Host-side program dispatch can be broken down into six steps:

  1. Kernel Binary Compile (program.compile)
  2. L1 offset generation (program.finalize)
  3. Lowering: Dispatch Command Generation/Assembly (this is done through functions in EnqueueProgramCommand)
  4. Lowering: Reserve space for program in the kernel config ring buffer
  5. Lowering: Dispatch Command Updates (done through update_program_dispatch_commands in EnqueueProgramCommand)
  6. Writing Dispatch Commands to the Issue Queue

A MeshWorkload can contain multiple programs mapped to different devices. The intent is to compile + finalize + lower a program once, and eventually multithread/broadcast step 6 across devices.

Today, dispatch commands are generated or updated each time a program is enqueued to a device, which:

  1. Leads to repeated work on the host when devices share state (this is true for MeshWorkloads)
  2. Makes parallel dispatch impossible, since each thread modifies a shared data structure (the program command sequence) each time it enqueues a program

Enabling MeshWorkload requires the ability to reuse dispatch commands across devices.

This requires:

  1. Program Lowering to be decoupled from EnqueueProgramCommand, i.e. fast dispatch command generation and updates should be treated the same way as program compilation (this is possible with Virtual Coordinates)
  2. enqueue_program_dispatch_core to match across all devices since the dispatch commands contain the dispatch core (in the go signal)
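
As a rough illustration of the first requirement, the sketch below treats lowering like compilation: it runs once, caches its result on the program, and the enqueue path only reads that result. `MockProgram`, `MockDevice`, `enqueue_program`, and the placeholder command words are all hypothetical stand-ins, not tt-metal types, and the `device` argument of the proposed `program.lower(device)` is left out for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a device: it only owns an issue queue.
struct MockDevice {
    std::vector<std::uint32_t> issue_queue;
};

// Hypothetical stand-in for a program that caches its lowered commands.
class MockProgram {
public:
    // Lowering: assemble the dispatch commands once and cache them.
    // Repeated calls are no-ops, mirroring how compilation is cached.
    void lower() {
        if (lowered_) {
            return;
        }
        dispatch_commands_ = {0x1, 0x2, 0x3};  // placeholder command words
        lowered_ = true;
        ++lower_count_;
    }

    int lower_count() const { return lower_count_; }

    // Read-only access; the program must have been lowered first.
    const std::vector<std::uint32_t>& dispatch_commands() const {
        assert(lowered_);
        return dispatch_commands_;
    }

private:
    std::vector<std::uint32_t> dispatch_commands_;
    bool lowered_ = false;
    int lower_count_ = 0;
};

// Enqueue takes the program by const reference, so it cannot regenerate
// or update commands; it only copies them into the device's issue queue.
void enqueue_program(MockDevice& device, const MockProgram& program) {
    const std::vector<std::uint32_t>& commands = program.dispatch_commands();
    device.issue_queue.insert(device.issue_queue.end(), commands.begin(), commands.end());
}
```

With this shape, enqueueing the same lowered program to several devices performs the lowering work exactly once, and every device receives an identical command sequence.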

Decoupling the lowering step from the dispatch step enables the following:

auto program = CreateRandomProgram();
// Program State Modifications: This can be done when creating/compiling a `MeshWorkload` containing this program
program.compile(device); // Compile Kernel Binaries
program.finalize(device); // Compute relative offsets in L1
program.lower(device); // Lower: Assemble Dispatch Commands
reserve_space_in_kernel_config_buffer(program, device); // Reserve space for the program in the kernel config ring buffer
update_program_dispatch_commands(program); // Update dispatch commands for the program based on current state of the device

// Potentially multi-threaded program dispatch. Can also be broadcast.
for (auto device : devices) {
    EnqueueProgramCommandSequence(program.dispatch_commands);
}
}
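
The multi-threaded variant of the loop above could look roughly like the following sketch. It assumes the command sequence is fully lowered and no longer modified, so each thread only reads it and writes to its own device's issue queue. `CommandSequence`, `MockDevice`, `enqueue_command_sequence`, and `dispatch_to_all` are hypothetical names used for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-ins; none of these are the real tt-metal types.
using CommandSequence = std::vector<std::uint32_t>;

struct MockDevice {
    // Per-device queue, so threads never share a write target.
    std::vector<std::uint32_t> issue_queue;
};

// Step 6 only: copy already-lowered commands into one device's issue queue.
// The sequence is a const reference, so enqueueing cannot mutate shared state.
void enqueue_command_sequence(MockDevice& device, const CommandSequence& commands) {
    device.issue_queue.insert(device.issue_queue.end(), commands.begin(), commands.end());
}

// Spawn one thread per device; every thread reads the same immutable sequence.
void dispatch_to_all(std::vector<MockDevice>& devices, const CommandSequence& commands) {
    std::vector<std::thread> workers;
    workers.reserve(devices.size());
    for (MockDevice& device : devices) {
        workers.emplace_back([&device, &commands] {
            enqueue_command_sequence(device, commands);
        });
    }
    for (std::thread& worker : workers) {
        worker.join();
    }
}
```

Because the shared data is read-only during dispatch, no locking is needed in this sketch. If step 5 still had to patch the commands per device, each thread would need its own copy or the updates would have to be serialized.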

The scope of this work is to move the lowering step out of EnqueueProgramCommand and into the program itself (behind the program.lower(device) API). Programs can then be lowered ahead of time.

Additionally, to maintain a shared code-base between programs and MeshWorkloads, utility functions responsible for program finalization and lowering will be placed in program_dispatch_utils.cpp to allow reuse.
