This document defines a formal framework for simulating the performance of distributed systems characterized by interleaved data movement and computational workloads, such as those found in LLM serving systems. The framework employs a Max-Min Fairness algorithm for bandwidth allocation across shared network segments and a discrete-event callback mechanism to model complex, state-dependent operational sequences. The primary use case for this project is to facilitate the analysis of various configurations of a disaggregated PD inference system via graphs and an interactive tool with six degrees of freedom.
A communication network is modeled as a set of edges (`CommChannel`), each characterized by:
- Bandwidth ($B_e$): Peak data transfer rate in bytes per second.
- Latency ($L_e$): Propagation delay in seconds.
A data transfer task (`DataBatch`) is defined by its payload size in bytes and the ordered path of `CommChannel` edges it must traverse.
A computation task (`ComputeJob`) is modeled as a pure time delay on a compute resource; it consumes no network bandwidth.
To resolve contention on shared edges, the simulator implements Max-Min Fairness. At each time step, bandwidth is allocated so that:
- No batch receives more than its bottleneck rate.
- Batches with the same bottleneck are treated equally.
- Surplus bandwidth from constrained batches is redistributed to unconstrained ones.
The rate granted to each batch $b$ is therefore the smallest fair share along its path, $r_b = \min_{e \in \mathrm{path}(b)} f_e$, where $f_e$ denotes the fair share of edge $e$'s bandwidth $B_e$ after the redistribution above.
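For illustration, this allocation can be computed with a progressive-filling procedure. The sketch below is a minimal, standalone version, assuming only that batches expose a path of edges and edges expose bandwidth_bps (per the class reference later in this document); it is not the simulator's actual code.

def max_min_allocate(batches, edges):
    """Progressive filling: repeatedly freeze the most constrained edge.

    batches: objects exposing .path (list of edges)
    edges:   objects exposing .bandwidth_bps (bytes per second)
    Returns {batch: allocated rate in bytes per second}.
    """
    rate = {b: 0.0 for b in batches}
    frozen = set()                                  # batches whose rate is fixed
    capacity = {e: e.bandwidth_bps for e in edges}  # unallocated bandwidth per edge

    while len(frozen) < len(batches):
        # Fair share on each edge among its still-unfrozen users.
        shares = {}
        for e in edges:
            users = [b for b in batches if e in b.path and b not in frozen]
            if users:
                shares[e] = capacity[e] / len(users)
        if not shares:
            break
        # The tightest edge is the bottleneck for every batch crossing it.
        bottleneck = min(shares, key=shares.get)
        for b in batches:
            if b not in frozen and bottleneck in b.path:
                rate[b] = shares[bottleneck]
                for e in b.path:
                    capacity[e] -= shares[bottleneck]
                frozen.add(b)
    return rate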
The framework relies on an inversion-of-control pattern. Users must implement the SimulatedSystem interface to handle the lifecycle of tasks:
- `on_data_transfer_complete(simulator, batch)`: Invoked when a batch clears its path latency and its payload is fully serialized. May return a `ComputeJob`, which the simulator adds to its `active_compute` set, or may add `ComputeJob`s directly.
- `on_compute_complete(simulator, job)`: Invoked when a computation's duration has elapsed.
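As a point of reference, a minimal form of this interface could look like the sketch below; it is an abstract base class inferred from the callbacks described above, and the actual definition may differ.

from abc import ABC, abstractmethod

class SimulatedSystem(ABC):
    """Callback interface the engine invokes as tasks complete."""

    @abstractmethod
    def on_data_transfer_complete(self, simulator, batch):
        """Called when a DataBatch has finished; may return a ComputeJob
        (or add batches/jobs to the simulator directly and return None)."""

    @abstractmethod
    def on_compute_complete(self, simulator, job):
        """Called when a ComputeJob's duration has elapsed."""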
The engine maintains two active queues (`active_batches` and `active_compute`) and a historical log of completed tasks. It executes a time-stepping loop in which each tick proceeds through three phases:
- Latency Phase: A batch enters a "flight" state in which `latency_remaining` decrements.
- Transfer Phase: Once latency reaches zero, the batch participates in the Max-Min allocation.
- Event Triggering: Upon completion, the simulator queries the `SimulatedSystem`. The system may inject new batches or compute jobs back into the engine, allowing for recursive workflow modeling.
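Put together, one tick of the loop could look roughly like the sketch below. It reuses the max_min_allocate sketch from the fairness section; sim, system, and dt are hypothetical names, while active_batches, active_compute, completed_batches, completed_compute, and current_time follow the names used elsewhere in this document. The real engine differs in detail.

def step(sim, system, dt):
    """One simplified simulation tick (illustrative, not the actual engine)."""
    # Latency phase: batches still "in flight" burn down propagation delay.
    for b in sim.active_batches:
        if b.latency_remaining > 0:
            b.latency_remaining = max(0.0, b.latency_remaining - dt)

    # Transfer phase: batches past their latency share bandwidth via Max-Min.
    transferring = [b for b in sim.active_batches if b.latency_remaining == 0]
    edges = {e for b in transferring for e in b.path}
    rates = max_min_allocate(transferring, edges)
    for b in transferring:
        b.rem_size = max(0.0, b.rem_size - rates[b] * dt)

    # Event triggering: completed batches are handed to the SimulatedSystem,
    # which may inject new batches or compute jobs back into the engine.
    for b in [b for b in transferring if b.rem_size == 0]:
        b.end_time = sim.current_time
        sim.active_batches.remove(b)
        sim.completed_batches.append(b)
        job = system.on_data_transfer_complete(sim, b)
        if job is not None:
            sim.active_compute.append(job)

    # Compute jobs simply count down their remaining duration.
    for j in list(sim.active_compute):
        j.rem_duration = max(0.0, j.rem_duration - dt)
        if j.rem_duration == 0:
            j.end_time = sim.current_time
            sim.active_compute.remove(j)
            sim.completed_compute.append(j)
            system.on_compute_complete(sim, j)

    sim.current_time += dt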
`DataBatch`: Represents a payload traversing a set of network segments.
- Attributes:
  - `total_size`: Size in bytes.
  - `rem_size`: Bytes remaining to be transferred.
  - `path`: An ordered list of `CommChannel` objects.
  - `latency_remaining`: Accrued propagation delay that must expire before data begins moving.
  - `start_time` / `end_time`: Timestamps for performance analysis.
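A plain-Python rendering of this structure, assuming the constructor signature used in the examples below (name, size, path). The eq=False setting and the accrual of latency over the whole path are assumptions of this sketch, not facts about the real class.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)   # eq=False keeps identity hashing for use in sets/dicts
class DataBatch:
    name: str
    total_size: float                               # payload size in bytes
    path: List["CommChannel"]                       # ordered edges to traverse
    rem_size: float = field(init=False)             # bytes left to transfer
    latency_remaining: float = field(init=False)    # propagation delay to expire
    start_time: Optional[float] = None
    end_time: Optional[float] = None

    def __post_init__(self):
        self.rem_size = self.total_size
        # Assumed: the batch accrues the propagation delay of every hop.
        self.latency_remaining = sum(ch.latency_s for ch in self.path)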
`CommChannel`: Represents a physical or logical link between two points.
- Attributes:
  - `bandwidth_bps`: Average data rate in bytes per second.
  - `latency_s`: Propagation delay in seconds.
- Behavior: Used by the simulator to determine both the "flight time" (latency) and the "serialization time" (throughput) of data.
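Taken together, the flight time and serialization time give the familiar lower bound on a single, uncontended transfer. The sketch below uses illustrative bandwidth and latency figures; they are assumptions, not calibrated values.

from dataclasses import dataclass

@dataclass(frozen=True)   # frozen makes channels hashable, handy as dict keys
class CommChannel:
    name: str
    bandwidth_bps: float   # bytes per second (per the attribute above)
    latency_s: float       # propagation delay in seconds

def unconstrained_transfer_time(size_bytes, channel):
    """Flight time + serialization time when the batch has the link to itself."""
    return channel.latency_s + size_bytes / channel.bandwidth_bps

# Example: 1 GiB over a link modeled at 25 GB/s with 1 microsecond of latency.
link = CommChannel("PCIe Gen4 x16", bandwidth_bps=25e9, latency_s=1e-6)
print(unconstrained_transfer_time(1024**3, link))   # ~0.043 s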
`ComputeJob`: Represents a task requiring execution time on a resource without consuming network bandwidth.
- Attributes:
  - `total_duration`: Execution time in seconds.
  - `rem_duration`: Seconds remaining until completion.
  - `start_time` / `end_time`: Timestamps for performance analysis.
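A matching sketch for the compute-side structure, assuming the two-argument constructor seen in the examples below (e.g. ComputeJob("GPU_Kernel", 0.1)):

from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)   # identity hashing, as in the DataBatch sketch
class ComputeJob:
    name: str
    total_duration: float                       # execution time in seconds
    rem_duration: float = field(init=False)     # seconds remaining
    start_time: Optional[float] = None
    end_time: Optional[float] = None

    def __post_init__(self):
        self.rem_duration = self.total_duration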
Supporting metadata classes used to derive `total_size` for a `DataBatch` or `total_duration` for a `ComputeJob`.
- GPU: Stores VRAM capacity, memory bandwidth, and peak FLOPS.
- LLM: Stores model dimensions (Weights, Layers, Heads) and KV-cache calculation logic.
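As an illustration of the KV-cache calculation such metadata supports, the widely used estimate is two tensors (K and V) per layer, each of size batch size × sequence length × hidden size. The function and model figures below are an example, not taken from the LLM class itself.

def kv_cache_bytes(num_layers, hidden_size, batch_size, seq_len, bytes_per_elem=2):
    """Approximate KV-cache footprint: K and V tensors for every layer (FP16 by default)."""
    return 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_elem

# Example: a Llama-2-7B-like model (32 layers, hidden size 4096),
# batch size 8, sequence length 4096, FP16 elements.
print(kv_cache_bytes(32, 4096, 8, 4096) / 1024**3, "GiB")   # 16.0 GiB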
Goal: Model a single data transfer across a PCIe link with propagation delay.
pci = CommChannel("PCIe Gen4")
sim = CommNetworkSimulator()
batch = DataBatch("Initial_Data", 1024**3, [pci]) # 1GB
sim.add_batch(batch)
class SimpleSystem(SimulatedSystem):
def on_data_transfer_complete(self, sim, batch):
print(f"Transfer complete at {sim.current_time}s")
def on_compute_complete(self, sim, job): pass
sim.run(SimpleSystem())

Goal: Demonstrate Max-Min Fairness where a slow link prevents a user from hogging a fast link.
fast_link = CommChannel("NVLink") # 300GB/s
slow_link = CommChannel("Ethernet 100G") # 10GB/s
# Batch A uses the fast link only
# Batch B uses the fast link AND the slow link
sim = CommNetworkSimulator()
sim.add_batch(DataBatch("User_A", 10**9, [fast_link]))
sim.add_batch(DataBatch("User_B", 10**9, [fast_link, slow_link]))
# Behavior: User_B is bottlenecked by the slow_link.
# Max-Min gives User_A the remaining slack on fast_link.
sim.run(SilentSystem())

Goal: Model a GPU processing a received batch and then sending results.
class RecursiveSystem(SimulatedSystem):
def on_data_transfer_complete(self, sim, batch):
if batch.name == "Inputs":
# Start a compute job (0.1s duration)
return ComputeJob("GPU_Kernel", 0.1)
def on_compute_complete(self, sim, job):
if job.name == "GPU_Kernel":
# Send results back
sim.add_batch(DataBatch("Results", 10**6, [pci]))
sim = CommNetworkSimulator()
sim.add_batch(DataBatch("Inputs", 10**7, [pci]))
sim.run(RecursiveSystem())

Goal: Model a multi-hop pipeline where data passes through various mediums (PCIe -> InfiniBand -> Ethernet).
pci = CommChannel("PCIe Gen4")
ib = CommChannel("Infiniband NDR")
eth = CommChannel("Ethernet 100G")
class PipelineSystem(SimulatedSystem):
def on_data_transfer_complete(self, sim, batch):
if "Stage_1" in batch.name:
sim.add_batch(DataBatch("Stage_2", batch.total_size, [ib]))
elif "Stage_2" in batch.name:
return ComputeJob("Final_Agg", 0.05)
def on_compute_complete(self, sim, job):
print("Pipeline sequence finished.")
sim = CommNetworkSimulator()
sim.add_batch(DataBatch("Stage_1", 5*10**8, [pci]))
sim.run(PipelineSystem())

This section formalizes the principal use case of the simulator: the quantitative analysis of disaggregated prefill/decode (PD), pipeline-parallel (PP) LLM inference systems. The framework enables systematic exploration of system configurations through graphical outputs and interactive tools supporting six degrees of freedom. Both logical and physical abstractions are employed to accurately model and simulate the operational characteristics of the disaggregated PD system.
The logical view models the system in terms of the dimensionality and partitioning of `DataBatch` and `ComputeJob` objects. For example, the pipeline parallelism degree (PP), batch size ($N$), sequence length ($T$), and prefill chunk size ($M$) together determine how many `ComputeJob`s and `DataBatch`es are created and how large each one is.
The physical view describes the explicit transport paths and constituent edges traversed by each `DataBatch`. For clarity, the diagram below does not distinguish the two directions of each full-duplex communication channel, which in a rigorous model would be represented as separate `CommChannel` instances.

The class `DisaggregatedPDSystemPP` implements the `start()` method, which creates the initial `ComputeJob` for the rank-0 GPU of the prefill pipeline to kick off the prefill process. A subsequent call to `CommNetworkSimulator.run()` starts the simulation.
`DisaggregatedPDSystemPP` realizes the `SimulatedSystem` interface in order to receive calls to `on_compute_complete()` and `on_data_transfer_complete()` announcing the completion of `ComputeJob`s and `DataBatch` transfers. It handles these calls by creating additional `ComputeJob`s and/or `DataBatch`es to orchestrate the disaggregated PD system's behavior.
After the simulation completes, the user may access the start/end times of each completed `ComputeJob` and `DataBatch` transfer via `CommNetworkSimulator.completed_compute[]` and `CommNetworkSimulator.completed_batches[]` in order to analyze the performance of the simulated system.
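For example, TTDS (time to decode start, used by the visualizers below) can be read off these logs as the earliest start time of any decode-stage `ComputeJob`. The "Decode" naming filter in this sketch is hypothetical and must match whatever naming convention `DisaggregatedPDSystemPP` actually uses.

def time_to_decode_start(sim):
    """Earliest start of a decode-stage ComputeJob, relative to simulation t = 0."""
    decode_starts = [job.start_time for job in sim.completed_compute
                     if "Decode" in job.name]          # hypothetical naming filter
    return min(decode_starts) if decode_starts else None

# Typical usage once sim.run(system) has returned:
# print(f"TTDS = {time_to_decode_start(sim):.4f} s")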
To enable rigorous post-simulation analysis, the framework provides visualization modules that generate three-dimensional charts depicting the time to decode start (TTDS) as a function of key system parameters. In addition, an interactive Gantt chart visualizes the temporal evolution of all ComputeJob and DataBatch events, supporting detailed investigation of concurrency and resource allocation. The Gantt chart interface exposes six principal system parameters for user-driven exploration:
- Prefill GPU type (e.g. NVIDIA H100, NVIDIA RTX PRO 6000, AMD Instinct MI300A, etc.)
- Decode GPU type (e.g. NVIDIA H100, NVIDIA RTX PRO 6000, AMD Instinct MI300A, etc.)
- Pipeline parallelism degree (PP)
- Batch size ($N$)
- Sequence length ($T$)
- Prefill chunk size ($M$)
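Viewed from the code, these six degrees of freedom amount to one configuration record per scenario. The grouping below is a hypothetical illustration, not the tool's actual API.

from dataclasses import dataclass

@dataclass
class ScenarioConfig:
    prefill_gpu: str       # e.g. "NVIDIA H100"
    decode_gpu: str        # e.g. "AMD Instinct MI300A"
    pp: int                # pipeline parallelism degree (PP)
    batch_size: int        # N
    seq_len: int           # T
    prefill_chunk: int     # M

config = ScenarioConfig("NVIDIA H100", "NVIDIA H100",
                        pp=4, batch_size=8, seq_len=8192, prefill_chunk=1024)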
Limitation: The present implementation permits independent specification of the pipeline parallelism degree ($\mathrm{PP}$) for both the prefill and decode pipelines. However, in a rigorous system, the decode pipeline should employ two-dimensional parallelism, specifically a combination of pipeline parallelism ($\mathrm{PP}$) and tensor parallelism ($\mathrm{TP}$). Optimal performance requires that the degrees of $\mathrm{PP}$ and $\mathrm{TP}$ for the decode stage be selected jointly, reflecting the physical constraints and workload characteristics of the system.
The `MTSweepVisualizer` module systematically explores the impact of prefill chunk size ($M$) and sequence length ($T$) on the time to decode start (TTDS).
`MTSweepVisualizer` generates `M_T_TTDS_sweep_3d.html`, which the user may load in a browser to display and manipulate the 3D graph.
The `PPTSweepVisualizer` module investigates how the degree of pipeline parallelism (PP) and the sequence length ($T$) affect TTDS.
`PPTSweepVisualizer` generates `PP_T_TTDS_sweep_3d.html`, which the user may load in a browser to display and manipulate the 3D graph.
The Gantt6DVisualizer provides an interactive, high-dimensional visualization environment for analyzing the temporal dynamics of disaggregated LLM inference systems. This tool renders a Gantt chart that encodes the start and end times of all ComputeJob and DataBatch events, enabling detailed inspection of system concurrency and resource utilization. Users can manipulate six key system parameters—prefill GPU type, decode GPU type, pipeline parallelism degree (PP), batch size (N), sequence length (T), and prefill chunk size (M)—in real time. The visualizer computes and displays key performance metrics, including Time to Decode Start (TTDS), Time to First Token (TTFT), and Time per Output Token (TPOT), directly from the simulated event traces. This facilitates comprehensive scenario exploration, bottleneck identification, and sensitivity analysis of latency to various system parameters, making the tool a powerful instrument for both qualitative and quantitative performance evaluation.
The `Gantt6DVisualizer` implements a web app running on a local server to which the user may connect via a browser. Note that the server listens on all interfaces so that remote users can reach it. This represents a security risk, which may be mitigated by modifying the code to listen only on 127.0.0.1.
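If the underlying server is Flask- or Dash-based (an assumption; check the actual `Gantt6DVisualizer` code), the change typically amounts to binding the app to the loopback address, as in this minimal Flask stand-in:

from flask import Flask

app = Flask(__name__)   # stand-in for whatever web app Gantt6DVisualizer builds

if __name__ == "__main__":
    # host="127.0.0.1" accepts local connections only;
    # host="0.0.0.0" (listening on all interfaces) is what exposes the risk above.
    app.run(host="127.0.0.1", port=8050)   # port is illustrative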





