Merged
46 commits
f26004d
refactor
bnellnm Jul 29, 2025
cb6541a
disable cutlass quantization for now
bnellnm Jul 29, 2025
9d45646
fixes
bnellnm Jul 29, 2025
af89a1c
refactor layer construction
bnellnm Jul 30, 2025
c6fb973
fix
bnellnm Jul 30, 2025
11ace66
add cutlass dir to moe tests
bnellnm Jul 30, 2025
51c9e46
lint + fixes
bnellnm Jul 30, 2025
9ad35cc
fixes
bnellnm Jul 30, 2025
53bf1ad
fixes
bnellnm Jul 31, 2025
1f60cf2
fix
bnellnm Jul 31, 2025
229d08b
cleanups + update doc
bnellnm Aug 1, 2025
9776a0a
lint
bnellnm Aug 1, 2025
235ec9d
remove duplicated test code + improve reporting
bnellnm Aug 1, 2025
70b3ce1
refactor modular tests
bnellnm Aug 4, 2025
8abc29d
refactor
bnellnm Aug 5, 2025
6fa656e
lint
bnellnm Aug 5, 2025
d99cb97
add flashinfer tests
bnellnm Aug 5, 2025
67e7166
fix up nvfp4 tests
bnellnm Aug 5, 2025
0bd1df7
add fp4 support to moe test utils. add fp4 modular tests
bnellnm Aug 6, 2025
101fd03
add nvfp4 modular kernel test
bnellnm Aug 6, 2025
c32e443
fix up test
bnellnm Aug 6, 2025
13d5b80
remove dead code + add asserts
bnellnm Aug 6, 2025
5320b99
reduce K to get fp4 tests to pass
bnellnm Aug 6, 2025
0772868
review comments
bnellnm Aug 6, 2025
333fa1e
fix merge conflicts
bnellnm Aug 7, 2025
f3a8dfc
fix lint
bnellnm Aug 7, 2025
ef1e740
fix lint
bnellnm Aug 7, 2025
d5c0f60
more lint
bnellnm Aug 7, 2025
e8d5f15
more lint
bnellnm Aug 7, 2025
b5d3d91
try to fix lint
bnellnm Aug 7, 2025
4347fc1
add --timeout to data_parallel.py script
bnellnm Aug 8, 2025
22a94a5
fix merge
bnellnm Aug 8, 2025
b017948
fix merge
bnellnm Aug 8, 2025
9ae1e66
fix lint
bnellnm Aug 8, 2025
81e343c
fix test data
bnellnm Aug 11, 2025
8402df7
review comments
bnellnm Aug 12, 2025
ff8daa0
add flag for ep prepare
bnellnm Aug 12, 2025
64a1521
try to fix kernels/moe tests
bnellnm Aug 13, 2025
d861d27
debug modular kernel failure
bnellnm Aug 13, 2025
337648b
lint
bnellnm Aug 13, 2025
fc3f6ff
fix moe tests
bnellnm Aug 13, 2025
4796266
try to fix moe tests again
bnellnm Aug 13, 2025
175b409
change problem size so it is supported by deepep
bnellnm Aug 14, 2025
8dd2012
fix merge issues
bnellnm Aug 14, 2025
3f31e92
Merge branch 'main' into refactor
mgoin Aug 15, 2025
13ba1b1
Merge branch 'main' into refactor
bnellnm Aug 15, 2025
1 change: 1 addition & 0 deletions .buildkite/test-pipeline.yaml
@@ -399,6 +399,7 @@ steps:
- label: Kernels MoE Test %N
mirror_hardwares: [amdexperimental]
source_file_dependencies:
- csrc/quantization/cutlass_w8a8/moe/
- csrc/moe/
- tests/kernels/moe
- vllm/model_executor/layers/fused_moe/
10 changes: 9 additions & 1 deletion docs/design/fused_moe_modular_kernel.md
@@ -175,11 +175,19 @@ implementations that input `FusedMoEActivationFormat.Standard` support chunking

### FusedMoEModularKernel Initialization

`FusedMoEMethodBase` class has 2 methods that are collectively responsible in creating the `FusedMoEModularKernel` object. They are,
`FusedMoEMethodBase` class has 3 methods that are collectively responsible for creating the `FusedMoEModularKernel` object. They are,

* maybe_make_prepare_finalize,
* select_gemm_impl, and
* init_prepare_finalize

#### maybe_make_prepare_finalize

The `maybe_make_prepare_finalize` method is responsible for constructing an instance of `FusedMoEPrepareAndFinalize` when appropriate based on the current all2all backend, e.g. when EP + DP is enabled. The base class method currently constructs all the `FusedMoEPrepareAndFinalize` objects for the EP+DP case. Derived classes can override this method to construct prepare/finalize objects for different scenarios, e.g. `ModelOptNvFp4FusedMoE` can construct a `FlashInferCutlassMoEPrepareAndFinalize` for the EP+TP case.
Please refer to the implementations in,

* `ModelOptNvFp4FusedMoE`

#### select_gemm_impl

The `select_gemm_impl` method is undefined in the base class. It is the responsibility of the derived class to implement a method that constructs a valid/appropriate `FusedMoEPermuteExpertsUnpermute` object.
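
The division of labour between the three methods can be sketched with toy stand-ins. This is only an illustration, not the vLLM implementation: every `Toy*` class, the simplified signatures, and the assumption that `init_prepare_finalize` is the step that calls the other two and assembles the kernel are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyPrepareAndFinalize:  # hypothetical stand-in for FusedMoEPrepareAndFinalize
    backend: str


@dataclass
class ToyExperts:  # hypothetical stand-in for FusedMoEPermuteExpertsUnpermute
    name: str


@dataclass
class ToyModularKernel:  # hypothetical stand-in for FusedMoEModularKernel
    prepare_finalize: ToyPrepareAndFinalize
    experts: ToyExperts


class ToyMethodBase:
    """Toy analogue of FusedMoEMethodBase (signatures are simplified assumptions)."""

    def __init__(self, all2all_backend: Optional[str]):
        self.all2all_backend = all2all_backend
        self.kernel: Optional[ToyModularKernel] = None

    def maybe_make_prepare_finalize(self) -> Optional[ToyPrepareAndFinalize]:
        # Build a prepare/finalize object only when a backend is configured.
        if self.all2all_backend is None:
            return None
        return ToyPrepareAndFinalize(self.all2all_backend)

    def select_gemm_impl(self, prepare_finalize: ToyPrepareAndFinalize) -> ToyExperts:
        # Left undefined in the base class; derived classes must provide it.
        raise NotImplementedError

    def init_prepare_finalize(self) -> None:
        # Assumed orchestration: combine both pieces into a modular kernel.
        prepare_finalize = self.maybe_make_prepare_finalize()
        if prepare_finalize is not None:
            experts = self.select_gemm_impl(prepare_finalize)
            self.kernel = ToyModularKernel(prepare_finalize, experts)


class ToyQuantMethod(ToyMethodBase):
    def select_gemm_impl(self, prepare_finalize: ToyPrepareAndFinalize) -> ToyExperts:
        return ToyExperts(f"experts-for-{prepare_finalize.backend}")


class ToyCustomPrepareMethod(ToyQuantMethod):
    # A derived class may override maybe_make_prepare_finalize for its own scenario.
    def maybe_make_prepare_finalize(self) -> Optional[ToyPrepareAndFinalize]:
        return ToyPrepareAndFinalize("custom")


method = ToyQuantMethod(all2all_backend="toy-backend")
method.init_prepare_finalize()
print(method.kernel)
```

In this sketch a method object with no backend configured simply ends up without a modular kernel, while `ToyCustomPrepareMethod` always gets one because it overrides the first step.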
23 changes: 22 additions & 1 deletion examples/offline_inference/data_parallel.py
@@ -70,12 +70,27 @@ def parse_args():
default=64,
help=("Maximum number of sequences to be processed in a single iteration."),
)
parser.add_argument(
"--max-model-len",
type=int,
help=("Maximum model context length, in tokens (prompt plus generated output)."),
)
parser.add_argument(
"--timeout",
type=int,
default=300,
help=("Number of seconds before unresponsive process is killed."),
)
parser.add_argument(
"--gpu-memory-utilization",
type=float,
default=0.8,
help=("Fraction of GPU memory vLLM is allowed to allocate (0.0, 1.0]."),
)
parser.add_argument(
"--quantization",
type=str,
help=("Quantization method to pass through to vLLM."),
)
return parser.parse_args()
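
As a quick check of how the new flags behave, here is a minimal standalone sketch that rebuilds only the three arguments added in this hunk. The `build_parser` helper and the example value `fp8` are illustrative assumptions; the real script defines many more options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: only the flags added in this hunk are reproduced.
    parser = argparse.ArgumentParser(description="Sketch of the new data_parallel.py flags")
    parser.add_argument("--max-model-len", type=int)
    parser.add_argument("--timeout", type=int, default=300)
    parser.add_argument("--quantization", type=str)
    return parser


# Flags that are omitted fall back to their defaults (None when no default is given).
args = build_parser().parse_args(["--timeout", "600", "--quantization", "fp8"])
print(args.max_model_len, args.timeout, args.quantization)  # None 600 fp8
```

Because `--max-model-len` and `--quantization` have no default, they arrive as `None` unless set, and the diff forwards that value to `LLM(...)` as is.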


@@ -90,7 +105,9 @@ def main(
enforce_eager,
trust_remote_code,
max_num_seqs,
max_model_len,
gpu_memory_utilization,
quantization,
):
os.environ["VLLM_DP_RANK"] = str(global_dp_rank)
os.environ["VLLM_DP_RANK_LOCAL"] = str(local_dp_rank)
@@ -142,7 +159,9 @@ def start(rank):
enable_expert_parallel=True,
trust_remote_code=trust_remote_code,
max_num_seqs=max_num_seqs,
max_model_len=max_model_len,
gpu_memory_utilization=gpu_memory_utilization,
quantization=quantization,
)
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
@@ -198,14 +217,16 @@ def start(rank):
args.enforce_eager,
args.trust_remote_code,
args.max_num_seqs,
args.max_model_len,
args.gpu_memory_utilization,
args.quantization,
),
)
proc.start()
procs.append(proc)
exit_code = 0
for proc in procs:
proc.join(timeout=300)
proc.join(timeout=args.timeout)
if proc.exitcode is None:
print(f"Killing process {proc.pid} that didn't stop within {args.timeout} seconds.")
proc.kill()
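
The join-then-kill pattern in this hunk can be tried in isolation with the sketch below. The `join_or_kill` helper and the exit-code bookkeeping after the kill are assumptions for illustration, since the hunk is truncated at `proc.kill()`. Note that the timeout applies to each `join` in turn, so the total wait can be up to the timeout multiplied by the number of processes.

```python
import multiprocessing as mp
import time


def join_or_kill(procs, timeout):
    """Join each process; kill any that is still running after `timeout` seconds.

    Returns 0 if every process exited cleanly, otherwise a non-zero code.
    (Hypothetical helper; the exit-code handling is an assumption.)
    """
    exit_code = 0
    for proc in procs:
        proc.join(timeout=timeout)
        if proc.exitcode is None:
            # join() returned because the timeout expired, not because
            # the process finished.
            print(f"Killing process {proc.pid} that didn't stop within {timeout} seconds.")
            proc.kill()
            exit_code = 1
        elif proc.exitcode:
            # The process finished on its own but reported a failure.
            exit_code = proc.exitcode
    return exit_code


if __name__ == "__main__":
    fast = mp.Process(target=time.sleep, args=(0.1,))  # finishes in time
    slow = mp.Process(target=time.sleep, args=(30,))   # outlives the timeout
    for p in (fast, slow):
        p.start()
    print("exit code:", join_or_kill([fast, slow], timeout=2))
```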