Support loading autotuned results from json for cutlass fp4 moe backends #1310
Conversation
Summary of Changes
Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a caching mechanism for autotuned results of the cutlass fp4 moe operation. By saving optimal configurations to a JSON file and allowing them to be loaded, it aims to significantly reduce the overhead of repeated autotuning, especially during inference. This change also includes a new set of pre-tuned configurations for NVIDIA B200 GPUs and minor improvements to the autotuning benchmark script and logging utility.
Highlights
- **Caching Autotuned Results**: Implemented functionality to save and load autotuned configurations for cutlass fp4 moe operations to/from a JSON file, leveraging the `flashinfer.autotuner` module.
- **Pre-tuned Configurations for B200**: Added a new JSON file (`benchmarks/trtllm_fused_moe_NVIDIA_B200.json`) containing pre-computed optimal configurations for NVIDIA B200 GPUs, specifically for `trtllm::fused_moe::gemm1` and `gemm2` operations across various input sizes.
- **Autotuning Process Refinement**: Modified the benchmark script (`benchmarks/bench_cutlass_fused_moe.py`) to enable saving of autotuned results and adjusted the `cutlass_fused_moe` call to use a configurable `tune_max_num_tokens` parameter instead of a hardcoded value.
- **Dynamic Cache Loading**: Enhanced the autotuner to conditionally load cached results from a JSON file based on the `FLASHINFER_AUTOTUNER_LOAD_FROM_FILE` environment variable, preventing redundant autotuning during inference when pre-tuned results are available.
- **Logging Utility Improvement**: Introduced an `info_once` method in the JIT logger (`flashinfer/jit/core.py`) to prevent repetitive log messages, improving log readability during autotuning and cache loading (a minimal sketch of this pattern follows the list).
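The `info_once` helper itself is not shown in this thread; below is a minimal, hypothetical sketch of the deduplicated-logging pattern it describes (emit a given message only once), not the actual `flashinfer/jit/core.py` implementation.

```python
import logging

logging.basicConfig(level=logging.INFO)


class OnceLogger:
    """Sketch of an "info_once"-style wrapper (hypothetical, for illustration)."""

    def __init__(self, name: str):
        self._logger = logging.getLogger(name)
        self._seen: set[str] = set()

    def info_once(self, msg: str) -> None:
        # Remember emitted messages so repeated calls (e.g. one per cached
        # autotuner lookup) do not flood the log.
        if msg not in self._seen:
            self._seen.add(msg)
            self._logger.info(msg)


logger = OnceLogger("flashinfer.jit")
logger.info_once("Loading autotuned MoE configs from JSON cache")
logger.info_once("Loading autotuned MoE configs from JSON cache")  # suppressed
```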
Code Review
This pull request introduces a mechanism to cache autotuning results to a JSON file, which can then be loaded to avoid expensive tuning runs during inference. The changes include modifications to the autotuner, a benchmark script to generate the cache, and a new logger utility. The main feedback is on the implementation of loading from the JSON cache in `flashinfer/autotuner.py`. The current implementation is not robust against missing or corrupt cache files and is inefficient in its caching strategy. I've provided a suggestion to improve this. Overall, this is a valuable feature for improving performance by avoiding repeated autotuning.
flashinfer/autotuner.py
Outdated
```python
def load_from_json(key):
    with open(get_json_path(), "r") as f:
        configs = json.load(f)
    k = str((key[0], key[1], key[3]))
    if k in configs:
        return True, configs[k][0], configs[k][1], None
    return False, 0, -1, None
```
This implementation of `load_from_json` has two potential issues:

1. **Lack of error handling**: It doesn't handle `FileNotFoundError` if the cache file doesn't exist, or `json.JSONDecodeError` if the file is corrupt. This could cause the program to crash.
2. **Inefficient caching**: The `@lru_cache` is applied to `load_from_json`, which takes `key` as an argument. This means the JSON file will be re-read and re-parsed for every unique `key`, which is inefficient since the file content is the same for all keys within a run.

A better approach is to cache the result of reading the file itself. You can introduce a helper function to read and parse the JSON file and apply `lru_cache` to it. This ensures the file is read only once.
```python
from functools import lru_cache
import json


@lru_cache(maxsize=1)
def _read_autotune_cache_file(path):
    """Helper to read and parse the autotune cache file, cached."""
    try:
        with open(path, "r") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def load_from_json(key):
    json_path = get_json_path()
    configs = _read_autotune_cache_file(json_path)
    if configs is None:
        return False, 0, -1, None
    k = str((key[0], key[1], key[3]))
    if k in configs:
        return True, configs[k][0], configs[k][1], None
    return False, 0, -1, None
```
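With this approach, `_read_autotune_cache_file` is keyed only on the path and uses `lru_cache(maxsize=1)`, so the JSON file is parsed at most once per process for a given path; every subsequent `load_from_json` call reduces to a dictionary lookup, and a missing or corrupt file degrades gracefully to the `(False, 0, -1, None)` fallback instead of raising.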
Great work, thanks for bringing cutlass kernel tuning to flashinfer.

My main concern is how we store the best configs. Currently I believe storing them as Python objects under `flashinfer.tuning_configs` (I'm flexible with the naming) is easier for packaging compared to JSON:

- If `tuning_configs` is a Python module, we just need to add the `flashinfer.tuning_configs` module to `packages` (line 40 in 43e08e9: `packages = [`).
- If they are JSON files, we also have to update the package dir (line 61 in 43e08e9: `[tool.setuptools.package-dir]`, and line 67 in 43e08e9: `[tool.setuptools.package-data]`).

So in general I feel like using Python to store them is the most convenient solution. If the tuning configuration is large, we can consider hosting it on an artifactory (left as future work). A hypothetical sketch of such a module layout follows.
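To make that concrete, here is a hypothetical sketch of what a `flashinfer.tuning_configs` module could look like. The module name comes from the comment above; the key/value layout (stringified shape key mapped to a config/tactic pair) is an assumption modeled on the JSON entries handled by `load_from_json`, and the values are placeholders, not real tuned results.

```python
# Hypothetical module: flashinfer/tuning_configs/trtllm_fused_moe_nvidia_b200.py
# Shipping tuned results as plain Python data means packaging only needs the
# module added to `packages` in pyproject.toml, with no package-data entries.

# Placeholder entries for illustration only -- not real tuned results.
BEST_CONFIGS: dict[str, tuple[int, int]] = {
    "(4096, 7168, 2048)": (0, 3),
    "(1024, 7168, 2048)": (1, 5),
}


def lookup(key: str) -> tuple[bool, int, int]:
    """Return (found, config, tactic) for a stringified shape key."""
    if key in BEST_CONFIGS:
        config, tactic = BEST_CONFIGS[key]
        return True, config, tactic
    return False, 0, -1
```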
Is it possible that the autotune pass is called at the profiling stage in the framework, and also …

@yzh119 I think all the comments are addressed. PTAL.

if we have such a stage in FW, we don't need the …
Looks good! Left a minor suggestion
```diff
@@ -195,23 +186,44 @@ def bench_cutlass_fused_moe(
             output=flash_output,
         )
     )
     avg_ms = sum(ms_list) / len(ms_list)
     print("input\tweight1\tweight2\ttime(ms)")
```
These two lines are not aligned and are displayed as:

```
input weight1 weight2 time(ms)
(32, 3584) (32, 4096, 7168) (32, 7168, 2048) 0.19970719873905182
```

```diff
-    print("input\tweight1\tweight2\ttime(ms)")
+    print(f"{'input':<15} {'weight1':<20} {'weight2':<20} {'time(ms)'}")
+    print(
+        f"{str(tuple(hidden_states.shape)):<15} {str(tuple(w1.shape)):<20} {str(tuple(w2.shape)):<20} {avg_ms:.3f}"
+    )
```
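For reference, the `:<15` and `:<20` format specs left-align each value in a fixed-width field, and `:.3f` rounds the time to three decimals, so the header and the data row stay column-aligned regardless of how long the shape strings are.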
Done. PTAL.
This is a typical workflow:

```bash
# update the config file
if [[ "$1" == "update" ]]; then
  FLASHINFER_AUTOTUNER_LOAD_FROM_FILE=0 python bench_cutlass_fused_moe.py --num-tokens 4096 --update-config
  exit 0
fi

# benchmark
export FLASHINFER_AUTOTUNER_LOAD_FROM_FILE=0
for i in 1 2 4 8 16 24 32 48 64 96 128 256 512 1024 1536 2048 3072 4096 8192 16384; do
  python bench_cutlass_fused_moe.py --num-tokens $i --skip-autotune
done

export FLASHINFER_AUTOTUNER_LOAD_FROM_FILE=1
for i in 1 2 4 8 16 24 32 48 64 96 128 256 512 1024 1536 2048 3072 4096 8192 16384; do
  python bench_cutlass_fused_moe.py --num-tokens $i --skip-autotune
done
```
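In this flow, the `update` invocation runs the autotuner and regenerates the JSON config file, while the two benchmark loops pass `--skip-autotune` and compare the default tactics (`FLASHINFER_AUTOTUNER_LOAD_FROM_FILE=0`) against the tactics loaded from the file (`FLASHINFER_AUTOTUNER_LOAD_FROM_FILE=1`).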
LGTM, thank you @kaixih !
This PR adds support for loading autotuned results from JSON files for the Cutlass FP4 MoE backends.
The script `benchmarks/bench_cutlass_fused_moe.py` generates a JSON file at `configs/<flashinfer_version>/trtllm_fused_moe_<device_name>.json`, mapping input shapes to the optimal config/tactic for the GEMMs used in `fused_moe.cutlass_fused_moe`.

At runtime, setting the `FLASHINFER_AUTOTUNER_LOAD_FROM_FILE` environment variable enables loading from this file. If the variable is unset or a matching entry is not found, the autotuner falls back to the default config/tactic. Configs are organized by flashinfer version and GPU device; a small inspection sketch follows.
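As a quick way to see what was tuned, the sketch below loads a generated cache file and prints its entries. The path uses the placeholder version/device values from the description above, and the assumption that each entry stores a config/tactic pair is taken from the `load_from_json` snippet in the review thread.

```python
import json
from pathlib import Path

# Placeholder path; substitute the real flashinfer version and device name.
cache_path = Path("configs/<flashinfer_version>/trtllm_fused_moe_NVIDIA_B200.json")

if cache_path.exists():
    configs = json.loads(cache_path.read_text())
    for shape_key, entry in configs.items():
        # Assumed entry layout: [config, tactic], per the load_from_json snippet.
        print(f"{shape_key}: config={entry[0]}, tactic={entry[1]}")
else:
    print(f"No autotune cache at {cache_path}; default config/tactic will be used.")
```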
cc. @yzh119 @wenscarl @kushanam