prefill decode microbenchmark for QWen3 #699
base: main
Conversation
Force-pushed from 130786a to cca99e4
Can you please add a description of all the args you can pass to the MB script and what they are?
You mean in the README? Sure!
NEW_MODEL_DESIGN = flags.DEFINE_string(
    "NEW_MODEL_DESIGN",
    "True",
    "Model design to use. If True, uses the new model design.",
Maybe note that this is only needed for a few models right now (L4, DSv3)
Sure!
if model_name == "qwen3-32b":
    return qwen3_32b_hf_config
elif model_name == "deepseek_v3":
    return deepseek_v3_hf_config
I thought the README said only Qwen3 is supported?
Yeah, I just checked DeepSeek, and it works out of the box, so I'll keep it for both.
from tpu_commons.utils import make_optimized_mesh

logger = init_logger(__name__)
power_of_two = np.pow(2, np.arange(18))  # up to 128k seq lens
I think this code is also called somewhere else -- maybe move to common utils file?
Will do!
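For reference, a minimal sketch of what a shared helper in a common utils module could look like; the module path, function name, and the next-power-of-two padding policy are my assumptions, not something this PR defines:

```python
import numpy as np

# Pre-computed powers of two, up to 128k sequence lengths (2**17 = 131072).
POWERS_OF_TWO = np.power(2, np.arange(18))

def pad_to_power_of_two(seq_len: int) -> int:
    """Smallest power of two >= seq_len, capped at 128k."""
    idx = int(np.searchsorted(POWERS_OF_TWO, seq_len))
    return int(POWERS_OF_TWO[min(idx, len(POWERS_OF_TWO) - 1)])
```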
def init_mesh(vllm_config, devices) -> None:
    try:
        # TODO: Update override steps.
Not going to be addressed here?
num_kv_heads: int = 32
head_dim: int = 128
vocab_size: int = 32000
model: str = "llama3"
I see you're updating the defaults here -- did you test this won't break anything?
I have verified it, but we can check whether having this is necessary.
Actually, the model config in this format is needed only for the new model design; it is not needed for the consolidated model code. We can remove or modify it once everything is consolidated.
head_size = model_config.get_head_size()
num_kv_heads = model_config.get_total_num_kv_heads()
hf_config = vllm_config.model_config.hf_config
head_size = hf_config.head_dim
Sorry, I realized I said this was safe offline, but can you double check this won't break anything for Llama3 / Qwen3 / DeepSeek?
Sure, I'll check offline_inference.py. Will that be a good check?
Actually, I will remove it for now; GitHub does not need it, only g3 does. I can handle this in the next PR.
Force-pushed from cca99e4 to 52841ee
I've tried this out, but I think a few things can be improved from a user-journey perspective:

- Running the benchmark doesn't really return anything on the console. It took me a while to realize that it was saving profile data in the directory specified by the `trace_dir` argument. Printing even just a simple "the run took xyz seconds to complete" to the console would be super useful; digging through the trace file every time I run the microbenchmark isn't ideal.
- For a better workflow, I wish it could automatically create an xprof link. I believe that's not possible due to a technical limitation, but I can imagine the following features (see the sketch after this comment):
  - Add an argument specifying a GCS bucket directory (e.g., `--gs_dir=gs://...`).
  - The microbenchmark automatically uploads trace files to the user-specified GCS bucket and remembers the trace file's path in the bucket (e.g., `gs://...xplane.pb`).
  - In the console, the microbenchmark prints the command the user needs to run in a g3 workspace to create the xprof link, e.g.: "To create an xprof link, run the following command in g3: blaze run -c opt //cloud/tpu/tools/c2xprof:main -- --alsologtostderr --gcs_path=gs://...xplane.pb"

A few additional minor comments:

- Left a comment in the file, but creating a config for every model we want to support isn't scalable. Please consider reusing the logic that fetches config.json from Hugging Face to create a model config.
- No support for vLLM models?
- This branch didn't work out-of-the-box and I had to make some fixes in `microbenchmark_input_utils.py`.
- Is there a number you can share showing that the number returned here closely matches the number from the e2e run?
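A rough sketch of what that console + GCS workflow could look like; the `--gs_dir` name comes from the comment above, and the timing and upload code is purely illustrative, not part of this PR:

```python
import subprocess
import time

def run_and_report(run_model_fn, trace_dir: str, gs_dir: str = "") -> None:
    """Illustrative wrapper: time the run, optionally upload traces, print next steps."""
    start = time.perf_counter()
    run_model_fn()
    elapsed = time.perf_counter() - start
    print(f"The run took {elapsed:.2f} seconds to complete. Traces saved in {trace_dir}")

    if gs_dir:  # e.g. --gs_dir=gs://...
        # Upload the local trace directory to the user-specified GCS bucket.
        subprocess.run(["gsutil", "cp", "-r", trace_dir, gs_dir], check=True)
        print("To create an xprof link, run the following command in g3:")
        print("  blaze run -c opt //cloud/tpu/tools/c2xprof:main -- "
              f"--alsologtostderr --gcs_path={gs_dir}/<...>.xplane.pb")
```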
Force-pushed from 52841ee to ae522a8
Signed-off-by: Vijaya singh <singhvijaya@google.com>
Force-pushed from ae522a8 to 7797944
@@ -0,0 +1,86 @@
# MICROBENCHAMRKING IS EXPERIMENTAL AND NOT SUPPORTED FOR ALL MODELS AND FLEXIBLE WORKLOADS

The Goal of microbenchmarking is to strip the model call from VLLM Dependencies (Scheduler and KV Cache Manager) for efficient debugging and performance optimization of just model call.
nit: VLLM -> vLLM
A rewording suggestion:
"The goal of microbenchmarking is to strip out the vLLM server layer and focus on just profiling the model calls."
The current version is ** working on pinned main **
"The current implementation runs on the following pinned version of the main branch:"
Actually, it's not pinned anymore. I will just mention the commit so we can backtrack, but as long as the model call API remains unchanged, it should work.
> ⚠️ The microbenchmarking code **does not support all models and features and is currently used for debugging and optimizing static workloads
**Only tested model for microbenchmarking is QWEN3-32B**
"The only model validated for microbenchmarking is Qwen3-32B."
Don't we support DeepSeek as well?
I wanted to keep this and remove it once we verify the runs together.
## Params needed by microbenchmarking code

### `max_seq_len` -
max model len this is length of the model including number of prefill and decode tokens
"max model len this is the maximum supported length of each request. Typically this equals the maximum number of prefill + decode tokens across all requests."
### `phase` -
phase of the model, supported modes are prefill and decode
"Inference phase - supported phases are 'prefill' and 'decode'."
### `decode_offset_from_prefill` -
used in decode primarily, if the value is 1, it means 1st token after prefill
"This offset indicates the decode step index to profile. E.g. setting a value of 10 corresponds to profiling the 10th decode step."
### `model_hf_config` -
path to json file where HFConfig is saved. We need this because we dont want to download from huggingface.
"We need this to avoid having to download the model from huggingface everytime."
Actually, if we are just downloading the config, would it be a problem to download from HF?
max length of prefill sequence

### `max_num_sequence` -
is the maximum number of sequence supported by model.
sequence -> sequences
i) In Prefill phase - `max_num_sequence` = max_seq_len // max_prefill_len

ii) In Decode phase - `max_num_sequence` < `max_seq_len`
Why does the maximum sequence length influence the maximum number of allowed sequences (and vice versa)?
Would it help if we added --max-num-batched-tokens from vLLM? This variable corresponds to the maximum total tokens that we can process in a single batch.
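A small worked example of the relationships i) and ii) quoted above (illustrative values only); a --max-num-batched-tokens-style knob would bound the same per-batch token budget explicitly:

```python
max_seq_len = 8192
max_prefill_len = 1024

# i) Prefill: every sequence contributes max_prefill_len tokens to the batch,
#    so the sequence count is bounded by the per-batch token budget.
max_num_sequence_prefill = max_seq_len // max_prefill_len  # 8

# ii) Decode: every sequence contributes one token per step, so the sequence
#     count only has to stay below max_seq_len.
max_num_sequence_decode = 256
assert max_num_sequence_decode < max_seq_len
```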
or same as `page_size` for KV Cache

### `additional_config` -
example of additional config
"This is used to propagate tpu_commons-specific arguments (e.g. sharding and quantization settings)."
### `model_config` -
--model_config='{"model":"Qwen/Qwen3-32B"}'

### `new_model_design` -
Are we using this? it seems like you are setting this via env variables in your code.
local location where traces are stored. Default value is `/tmp/tpu_commons_traces`

## Example command to run Microbenchmark
Thanks for adding these!
_DECODE_OFFSET_FROM_PREFILL = flags.DEFINE_integer(
    "decode_offset_from_prefill",
    0,
According to the readme, does offset of 0 correspond to the last prefill token?
At least set it to 1.
# this has to be overriden as the calculation is not very correct yet on microbenchmark side.
#TODO: @(vijaya) Fix the calculation and remove this flag as an override.
_KV_NUM_BLOCK_OVERRIDE = flags.DEFINE_integer(
Can you share some guidelines in the readme for how you are calculating this?
So far I've been running on v7x-2 (it depends on how much TPU memory you have free right now) to determine the total number of blocks.
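A back-of-the-envelope sketch of that calculation; the HBM budget and model dimensions below are illustrative assumptions, not measured values for Qwen3-32B on v7x-2:

```python
def estimate_kv_num_blocks(hbm_budget_bytes: int, num_layers: int,
                           num_kv_heads: int, head_dim: int,
                           block_size: int, dtype_bytes: int = 2) -> int:
    """Rough upper bound on the number of KV cache blocks that fit in the budget."""
    # Each block stores K and V for `block_size` tokens in every layer.
    bytes_per_block = 2 * num_layers * block_size * num_kv_heads * head_dim * dtype_bytes
    return hbm_budget_bytes // bytes_per_block

# Illustrative only: ~20 GiB free HBM, bf16 KV cache, block_size of 16.
print(estimate_kv_num_blocks(hbm_budget_bytes=20 * 1024**3, num_layers=64,
                             num_kv_heads=8, head_dim=128, block_size=16))
```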
"Model configuration for the model.", | ||
) | ||
|
||
NEW_MODEL_DESIGN = flags.DEFINE_string( |
Are we using this?
)


def get_hf_config_attribute_map(model_hf_config: str):
nit: maybe creating this function is overkill? It's just one extra line of code to convert the string to a json =]
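For comparison, the inline version this comment is pointing at might look roughly like the following; the file path is hypothetical, and whether the flag value is a path or a raw JSON string, the conversion is a one- or two-liner:

```python
import json

# Assuming the flag value is a path to the saved HF config JSON file:
model_hf_config_path = "/tmp/qwen3_32b_hf_config.json"  # hypothetical path
with open(model_hf_config_path) as f:
    hf_config_attrs = json.load(f)

# Or, if the flag value were the JSON string itself:
# hf_config_attrs = json.loads('{"head_dim": 128}')
```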
end_time = time.time()
jax.profiler.stop_trace()
logger.info(
    f"Time taken for model call in phase {phase}: {end_time - start_time} seconds. and profile trace is saved in {self.trace_directory}"
"seconds. and" -> "seconds\nProfile"
axis_names = ("data", "model")
mesh_shape = (dp, tp)
# for deepseekv3
Since deepseek is a new model, I think this is not needed/will be skipped?
type: str
std: float = None

def generate_samples(self, shape: Tuple[int], fill_val: Any) -> np.array:
Is this used anywhere?
self.vllm_config = vllm_config
self.model = model
self.mesh = mesh
self.sampler = sampler
Not used?
])

def _create_mock_block_table(self, random_permute: bool = False):
    block_table = np.arange(self.input_args.num_blocks_override,
Are we assuming that the KV cache is always full? My understanding is that the block table tells you which blocks to use in a batch but it can technically be less than the full KV cache buffer.
Unless the block_table is padded to represent a non-full table.
The implementation is different from tpu_commons.
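To illustrate the point about non-full tables, a sketch of a block table that only references the blocks each sequence actually needs and pads the rest; the shape, padding value, and helper name are assumptions, not this PR's implementation:

```python
import numpy as np

def make_mock_block_table(num_seqs: int, seq_len: int, block_size: int,
                          max_blocks_per_seq: int, pad_id: int = 0) -> np.ndarray:
    """Block table of shape (num_seqs, max_blocks_per_seq); unused slots hold pad_id."""
    blocks_per_seq = -(-seq_len // block_size)  # ceil division
    assert blocks_per_seq <= max_blocks_per_seq
    table = np.full((num_seqs, max_blocks_per_seq), pad_id, dtype=np.int32)
    next_block = 0
    for s in range(num_seqs):
        table[s, :blocks_per_seq] = np.arange(next_block, next_block + blocks_per_seq)
        next_block += blocks_per_seq
    return table
```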
@dataclass
class InputArgs:
    max_num_seq: int
    max_prefill_len: int
This looks like a duplicate.
In prefill phase, all sequences are of length max_prefill_len"""
if phase == 'decode':
    # this means all sequences are in decode phase
    return np.random.randint(1, vocab_size, size=max_num_sequence)
nit: this is probably between 0 and vocab_size-1
phase='decode') -> jnp.ndarray:
"""
Creates sequence lengths based on phase.
In decode phase, all sequences are of length offset_from_prefill (usually 1) + max_prefill_len -1
Isn't offset_from_prefill configurable now?
elif phase == 'prefill':
    # this means all sequences are in prefill phase with same prefill length which is max_prefill_len
    return np.random.randint(max_prefill_len,
                             max_prefill_len + 1,
Why is it max_prefill_len+1?
Is this deprecated?
""" | ||
if phase == 'prefill': | ||
query_start_offsets = np.zeros(max_num_seq + 1, dtype=np.int32) | ||
query_start_offsets[0] = 0 |
This is already true from line 249
offset_from_prefill: int = 1,
phase='decode') -> jnp.ndarray:
"""
Creates input positions based on phase.
Could you explain what input positions is in this context?
How about mentioning it is the input position index?
""" | ||
if phase == 'decode': | ||
# in decode phase, all sequences are of length 1 and the position is immediately after the prefill length | ||
return np.full((max_num_seqs, ), |
Shouldn't this be the sum(seq_lens) (i.e. total scheduled tokens)?
def create_num_blocks(max_model_len: int,
                      block_size: int,
                      num_block_override=0) -> int:
Maybe you should not set a default for num_block_override if right now the implementation requires setting this? You can add a TODO instead to make it dynamic in the future.
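One possible shape for that suggestion (a sketch only; the body below is a placeholder, not the PR's actual implementation):

```python
def create_num_blocks(max_model_len: int,
                      block_size: int,
                      num_block_override: int) -> int:
    # TODO: derive the block count from max_model_len, block_size, and free HBM
    # instead of requiring callers to pass an explicit override.
    return num_block_override
```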
if phase == 'decode':
    # this means all sequences are in decode phase
    return np.array([max_num_sequence, max_num_sequence, max_num_sequence],
Why are there max_num_sequence prefill sequences during decode?
    dtype=np.int32)
elif phase == 'prefill':
    # this means all sequences are in prefill phase
    return np.array([0, 0, max_num_sequence], dtype=np.int32)
According to the docstring, this is setting both prefill & decode to 0 but total number of sequences to max_num_sequence
""" | ||
Creates request distribution based on phase. | ||
request_distribution is of shape (3,) where | ||
request_distribution[0] = number of sequences in decode phase |
Can you update the documentation here? We can also write that empirically this has been confirmed but a TODO is to review the tpu_commons code one more time and confirm.
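For reference, a sketch that restates how the two phases populate the (3,) array per the snippets above; the exact meaning of each slot still needs to be confirmed against the tpu_commons code, as noted in this thread:

```python
import numpy as np

def create_request_distribution(max_num_sequence: int, phase: str) -> np.ndarray:
    """request_distribution[0] = number of sequences in decode phase (per the
    docstring); the remaining slots mirror the snippets above and should be
    double-checked against tpu_commons."""
    if phase == "decode":
        return np.array([max_num_sequence] * 3, dtype=np.int32)
    if phase == "prefill":
        return np.array([0, 0, max_num_sequence], dtype=np.int32)
    raise ValueError(f"Unsupported phase: {phase}")
```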
Description
Start with a short description of what the PR does and how this is a change from
the past.
The rest of the description includes relevant details and context, examples:
If the change fixes a bug or a Github issue, please include a link, e.g.,:
FIXES: b/123456
FIXES: #123456
Tests
Please describe how you tested this change, and include any instructions and/or
commands to reproduce.
Checklist
Before submitting this PR, please make sure: