Add --first-last-layers-bf16 #614

Open
zianglih wants to merge 1 commit into radixark:main from zianglih:mxfp8-last-n

Conversation


@zianglih zianglih commented Feb 17, 2026

@HumansAnd

Motivation

As mentioned in https://arxiv.org/abs/2509.25149, keeping "a few sensitive linear layers in higher precision" is helpful for training convergence.

Dependency

sgl-project/sglang#18742 for serving mixed-precision checkpoints

@gemini-code-assist

Summary of Changes

Hello @zianglih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the model quantization pipeline by introducing the ability to perform partial MXFP8 quantization. Specifically, it allows users to designate a certain number of the last decoder layers to retain their original BF16 precision, while the rest of the model is converted to MXFP8. This feature provides finer-grained control over model precision, potentially balancing performance and accuracy by preserving higher precision in critical layers.

Highlights

  • Partial Quantization Control: Introduced a new mechanism to selectively skip MXFP8 quantization for a specified number of final decoder layers, allowing them to remain in BF16 precision.
  • New Argument --first-last-layers-bf16: Added the --first-last-layers-bf16 flag and a corresponding --num-layers-at-end-in-bf16 argument to control this behavior in the model conversion and execution scripts.
  • Dynamic Skip Logic: Implemented dynamic calculation of layer prefixes to skip during quantization based on the total number of hidden layers and the specified number of BF16 layers at the end.
  • Improved Conversion Script: The convert_hf_to_mxfp8.py script was updated to read model configuration (like num_hidden_layers) and pass layer-specific parameters to the quantization logic.
Changelog
  • miles/backends/megatron_utils/megatron_to_hf/processors/quantizer_mxfp8.py
    • Implemented conditional logic to skip MXFP8 quantization for main decoder layers based on first_last_layers_bf16 and num_layers_at_end_in_bf16 arguments.
    • Added assertions to ensure num_layers is set and positive when first_last_layers_bf16 is enabled.
  • scripts/run_qwen3_30b_a3b.py
    • Introduced num_layers_at_end_in_bf16 as a new ScriptArgs parameter with a default of 0.
    • Added validation to ensure num_layers_at_end_in_bf16 is only used when rollout_mxfp8 is enabled.
    • Modified the convert_hf_to_mxfp8.py command to pass the num_layers_at_end_in_bf16 argument.
    • Added --first-last-layers-bf16 and related arguments to the misc_args for the execution command when train_fp8 or train_mxfp8 is enabled and num_layers_at_end_in_bf16 is greater than 0.
  • tools/convert_hf_to_mxfp8.py
    • Imported the re module for regular expression operations.
    • Modified the should_quantize function to accept an optional skip_weight_substrings argument, allowing dynamic control over which weights to skip.
    • Updated the process_file function signature to accept num_hidden_layers and num_layers_at_end_in_bf16.
    • Implemented logic within process_file to dynamically determine layer prefixes to skip based on num_layers_at_end_in_bf16 and passed these to should_quantize.
    • Modified the convert_mxfp8 function signature to accept num_layers_at_end_in_bf16 and read num_hidden_layers from config.json.
    • Updated the call to process_file within convert_mxfp8 to pass the new layer-related arguments.
    • Added a natural_key function for sorting modules_to_not_convert to ensure numerical order.
    • Added a command-line argument --num-layers-at-end-in-bf16 to the main function's argument parser.
    • Passed the num_layers_at_end_in_bf16 argument from the command line to the convert_mxfp8 function.
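The `natural_key` change matters because plain lexicographic sorting puts `layers.10` before `layers.2`. A minimal sketch of such a key (illustrative; the PR's exact implementation may differ):

```python
import re


def natural_key(s: str):
    """Split into digit and non-digit runs so embedded numbers sort numerically."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", s)]


modules = ["model.layers.10.", "model.layers.2.", "model.layers.1."]
print(sorted(modules, key=natural_key))
# → ['model.layers.1.', 'model.layers.2.', 'model.layers.10.']
```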
Activity
  • The pull request was opened by zianglih.
  • The motivation for this change is to enable more flexible quantization strategies.
  • A dependency on sglang/pull/18742 was noted.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a feature to skip quantization for the last N layers of a model, keeping them in BF16. The changes span the model conversion and training execution scripts, primarily for MXFP8 quantization. My review identifies a logical contradiction in the feature's activation condition, a magic number that hurts maintainability, and some redundant, non-robust file handling. Addressing these points will improve the code's correctness and quality.

Comment on lines +190 to +195

```python
if (args.train_fp8 or args.train_mxfp8) and args.num_layers_at_end_in_bf16 > 0:
    misc_args += (
        "--first-last-layers-bf16 "
        "--num-layers-at-start-in-bf16 0 "
        f"--num-layers-at-end-in-bf16 {args.num_layers_at_end_in_bf16} "
    )
```

high

There's a logical contradiction regarding when this feature is enabled. The __post_init__ check on line 40 asserts that num_layers_at_end_in_bf16 is only supported when rollout_mxfp8 is enabled. However, this block enables the feature for both train_fp8 and train_mxfp8. If train_fp8 is used (which usually implies rollout_fp8 and not rollout_mxfp8), the assertion on line 40 will fail, making the feature unusable with train_fp8.

Given that other changes in this PR are specific to mxfp8, it seems this feature is intended only for mxfp8. If so, the condition should be narrowed to resolve the contradiction.

Suggested change

```diff
-if (args.train_fp8 or args.train_mxfp8) and args.num_layers_at_end_in_bf16 > 0:
+if args.train_mxfp8 and args.num_layers_at_end_in_bf16 > 0:
     misc_args += (
         "--first-last-layers-bf16 "
         "--num-layers-at-start-in-bf16 0 "
         f"--num-layers-at-end-in-bf16 {args.num_layers_at_end_in_bf16} "
     )
```

Comment on lines +117 to +120

```python
num_maybe_mtp_layers = 1
dynamic_skip_layer_prefixes: set[str] = {
    f"model.layers.{i}." for i in range(tail_start_idx, num_hidden_layers + num_maybe_mtp_layers)
}
```

medium

The use of the magic number 1 for num_maybe_mtp_layers makes the code less readable and harder to maintain. It's not immediately clear why this value is 1 and if it's model-specific.

To improve clarity and maintainability, please define this as a named constant with a comment explaining its purpose. For example:

```python
# Number of MTP (Multi-Token Prediction) layers to account for, which might not be
# included in `num_hidden_layers`. This can be model-specific.
NUM_MAYBE_MTP_LAYERS = 1
# ...
# ... range(tail_start_idx, num_hidden_layers + NUM_MAYBE_MTP_LAYERS)
```

Comment on lines +154 to +157

```python
config_path = os.path.join(input_path, "config.json")
with open(config_path) as f:
    cfg = json.load(f)
num_hidden_layers = int(cfg["num_hidden_layers"])
```

medium

This block reads config.json, but the same file is read again at line 198, which is redundant. There are also inconsistencies in file handling throughout the function:

  • The open call here at line 155 is not protected by os.path.exists, which will cause a crash if the file is missing. The later read at line 198 is protected.
  • File operations at lines 198 and 200 do not use a with statement, which is not best practice.

Please consider refactoring to load the config once at the start of the function, and use with statements for all file I/O to ensure robustness and proper resource management.
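A sketch of the suggested refactor (the helper name `load_model_config` is hypothetical; only the config.json path and the load-once pattern follow the review comment):

```python
import json
import os


def load_model_config(input_path: str) -> dict:
    """Read config.json exactly once, failing with a clear error if it is absent."""
    config_path = os.path.join(input_path, "config.json")
    if not os.path.exists(config_path):
        raise FileNotFoundError(f"Missing model config: {config_path}")
    with open(config_path) as f:  # `with` closes the handle even on error
        return json.load(f)


# Load once at the top of the conversion function and reuse everywhere:
# cfg = load_model_config(input_path)
# num_hidden_layers = int(cfg["num_hidden_layers"])
```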
