imatrix: optionally activate MTP/NextN draft head during collection#23476

Draft: mudler wants to merge 1 commit into ggml-org:master from mudler:mtp-imatrix

mudler (Contributor) commented May 21, 2026

Overview

llama-imatrix only runs forward passes through the trunk, so MTP draft head tensors (blk.<n_layer>.nextn.eh_proj etc., added by #22673) never receive activations and have no imatrix data. Low-bit i-quants for those tensors then fail at quantize-time:

llama_model_quantize: failed to quantize: Missing importance matrix for
tensor blk.40.nextn.eh_proj.weight in a very low-bit quantization

This adds an opt-in --mtp flag to llama-imatrix. When set and the loaded model has MTP/NextN layers, a second llama_context is created with ctx_type = LLAMA_CONTEXT_TYPE_MTP. After each trunk sub-batch decode, the trunk's pre-norm hidden states are paired with the next-token ids and decoded through the MTP context, mirroring how common_speculative_state_draft_mtp::process() invokes the head during real spec decoding. MTP-layer tensors then land in the same imatrix collector via the existing eval callback.
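The pairing step described above can be sketched in isolation. This is only an illustration of the indexing, not the PR's code: `mtp_row` and `pair_hidden_with_next` are hypothetical names, and skipping the last position of a sub-batch (which has no next token inside that sub-batch) is an assumption, since the description does not say how the PR handles that boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in type; not part of llama.cpp.
struct mtp_row {
    std::size_t  hidden_row; // index of the trunk hidden-state row in the sub-batch
    std::int32_t next_token; // id of the token that follows that position
};

// Pair the trunk hidden state at position i with the token at position i + 1.
// Assumption: the final position has no successor inside the sub-batch and is
// skipped here.
std::vector<mtp_row> pair_hidden_with_next(const std::vector<std::int32_t> & tokens) {
    std::vector<mtp_row> rows;
    if (tokens.size() < 2) {
        return rows; // nothing to pair
    }
    rows.reserve(tokens.size() - 1);
    for (std::size_t i = 0; i + 1 < tokens.size(); ++i) {
        rows.push_back({i, tokens[i + 1]});
    }
    return rows;
}
```

With a four-token sub-batch this yields three rows, each of which would be fed to the MTP context as one (hidden state, token id) input.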

Default behavior is unchanged. The flag is a no-op (with a warning) for models without MTP layers. It is currently restricted to n_seq == 1 to keep the MTP-row-to-output-row mapping unambiguous; otherwise it warns and disables itself.

Adds a small public accessor llama_model_n_nextn(model) so callers outside src/ can probe MTP presence without pulling in llama_hparams.
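The gating rules above (opt-in flag, MTP layers present, n_seq == 1) amount to a small decision function. The following is a minimal sketch under stated assumptions: `resolve_mtp` and `mtp_decision` are hypothetical names, the warning strings are invented for illustration, and `n_nextn` stands in for the value returned by `llama_model_n_nextn(model)`, which is assumed here to be a count of NextN layers (zero meaning none).

```cpp
#include <cassert>
#include <string>

// Hypothetical result type; not part of llama.cpp.
struct mtp_decision {
    bool        enabled; // whether MTP collection should run
    std::string warning; // empty when there is nothing to warn about
};

// flag_set: whether --mtp was passed.
// n_nextn:  assumed result of llama_model_n_nextn(model); 0 means no MTP layers.
// n_seq:    number of sequences processed per batch.
mtp_decision resolve_mtp(bool flag_set, int n_nextn, int n_seq) {
    assert(n_nextn >= 0 && n_seq >= 1);
    if (!flag_set) {
        return {false, ""}; // default behavior: unchanged and silent
    }
    if (n_nextn == 0) {
        return {false, "model has no MTP/NextN layers; --mtp ignored"};
    }
    if (n_seq != 1) {
        return {false, "--mtp requires n_seq == 1; MTP collection disabled"};
    }
    return {true, ""};
}
```

Keeping the decision separate from context creation makes the fallback paths easy to check without loading a model.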

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. I used Claude to drive this PR, since it seemed well scoped and the review should be manageable: it's not a large chunk of code.

Files:
  common/arg.cpp            +9   --mtp CLI option
  common/common.h           +1   imat_mtp on common_params
  include/llama.h           +4   llama_model_n_nextn() decl
  src/llama-model.cpp       +4   llama_model_n_nextn() impl
  tools/imatrix/imatrix.cpp +125 MTP context + per-batch MTP forward pass

CISC (Member) commented May 21, 2026

Overlapping #23258

mudler (Contributor, Author) commented May 21, 2026

> Overlapping #23258

Ouch, I didn't see it, sorry. Feel free to close anytime. I'd be OK to pick it up in case the other one doesn't land.

CISC (Member) commented May 21, 2026

> > Overlapping #23258
>
> ouch, didn't see it - sorry. feel free to close anytime. I'd be ok to pick it up in case the other one doesn't land.

It seems you have differing approaches, perhaps worth looking into if something can be combined and/or improved?
