Conversation


@sahilsuneja1 sahilsuneja1 commented Apr 2, 2025

What does this PR do?

This PR adds support for using MLPSpeculator models in assisted generation, similar to their existing support in TGI and vLLM.

Model code originally authored by Davis Wertheimer @daviswer

A list of already existing speculators is available here and here.

Training recipes for new speculators are available here and here.

Usage example:

import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, MLPSpeculatorPreTrainedModel

# target device: GPU if available, otherwise CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

def compare_assisted_generation(prompts, checkpoint, assistant_checkpoint):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    inputs = tokenizer(prompts, return_tensors="pt").to(device=device)

    # load the base model and the MLPSpeculator draft model in bfloat16
    model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device=device, dtype=torch.bfloat16)
    assistant_model = MLPSpeculatorPreTrainedModel.from_pretrained(assistant_checkpoint).to(device=device, dtype=torch.bfloat16)
    model.eval()
    assistant_model.eval()

    if model.generation_config.pad_token_id is None:
        model.generation_config.pad_token_id = model.generation_config.eos_token_id

    # greedy decoding; hidden states are requested since the MLPSpeculator
    # drafts tokens from the base model's hidden states
    generate_kwargs = {
        "do_sample": False,
        "temperature": None,
        "max_new_tokens": 50,
        "output_hidden_states": True,
    }

    # warmup
    for _ in range(2):
        model.generate(**inputs, **generate_kwargs)
        model.generate(**inputs, assistant_model=assistant_model, **generate_kwargs)

    start_time = time.time()
    outputs = model.generate(**inputs, **generate_kwargs)
    end_time = time.time()
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    print(f"Generation without assistant; Time taken: {end_time-start_time} seconds")

    start_time = time.time()
    outputs = model.generate(**inputs, assistant_model=assistant_model, **generate_kwargs)
    end_time = time.time()
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    print(f"Generation with assistant; Time taken: {end_time-start_time} seconds")


torch.set_grad_enabled(False)
prompt = "Alice and Bob"
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
speculator_checkpoint = "ibm-ai-platform/llama3-8b-accelerator"
compare_assisted_generation(prompt, checkpoint, speculator_checkpoint)

Output from the above example on an A100 (assisted generation is roughly 1.7x faster here):

['Alice and Bob are two friends who are trying to solve a puzzle. They are given a set of numbers, and they need to find the sum of the numbers that are multiples of 3 or 5.\n\nHere is the set of numbers: 1, ']
Generation without assistant; Time taken: 1.150806188583374 seconds
['Alice and Bob are two friends who are trying to solve a puzzle. They are given a set of numbers, and they need to find the sum of the numbers that are multiples of 3 or 5.\n\nHere is the set of numbers: 1, ']
Generation with assistant; Time taken: 0.6626832485198975 seconds

Who can review?

@gante

Signed-off-by: Sahil Suneja <sahilsuneja@gmail.com>
MLPSpeculator originally authored by Davis Wertheimer at:
https://github.com/foundation-model-stack/fms-extras/blob/main/fms_extras/models/speculator.py
@github-actions github-actions bot marked this pull request as draft April 2, 2025 23:09
Contributor

github-actions bot commented Apr 2, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@sahilsuneja1 sahilsuneja1 marked this pull request as ready for review April 2, 2025 23:12
@gante gante requested review from gante and removed request for Rocketknight1 and ArthurZucker April 3, 2025 09:37
Member

gante commented Apr 3, 2025

Hey @sahilsuneja1 👋

We're currently pausing the addition of all non-critical decoding methods, including assisted generation variations. This is because we're designing a new way of adding transformers-compatible decoding methods (see this draft PR).

TL;DR, if the plan goes forward, new decoding methods will live on the hub, and transformers will only hold the core decoding strategies 🤗
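
For reference, a hypothetical sketch of what calling a hub-hosted decoding method might look like if that plan lands; the custom_generate argument and the repository name below are assumptions based on the draft PR, not a confirmed API:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
inputs = tokenizer("Alice and Bob", return_tensors="pt")

# hypothetical: the MLPSpeculator decoding loop would live in a hub repository
# and be pulled in at generate() time instead of being merged into transformers
outputs = model.generate(
    **inputs,
    custom_generate="your-org/mlp-speculator-decoding",  # illustrative repo name, not a real checkpoint
    trust_remote_code=True,
    max_new_tokens=50,
)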

Signed-off-by: Sahil Suneja <sahilsuneja@gmail.com>
Author

sahilsuneja1 commented

Thanks @gante, will track progress on it and revisit when the change is made!

Signed-off-by: Sahil Suneja <sahilsuneja@gmail.com>