
Conversation

@zheliuyu (Contributor) commented Nov 8, 2025

What does this PR do?

As title.

Test script

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import logging
import time


# Set the logging level to DEBUG to see which kernels are being called.
# logging.basicConfig(level=logging.DEBUG)

model_name = "Qwen/Qwen3-0.6B"

# Load the tokenizer and the model with Hub kernels enabled.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    use_kernels=True,
)

# Prepare the model input.
prompt = "Output the first 20 digits of pi."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate and print the runtime.
start_time = time.time()
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
)
print("runtime: ", time.time() - start_time)

# Decode only the newly generated tokens.
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")

print("content:", content)
```
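For reference, a minimal sketch of how one might compare the kernelized path against the default implementation: load the same model twice, once with `use_kernels=False` and once with `use_kernels=True`, and time a single generation pass. The helper name `time_generate` and the shorter `max_new_tokens` value are illustrative choices, not part of this PR.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import time


def time_generate(use_kernels: bool, model_name: str = "Qwen/Qwen3-0.6B") -> float:
    # Hypothetical helper: load the model with or without Hub kernels
    # and time a single generation pass on the same prompt.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype="auto",
        device_map="auto",
        use_kernels=use_kernels,
    )
    messages = [{"role": "user", "content": "Output the first 20 digits of pi."}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    start = time.time()
    model.generate(**inputs, max_new_tokens=256)  # shorter run for a quick comparison
    return time.time() - start


# Compare the default path against the kernelized one.
print("use_kernels=False:", time_generate(False))
print("use_kernels=True: ", time_generate(True))
```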

