
Isolated reproduction of https://github.com/huggingface/transformers/issues/38071 #43906

@willxxy

Description

System Info

name = "accelerate"
version = "1.12.0"
name = "transformers"
version = "4.57.3"

Python 3.11

Who can help?

@gante @ArthurZucker Related to warning from #38071 for Qwen/Qwen3-Next-80B-A3B-Instruct model

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import pipeline
from tqdm import tqdm
LANGUAGE_MODEL = "Qwen/Qwen3-Next-80B-A3B-Instruct"
def main():

    pipe = pipeline("text-generation", model=LANGUAGE_MODEL, device_map="auto")
    messages = [[
        {"role": "system", "content": "hi"},
        {"role": "user", "content": "hdi"},
    ],[
        {"role": "system", "content": "hi"},
        {"role": "user", "content": "hddi"},
    ],[
        {"role": "system", "content": "hi"},
        {"role": "user", "content": "hasdasi"},
    ],[
        {"role": "system", "content": "hi"},
        {"role": "user", "content": "hiasdsad"},
    ],[
        {"role": "system", "content": "hi"},
        {"role": "user", "content": "hiasd"},
    ],[
        {"role": "system", "content": "hi"},
        {"role": "user", "content": "hiasd"},
    ]]

    BATCH_SIZE = 3
    for out in tqdm(
        pipe(messages, max_new_tokens=4, batch_size=BATCH_SIZE),
        total=len(messages),
        desc="Batched inference",
    ):
        response = out[0]["generated_text"][-1]["content"].strip()
        print(response)

if __name__ == "__main__":
    main()

Adding

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Next-80B-A3B-Instruct", padding="left")
pipe = pipeline("text-generation", model=LANGUAGE_MODEL, tokenizer=tokenizer, device_map="auto")

does not make the warning go away.
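Note that the warning text asks for `padding_side='left'` at tokenizer init, not `padding`. To illustrate why the side matters for a decoder-only model, here is a toy sketch in plain Python (the `pad_batch` helper and `PAD` value are illustrative, not transformers API): generation continues from the last position of each row, so with right-padding the shorter row ends in pad tokens and the model conditions its next token on padding, which matches the stray ")" completions in the log below.

```python
PAD = 0  # illustrative pad token id

def pad_batch(seqs, side="left"):
    """Pad variable-length token-id lists to equal width on one side."""
    width = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        pad = [PAD] * (width - len(s))
        out.append(pad + s if side == "left" else s + pad)
    return out

batch = [[5, 6], [7, 8, 9]]
# Left-padding: the last column holds real tokens (6 and 9).
print(pad_batch(batch, side="left"))   # [[0, 5, 6], [7, 8, 9]]
# Right-padding: row 0 ends in PAD, so generation would start from a pad token.
print(pad_batch(batch, side="right"))  # [[5, 6, 0], [7, 8, 9]]
```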

Expected behavior

(ecg-preprocess) (ecg-encoder) -bash-4.4$ CUDA_VISIBLE_DEVICES=4,5,6,7 uv run src/test.py
The fast path is not available because one of the required library is not installed. Falling back to torch implementation. To install follow https://github.com/fla-org/flash-linear-attention#installation and https://github.com/Dao-AILab/causal-conv1d
Loading checkpoint shards: 100%|█████████████████| 41/41 [00:31<00:00,  1.28it/s]
Device set to use cuda:0
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Batched inference:   0%|                                   | 0/6 [00:00<?, ?it/s])  
Hi!
Hi! It looks
Hello! It seems
Hello! It seems
)
) I'm here
Batched inference: 100%|████████████████████████| 6/6 [00:00<00:00, 82782.32it/s]
