[Bug]: Pixtral inference not working correctly with LLMEngine/AsyncEngine #8411
Comments
Please see whether the solutions in #8382 solve the issue you encountered.
I'm also currently looking into the problem; see also #8415.
Hey @larme, upon taking a closer look, the problem here is actually not related to the initialization of the model. It occurs because the images and the prompt are passed independently before being processed by mistral common's tokenizer. The image and prompt have to be processed together by mistral common to ensure that the tokens are in the right format. See the post below for a code snippet.
@DarkLight1337 I'm not sure what the best way is to make sure users always pre-process with the MistralTokenizer. It's done automatically whenever requests are passed in chat format. Should we maybe throw an error if people try to pass a raw prompt to Pixtral, so that the above error doesn't happen too often?
If you have special tokens that are only available via your tokenizer, you may search the text for those tokens inside the input processor and throw an error if none are found.
Great idea - I'll add this to #8415
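For illustration, the check suggested above could look roughly like the sketch below. This is not the code that landed in #8415; the function name `validate_image_prompt` and the constant `IMAGE_TOKEN_ID` are hypothetical, and the placeholder ID is an assumption that a real input processor would read from the model's tokenizer instead of hard-coding.

```python
from typing import Sequence

# Hypothetical placeholder ID, used only for this sketch. A real check would
# look up the image token ID from the tokenizer rather than hard-coding it.
IMAGE_TOKEN_ID = 10


def validate_image_prompt(
    prompt_token_ids: Sequence[int],
    num_images: int,
    image_token_id: int = IMAGE_TOKEN_ID,
) -> None:
    """Raise if images were supplied but the prompt has no image tokens.

    A prompt that was tokenized together with its images should already
    contain image placeholder tokens. If none are present, the prompt was
    most likely passed as raw text.
    """
    if num_images == 0:
        # Text-only request: nothing to verify.
        return
    if image_token_id not in prompt_token_ids:
        raise ValueError(
            f"Got {num_images} image(s) but found no image token "
            f"(id={image_token_id}) in the prompt. Tokenize the prompt and "
            "the images together, for example through a chat completion "
            "request, before passing them to the engine."
        )
```

Called on a prompt that was tokenized without its images, the function raises a `ValueError` early, instead of letting the request fail later with a harder-to-read error.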
To close the loop - #8415 should fix images with incorrect image init & resizing. Two things that I noticed:
Here is a complete, more complex example that should illustrate how to use the model:

import uuid

import PIL.Image
from vllm import EngineArgs, LLMEngine
from vllm import SamplingParams, TokensPrompt
from vllm.multimodal import MultiModalDataBuiltins
from mistral_common.protocol.instruct.messages import (
    UserMessage,
    TextChunk,
    ImageChunk,
)
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# MODEL_ID = "mistral-community/pixtral-12b-240910"
MODEL_ID = "mistralai/Pixtral-12B-2409"

ENGINE_ARGS = EngineArgs(
    model=MODEL_ID,
    tokenizer_mode="mistral",
    enable_chunked_prefill=False,
    limit_mm_per_prompt=dict(image=4),
    max_num_batched_tokens=16384,
    max_model_len=16384,
)
SAMPLING_PARAM = SamplingParams(temperature=0.0, max_tokens=512)

prompt = "describe the images"

image = PIL.Image.open("demo.jpg").resize((400, 500))
image_2 = PIL.Image.open("demo_2.jpg").resize((560, 800))
image_3 = PIL.Image.open("demo_3.jpg").resize((150, 200))
image_4 = PIL.Image.open("demo_4.jpg").resize((344, 444))

engine = LLMEngine.from_engine_args(ENGINE_ARGS)
# Underlying mistral common tokenizer used by the engine
tokenizer = engine.tokenizer.tokenizer.mistral


def create_image_input(images, prompt):
    # tokenize images and text together so the image tokens are in the right format
    tokenized = tokenizer.encode_chat_completion(
        ChatCompletionRequest(
            messages=[
                UserMessage(
                    content=[
                        TextChunk(text=prompt),
                    ] + [ImageChunk(image=img) for img in images]
                )
            ],
            model="pixtral",
        )
    )

    engine_inputs = TokensPrompt(prompt_token_ids=tokenized.tokens)

    mm_data = MultiModalDataBuiltins(image=images)
    engine_inputs["multi_modal_data"] = mm_data

    return engine_inputs


engine.add_request(uuid.uuid4().hex, create_image_input([image, image_3], prompt), SAMPLING_PARAM)
engine.add_request(uuid.uuid4().hex, create_image_input([image_2], prompt), SAMPLING_PARAM)

count = 0
while True:
    out = engine.step()
    count += 1
    for request_output in out:
        if request_output.finished:
            print(request_output.outputs[0].text)

    # add a third request while the first two are still being processed
    if count == 2:
        engine.add_request(uuid.uuid4().hex, create_image_input([image, image_4], prompt), SAMPLING_PARAM)

    if not engine.has_unfinished_requests():
        break
Thanks @patrickvonplaten! This also works wonderfully in AsyncLLMEngine. We made an example here: https://github.com/bentoml/BentoVLLM/blob/main/pixtral-12b/service.py
Your current environment
The output of `python collect_env.py`
Model Input Dumps
No response
🐛 Describe the bug
This code snippet does not work:
This will give an output message like:
It seems that I need to manually pad the input token IDs with image_token_ids like this:
To make it work.
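As a rough illustration of the manual padding described above, the sketch below prepends a run of image placeholder tokens to the text token IDs. The helper name `prepend_image_placeholders`, the position of the placeholders, and the way their count is chosen are all assumptions made for this sketch, not a reproduction of the original workaround. The comments in this thread recommend tokenizing the images and the prompt together with mistral common instead of padding by hand.

```python
from typing import List, Sequence


def prepend_image_placeholders(
    text_token_ids: Sequence[int],
    image_token_id: int,
    num_image_tokens: int,
) -> List[int]:
    """Return the text token IDs preceded by image placeholder tokens.

    `image_token_id` and `num_image_tokens` must be supplied by the caller.
    This helper (hypothetical) does not know how many tokens an image
    occupies or which ID the model uses for them.
    """
    if num_image_tokens < 0:
        raise ValueError("num_image_tokens must be non-negative")
    return [image_token_id] * num_image_tokens + list(text_token_ids)
```

Padding by hand is fragile, because the number and layout of image tokens depend on how the tokenizer encodes each image.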
AsyncLLMEngine also has the same limitation, and I need a similar modification to make it work, like: https://github.com/bentoml/BentoVLLM/pull/71/files#diff-357f77bce00e63217bb5ec382293bca653276e58af9bdfb6e7c50ca9487e27aeR84-R94
Is this behavior intended or a bug?
Before submitting a new issue...