
[Bug]: Different garbage output for the same prompt when inferred as a single sequence vs. concurrent requests on the vLLM OpenAI server, temp=0 (mixed batching in LongRoPE) #10336

Closed as not planned
Description

@bhupendrathore

Your current environment

The output of `python collect_env.py` was not provided.

Model Input Dumps

No response

🐛 Describe the bug

vLLM version (the latest version was failing for me due to other issues, e.g. "cannot decode" errors): 0.6.1.post1

Hosted the model with:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
    --model csp-phi-3-mini-128k-ft-outputs/qlora_merged_model_csp_phi-ckp-23850 \
    --dtype bfloat16 --gpu-memory-utilization 0.9 \
    --disable-log-requests --max-model-len 14000
```
```python
import json

import requests

VLLM_INFER_URL = "http://0.0.0.0:8000/v1/completions"

def infer_vllm(prompt: str, max_new_tokens: int = 800, temp: float = 0.0) -> str:
    '''Infer from the hosted vLLM server via the OpenAI-compatible completions endpoint.'''
    payload = json.dumps({
        "model": "csp-phi-3-mini-128k-ft-outputs/qlora_merged_model_csp_phi-ckp-23850",
        "prompt": prompt,
        "temperature": temp,  # 0.0 -> greedy decoding
        # "top_k": 50,
        "top_p": 1,
        "max_tokens": max_new_tokens,
    })
    headers = {"Content-Type": "application/json"}

    try:
        response = requests.post(VLLM_INFER_URL, headers=headers, data=payload)
        if response.status_code == 200:
            # Return the generated text of the first (and only) choice.
            return response.json()["choices"][0]["text"]
        else:
            print(response.json())
            return "None"
    except requests.RequestException as exc:
        print(exc)
        return "None"
```
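At temperature 0 a single, unbatched request is deterministic, which is the baseline for the comparison below. A minimal sanity check (assuming the server launched above is running; the prompt string here is just a placeholder):

```python
# Sanity check: with temperature 0, two sequential single requests for the same
# prompt should return identical text. The prompt below is only a placeholder.
sample_prompt = "Summarize the ticket in one sentence."
out_a = infer_vllm(sample_prompt)
out_b = infer_vllm(sample_prompt)
print(out_a == out_b)  # expected: True when the request is not batched with others
```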

```python
from concurrent.futures import ThreadPoolExecutor

from tqdm import tqdm

# `data` is a pandas DataFrame whose `prompt` column holds the test prompts.
prompts = data.prompt.tolist()

# Send 10 prompts with 5 concurrent workers so the server batches them together.
with ThreadPoolExecutor(max_workers=5) as executor:
    list_of_results5 = list(tqdm(executor.map(infer_vllm, prompts[:10]), total=len(prompts[:10])))
```
```python
# Output for prompts[2] from the concurrent (batched) run:
print(list_of_results5[2])

# vs. the same prompt sent as a single sequential request:
print(infer_vllm(prompts[2]))
```

The two outputs are different. I initially thought this might be due to pad tokens, but I don't think that's the cause.

What could be the possible reason for this? Can the model's pad tokens affect it?
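For reference, a minimal reproduction sketch that quantifies the divergence, assuming the same server, `infer_vllm`, and `prompts` list as in the snippets above:

```python
from concurrent.futures import ThreadPoolExecutor

subset = prompts[:10]

# Concurrent requests: the server schedules these together, so each prompt is
# decoded as part of a mixed batch.
with ThreadPoolExecutor(max_workers=5) as executor:
    batched = list(executor.map(infer_vllm, subset))

# Sequential requests: each prompt is effectively decoded in a batch of size 1.
sequential = [infer_vllm(p) for p in subset]

mismatches = [i for i, (b, s) in enumerate(zip(batched, sequential)) if b != s]
print(f"{len(mismatches)}/{len(subset)} prompts differ between batched and sequential runs: {mismatches}")
```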

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Labels: bug (Something isn't working), stale (Over 90 days of inactivity)
