Description
Your current environment
The output of `python collect_env.py` was not provided.
Model Input Dumps
No response
🐛 Describe the bug
vLLM version (the latest release was failing for me with other issues, e.g. it could not decode): 0.6.1.post1

I hosted the model with:

```shell
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
    --model csp-phi-3-mini-128k-ft-outputs/qlora_merged_model_csp_phi-ckp-23850 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --disable-log-requests \
    --max-model-len 14000
```
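As a quick sanity check (a sketch, assuming the server is on the default port 8000, which the client below also uses), the OpenAI-compatible `/v1/models` endpoint can confirm the server is up and serving the expected model name:

```python
import requests

# List the models served by the vLLM OpenAI-compatible server.
resp = requests.get("http://0.0.0.0:8000/v1/models", timeout=10)
print(resp.status_code)
# The "data" field should include
# "csp-phi-3-mini-128k-ft-outputs/qlora_merged_model_csp_phi-ckp-23850".
print(resp.json())
```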
The client used for the comparison:

```python
import requests
import json
import time

VLLM_INFER_URL = "http://0.0.0.0:8000/v1/completions"


def infer_vllm(prompt: str, max_new_tokens: int = 800, temp: float = 0.0) -> str:
    '''Infer from the hosted vLLM server.'''
    payload = json.dumps({
        "model": "csp-phi-3-mini-128k-ft-outputs/qlora_merged_model_csp_phi-ckp-23850",
        "prompt": prompt,
        "temperature": temp,
        # "top_k": 50,
        "top_p": 1,
        "max_tokens": max_new_tokens,
    })
    headers = {
        'Content-Type': 'application/json'
    }
    try:
        start_time = time.time()  # request start time (unused here, handy for latency checks)
        response = requests.request("POST", VLLM_INFER_URL, headers=headers, data=payload)
        if response.status_code == 200:
            return json.loads(response.text)["choices"][0]["text"]
        else:
            print(response.json())
            return "None"
    except requests.RequestException as exc:
        print(exc)
        return "None"
```
```python
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor

# `data` is a pandas DataFrame with a `prompt` column.
prompts = data.prompt.tolist()

with ThreadPoolExecutor(max_workers=5) as executor:
    list_of_results5 = list(tqdm(executor.map(infer_vllm, prompts[:10]), total=len(prompts[:10])))

# Take one of the concurrent responses (index 2)...
print(list_of_results5[2])
# ...versus the same prompt sent on its own.
print(infer_vllm(prompts[2]))
```
The two outputs are different. I initially thought this might be due to pad tokens, but I don't think so. What could be the reason for this? Can the model's pad tokens affect the output?
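To narrow it down, here is a minimal sketch (assuming the `infer_vllm` function and `prompts` list from above): send the identical prompt several times sequentially and several times concurrently, then compare. With `temperature=0` the sequential runs should match each other, so if only the concurrent runs diverge, the difference is more likely coming from requests being batched together on the server than from pad tokens.

```python
from concurrent.futures import ThreadPoolExecutor

# Debugging sketch: is the divergence tied to concurrency (server-side batching)?
test_prompt = prompts[2]

# 1) The same request repeated sequentially -- greedy decoding should be stable here.
sequential = [infer_vllm(test_prompt) for _ in range(5)]
print("sequential runs identical:", len(set(sequential)) == 1)

# 2) The same request fired in parallel, so it gets batched with identical copies of itself.
with ThreadPoolExecutor(max_workers=5) as executor:
    concurrent = list(executor.map(infer_vllm, [test_prompt] * 5))
print("concurrent runs identical:", len(set(concurrent)) == 1)

# 3) Cross-check: concurrent output vs. sequential output for the same prompt.
print("concurrent matches sequential:", concurrent[0] == sequential[0])
```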
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.