Description
Your current environment
The output of `python collect_env.py` was not provided.
Model Input Dumps
No response
🐛 Describe the bug
vLLM version (the latest release was failing for me with other issues, e.g. it could not decode): 0.6.1.post1

I hosted the model with:

```shell
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
    --model csp-phi-3-mini-128k-ft-outputs/qlora_merged_model_csp_phi-ckp-23850 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --disable-log-requests \
    --max-model-len 14000
```
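As a quick sanity check (a sketch, assuming the server is on the default port 8000, which the client below also uses), the OpenAI-compatible `/v1/models` endpoint can confirm the server is up and serving the expected model name:

```python
import requests

# List the models served by the vLLM OpenAI-compatible server.
resp = requests.get("http://0.0.0.0:8000/v1/models", timeout=10)
print(resp.status_code)
# The "data" field should include
# "csp-phi-3-mini-128k-ft-outputs/qlora_merged_model_csp_phi-ckp-23850".
print(resp.json())
```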
The client used for the comparison:

```python
import requests
import json
import time

VLLM_INFER_URL = "http://0.0.0.0:8000/v1/completions"


def infer_vllm(prompt: str, max_new_tokens: int = 800, temp: float = 0.0) -> str:
    '''Infer from the hosted vLLM server.'''
    payload = json.dumps({
        "model": "csp-phi-3-mini-128k-ft-outputs/qlora_merged_model_csp_phi-ckp-23850",
        "prompt": prompt,
        "temperature": temp,
        # "top_k": 50,
        "top_p": 1,
        "max_tokens": max_new_tokens,
    })
    headers = {
        'Content-Type': 'application/json'
    }
    try:
        start_time = time.time()  # request start time (unused here, handy for latency checks)
        response = requests.request("POST", VLLM_INFER_URL, headers=headers, data=payload)
        if response.status_code == 200:
            return json.loads(response.text)["choices"][0]["text"]
        else:
            print(response.json())
            return "None"
    except requests.RequestException as exc:
        print(exc)
        return "None"
```
```python
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor

# `data` is a pandas DataFrame with a `prompt` column.
prompts = data.prompt.tolist()

with ThreadPoolExecutor(max_workers=5) as executor:
    list_of_results5 = list(tqdm(executor.map(infer_vllm, prompts[:10]), total=len(prompts[:10])))

# Take one of the concurrent responses (index 2)...
print(list_of_results5[2])
# ...versus the same prompt sent on its own.
print(infer_vllm(prompts[2]))
```
The two outputs are different. I initially thought this might be due to pad tokens, but I don't think so. What could be the reason for this? Can the model's pad tokens affect the output?
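To narrow it down, here is a minimal sketch (assuming the `infer_vllm` function and `prompts` list from above): send the identical prompt several times sequentially and several times concurrently, then compare. With `temperature=0` the sequential runs should match each other, so if only the concurrent runs diverge, the difference is more likely coming from requests being batched together on the server than from pad tokens.

```python
from concurrent.futures import ThreadPoolExecutor

# Debugging sketch: is the divergence tied to concurrency (server-side batching)?
test_prompt = prompts[2]

# 1) The same request repeated sequentially -- greedy decoding should be stable here.
sequential = [infer_vllm(test_prompt) for _ in range(5)]
print("sequential runs identical:", len(set(sequential)) == 1)

# 2) The same request fired in parallel, so it gets batched with identical copies of itself.
with ThreadPoolExecutor(max_workers=5) as executor:
    concurrent = list(executor.map(infer_vllm, [test_prompt] * 5))
print("concurrent runs identical:", len(set(concurrent)) == 1)

# 3) Cross-check: concurrent output vs. sequential output for the same prompt.
print("concurrent matches sequential:", concurrent[0] == sequential[0])
```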
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.