Description
This is a replication of the problem described in #711 (comment). I hit the same issue, so I'm opening this issue to track it.
Problem Statement
When chatting via the `llm.create_chat_completion` API, the chatbot keeps greeting me and repeating what I had typed earlier.
Expected Behavior
I'm trying to hold a continuous conversation with the chatbot, and I expect a smooth conversational flow.
Current Behavior
A conversation with a simple chat program goes like the following, using `llama-2-7b-chat.Q5_K_M.gguf` downloaded from https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF; the chat code is adapted from PR #575:
Rob: Hello! It's nice to meet you. Is there something I can help you with or would you like to chat about something in particular?
User: Let's talk about football
Rob: Hello! Nice to meet you too! *adjusts avatar glasses* I'm just an AI, so I don't have personal preferences or interests, but I'd be happy to chat with you about anything you'd like. How about we talk about the latest football news? 🏈
User: I like to play football on shiny day.
Rob: Hello! *adjusts avatar hat* It's great to meet you too! I'm just an AI, so I don't have personal preferences or interests either, but I'd be happy to chat with you about anything you'd like. How about we talk about the latest football news? 🏈
Oh, you like to play football on shiny days? That's great! *adjusts avatar sunglasses* I'm just an AI, so I don't have personal experiences, but I can certainly chat with you about football. What's your favorite team or player? 🤔
The responses from 'Rob' are generated by llama-cpp-python, and we can clearly see that 'Rob' greets me again and again.
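For context, llama-2-chat models are trained on Meta's `[INST]`/`<<SYS>>` turn template, and my unverified guess is that the prompt assembled by `create_chat_completion` at this commit doesn't match it, so the model treats each completion as the start of a fresh conversation. A minimal sketch of the expected layout, assuming Meta's published format (the function name and structure here are illustrative, not llama-cpp-python internals):

```python
# Sketch only: Meta's published llama-2-chat prompt template.
def format_llama2_prompt(system: str, history: list[tuple[str, str]]) -> str:
    """history is a list of (user, assistant) pairs; the assistant slot of
    the last pair may be '' when requesting a new completion."""
    prompt = ""
    for i, (user, assistant) in enumerate(history):
        # The system prompt is folded into the first user turn only.
        sys_block = f"<<SYS>>\n{system}\n<</SYS>>\n\n" if i == 0 else ""
        prompt += f"<s>[INST] {sys_block}{user} [/INST] {assistant}"
        if assistant:
            prompt += " </s>"
    return prompt
```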
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
- Physical (or virtual) hardware you are using, e.g. for Linux:
$ lscpu
12th Gen Intel(R) Core(TM) i7-12700H, with hyper-threading off
- Operating System, e.g. for Linux:
$ uname -a
Linux cortex 6.5.3-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 13 Sep 2023 08:37:40 +0000 x86_64 GNU/Linux
- SDK version, e.g. for Linux:
$ python3 --version
Python 3.11.3
$ make --version
GNU Make 4.4.1
$ g++ --version
g++ (GCC) 13.2.1 20230801
Failure Information (for bugs)
See the 'Current Behavior' section above.
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
- Step 1: save the following as `simple_chat.py`:
```python
from llama_cpp import ChatCompletionMessage, Llama
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", type=str, default="<your path to your ggml-model.bin>")
args = parser.parse_args()

llm = Llama(model_path=args.model)

# Seed the conversation history
messages = [
    ChatCompletionMessage(role='system', content='start chat'),
    ChatCompletionMessage(role='user', content='Hello'),
]

while True:
    # Generate a response from the full history so far
    response = llm.create_chat_completion(
        messages,
        temperature=0.2,
        top_p=0.95,
        top_k=40,
        stream=False,
        stop=[],
        max_tokens=256,
        repeat_penalty=1.1,
    )
    output = response['choices'][0]['message']['content']
    print(f"Rob: {output}")

    # Append the assistant's reply to the conversation history
    messages.append(ChatCompletionMessage(role='assistant', content=output))

    # Read the next user turn and append it as well
    user_message = input("User: ")
    messages.append(ChatCompletionMessage(role='user', content=user_message))
```
- Step 2: download `llama-2-7b-chat.Q5_K_M.gguf` from https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
- Step 3: run `python simple_chat.py --model <path to llama-2-7b-chat.Q5_K_M.gguf>`
- Step 4: chat freely, or reuse the 'User' prompts from the 'Current Behavior' section above
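A possible workaround, untested against the commits pinned below: newer llama-cpp-python releases accept a `chat_format` argument on `Llama` to pin the prompt template. I'm assuming here that the keyword is available in your build; if it is, forcing the llama-2 format may stop the repeated greetings:

```python
from llama_cpp import Llama

# Assumption: the chat_format keyword exists in your llama-cpp-python build;
# "llama-2" selects the [INST]/<<SYS>> template llama-2-chat was trained on.
llm = Llama(
    model_path="llama-2-7b-chat.Q5_K_M.gguf",  # illustrative path
    chat_format="llama-2",
)
```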
- llama-cpp-python version: commit 43dfe1e
- llama.cpp version: commit 48edda3
Failure Logs
See the conversation transcript in the 'Current Behavior' section above.