Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
version: 4840 (3ffbbd5)
built with Ubuntu clang version 18.1.8 (++20240731024944+3b5b5c1ec4a3-1~exp1~20240731145000.144) for x86_64-pc-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
i9-13900HX + NVIDIA GeForce RTX 4070
Models
bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M
Problem description & steps to reproduce
cc @ochafik
I am attempting to run BFCL on llama-server, and so far I have triggered a crash twice. It does not appear to be deterministic, unfortunately. In one instance, I was able to catch the crash with gdb. Here is the end of the backtrace:
#87097 0x00005669dac2b7f9 in bool std::__detail::__regex_algo_impl<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char> >(__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, __gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__cxx11::match_results<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >&, std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> > const&, std::regex_constants::match_flag_type, std::__detail::_RegexExecutorPolicy, bool) ()
#87098 0x00007116a7f3ac54 in llama_grammar_accept_impl(llama_grammar&, int) () from /home/ed/Projects/llama.cpp/build/bin/libllama.so
#87099 0x00005669dadb179a in common_sampler_accept(common_sampler*, int, bool) ()
#87100 0x00005669dac5c626 in server_context::update_slots() ()
#87101 0x00005669dabe4886 in server_queue::start_loop() ()
#87102 0x00005669dabb0bc8 in main ()
The remaining 87096 stack frames were identical, which looks like runaway recursion inside the std::regex engine, i.e. most likely a stack overflow rather than an ordinary invalid access. While I have not yet been able to isolate the exact input that triggers the crash, I hope this is enough of a clue as to what is going on.
Here is some more information about what I am doing:
- The server is launched with:
  /home/ed/Projects/llama.cpp/build/bin/llama-server --ctx-size 0 --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M --host 0.0.0.0 -ngl 100
- BFCL is run with:
  python /home/ed/Projects/gorilla/berkeley-function-call-leaderboard/venv/bin/bfcl generate --model gpt-4-turbo-2024-04-09-FC --test-category all --include-input-log
- I added this patch so that the OpenAI client points at the local llama-server (a minimal standalone request along the same path is sketched after the diff):
diff --git a/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py b/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py
index fbf7c0f..fc0da1f 100644
--- a/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py
+++ b/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py
@@ -22,7 +22,7 @@ class OpenAIHandler(BaseHandler):
def __init__(self, model_name, temperature) -> None:
super().__init__(model_name, temperature)
self.model_style = ModelStyle.OpenAI
- self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
+ self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"), base_url="http://localhost:8080")
def decode_ast(self, result, language="Python"):
if "FC" in self.model_name or self.is_fc_model:
First Bad Commit
No response
Relevant log output
srv update_slots: all slots are idle
srv log_server_r: request: POST /chat/completions 127.0.0.1 200
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 0 | task 48450 | processing task
slot update_slots: id 0 | task 48450 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 326
slot update_slots: id 0 | task 48450 | kv cache rm [67, end)
slot update_slots: id 0 | task 48450 | prompt processing progress, n_past = 326, n_tokens = 259, progress = 0.794479
slot update_slots: id 0 | task 48450 | prompt done, n_past = 326, n_tokens = 259
slot release: id 0 | task 48450 | stop processing: n_past = 504, truncated = 0
slot print_timing: id 0 | task 48450 |
prompt eval time = 104.08 ms / 259 tokens ( 0.40 ms per token, 2488.52 tokens per second)
eval time = 3465.17 ms / 179 tokens ( 19.36 ms per token, 51.66 tokens per second)
total time = 3569.24 ms / 438 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /chat/completions 127.0.0.1 200
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 0 | task 48630 | processing task
slot update_slots: id 0 | task 48630 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 326
slot update_slots: id 0 | task 48630 | kv cache rm [67, end)
slot update_slots: id 0 | task 48630 | prompt processing progress, n_past = 326, n_tokens = 259, progress = 0.794479
slot update_slots: id 0 | task 48630 | prompt done, n_past = 326, n_tokens = 259
/home/ed/.local/share/dorothy/user/commands/llama-cpp-server: line 8: 709629 Segmentation fault (core dumped) ~/Projects/llama.cpp/build/bin/llama-server --ctx-size $CTX_SIZE --jinja -fa -hf "$MODEL" --host 0.0.0.0 -ngl $OFFLOAD_NUM $OTHERARGS