
Eval bug: Excessive stack usage during tool calling #12234

Closed
@edmcman

Description


Name and Version

./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
version: 4840 (3ffbbd5)
built with Ubuntu clang version 18.1.8 (++20240731024944+3b5b5c1ec4a3-1exp120240731145000.144) for x86_64-pc-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

i9-13900HX + NVIDIA GeForce RTX 4070

Models

bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M

Problem description & steps to reproduce

cc/@ochafik

I am attempting to run BFCL on llama-server, and so far I have triggered a crash twice. It does not appear to be deterministic, unfortunately. In one instance, I was able to catch the crash with gdb. Here is the end of the backtrace:

#87097 0x00005669dac2b7f9 in bool std::__detail::__regex_algo_impl<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char> >(__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, __gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__cxx11::match_results<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >&, std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> > const&, std::regex_constants::match_flag_type, std::__detail::_RegexExecutorPolicy, bool) ()
#87098 0x00007116a7f3ac54 in llama_grammar_accept_impl(llama_grammar&, int) () from /home/ed/Projects/llama.cpp/build/bin/libllama.so
#87099 0x00005669dadb179a in common_sampler_accept(common_sampler*, int, bool) ()
#87100 0x00005669dac5c626 in server_context::update_slots() ()
#87101 0x00005669dabe4886 in server_queue::start_loop() ()
#87102 0x00005669dabb0bc8 in main ()

The remaining 87096 stack frames (not shown) were all identical. While I have not yet been able to isolate the exact input that triggers the crash, I hope this is enough of a clue as to what is going on.
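For what it's worth, a segfault with roughly 87k frames of the same function looks like a plain stack overflow from the recursion above rather than a bad pointer. A crude way to test that theory (not a fix) would be to launch the server with a much larger stack limit; the sketch below does that, where the 512 MiB figure and the Python wrapper are arbitrary choices of mine:

import resource
import subprocess

# Raise the main-thread stack limit (512 MiB here, picked arbitrarily);
# the soft limit is inherited by the child process at exec time on Linux.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
want = 512 * 1024 * 1024
new_soft = want if hard == resource.RLIM_INFINITY else min(want, hard)
resource.setrlimit(resource.RLIMIT_STACK, (new_soft, hard))

subprocess.run([
    "/home/ed/Projects/llama.cpp/build/bin/llama-server",
    "--ctx-size", "0", "--jinja", "-fa",
    "-hf", "bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M",
    "--host", "0.0.0.0", "-ngl", "100",
])

If the crash disappears (or merely takes longer to hit) with the larger stack, that would support the stack-overflow theory.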

Here is some more information about what I am doing:

  • /home/ed/Projects/llama.cpp/build/bin/llama-server --ctx-size 0 --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M --host 0.0.0.0 -ngl 100
  • python /home/ed/Projects/gorilla/berkeley-function-call-leaderboard/venv/bin/bfcl generate --model gpt-4-turbo-2024-04-09-FC --test-category all --include-input-log
  • I added this patch (a rough standalone approximation of the resulting requests follows below):
diff --git a/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py b/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py
index fbf7c0f..fc0da1f 100644
--- a/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py
+++ b/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py
@@ -22,7 +22,7 @@ class OpenAIHandler(BaseHandler):
     def __init__(self, model_name, temperature) -> None:
         super().__init__(model_name, temperature)
         self.model_style = ModelStyle.OpenAI
-        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
+        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"), base_url="http://localhost:8080")
 
     def decode_ast(self, result, language="Python"):
         if "FC" in self.model_name or self.is_fc_model:

First Bad Commit

No response

Relevant log output

srv  update_slots: all slots are idle
srv  log_server_r: request: POST /chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 48450 | processing task
slot update_slots: id  0 | task 48450 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 326
slot update_slots: id  0 | task 48450 | kv cache rm [67, end)
slot update_slots: id  0 | task 48450 | prompt processing progress, n_past = 326, n_tokens = 259, progress = 0.794479
slot update_slots: id  0 | task 48450 | prompt done, n_past = 326, n_tokens = 259
slot      release: id  0 | task 48450 | stop processing: n_past = 504, truncated = 0
slot print_timing: id  0 | task 48450 | 
prompt eval time =     104.08 ms /   259 tokens (    0.40 ms per token,  2488.52 tokens per second)
       eval time =    3465.17 ms /   179 tokens (   19.36 ms per token,    51.66 tokens per second)
      total time =    3569.24 ms /   438 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 48630 | processing task
slot update_slots: id  0 | task 48630 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 326
slot update_slots: id  0 | task 48630 | kv cache rm [67, end)
slot update_slots: id  0 | task 48630 | prompt processing progress, n_past = 326, n_tokens = 259, progress = 0.794479
slot update_slots: id  0 | task 48630 | prompt done, n_past = 326, n_tokens = 259
/home/ed/.local/share/dorothy/user/commands/llama-cpp-server: line 8: 709629 Segmentation fault      (core dumped) ~/Projects/llama.cpp/build/bin/llama-server --ctx-size $CTX_SIZE --jinja -fa -hf "$MODEL" --host 0.0.0.0 -ngl $OFFLOAD_NUM $OTHERARGS
