Can't run glm-4-9b-chat on cuda 12 #18

Open
alabulei1 opened this issue Jul 11, 2024 · 6 comments

@alabulei1
Collaborator

When I run glm-4-9b-chat-Q5_K_M.gguf on a CUDA 12 machine, the API server starts successfully. However, when I send a question, the API server crashes.

The command I used to start the API server is as follows:

wasmedge --dir .:. --nn-preload default:GGML:AUTO:glm-4-9b-chat-Q5_K_M.gguf \
  llama-api-server.wasm \
  --prompt-template glm-4-chat \
  --ctx-size 4096 \
  --model-name glm-4-9b-chat

Here is the error message:

[2024-07-11 07:50:12.036] [wasi_logging_stdout] [info] llama_core: llama_core::chat in llama-core/src/chat.rs:1315: Get the model metadata.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] llama_core: llama_core::chat in llama-core/src/chat.rs:1349: The model `internlm2_5-7b-chat` does not exist in the chat graphs.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] chat_completions_handler: llama_api_server::backend::ggml in llama-api-server/src/backend/ggml.rs:392: Failed to get chat completions. Reason: The model `internlm2_5-7b-chat` does not exist in the chat graphs.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] response: llama_api_server::error in llama-api-server/src/error.rs:25: 500 Internal Server Error: Failed to get chat completions. Reason: The model `internlm2_5-7b-chat` does not exist in the chat graphs.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [info] chat_completions_handler: llama_api_server::backend::ggml in llama-api-server/src/backend/ggml.rs:399: Send the chat completion response.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] response: llama_api_server in llama-api-server/src/main.rs:518: version: HTTP/1.1, body_size: 133, status: 500, is_informational: false, is_success: false, is_redirection: false, is_client_error: false, is_server_error: true

Versions

[2024-07-11 07:47:51.368] [wasi_logging_stdout] [info] server_config: llama_api_server in llama-api-server/src/main.rs:131: server version: 0.12.3
@apepkuss
Collaborator

apepkuss commented Jul 11, 2024

According to the log message below, the model name in the request was incorrectly set to internlm2_5-7b-chat. The correct name is glm-4-9b-chat, which was passed to the --model-name option in the command.

[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] llama_core: llama_core::chat in llama-core/src/chat.rs:1349: The model `internlm2_5-7b-chat` does not exist in the chat graphs.

@alabulei1
Collaborator Author

In case there is something wrong with the chatbot UI, I sent an API request to the model directly:

curl -X POST http://localhost:8080/v1/chat/completions \
    -H 'accept:application/json' \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "Plan me a two day trip in Paris"}], "model":"glm-4-9b-chat"}'
curl: (52) Empty reply from server

The log is as follows:

[2024-07-11 09:59:04.506] [info] [WASI-NN] llama.cpp: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model:      CUDA0 compute buffer size =   304.00 MiB
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: graph nodes  = 1606
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: graph splits = 2
[2024-07-11 09:59:04.743] [error] [WASI-NN] llama.cpp: CUDA error: an illegal memory access was encountered
[2024-07-11 09:59:04.743] [error] [WASI-NN] llama.cpp:   current device: 0, in function launch_mul_mat_q at /__w/wasi-nn-ggml-plugin/wasi-nn-ggml-plugin/build/_deps/llama-src/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2602
[2024-07-11 09:59:04.743] [error] [WASI-NN] llama.cpp:   cudaFuncSetAttribute(mul_mat_q<type, mmq_x, 8, false>, cudaFuncAttributeMaxDynamicSharedMemorySize, shmem)
GGML_ASSERT: /__w/wasi-nn-ggml-plugin/wasi-nn-ggml-plugin/build/_deps/llama-src/ggml/src/ggml-cuda.cu:101: !"CUDA error"
Aborted (core dumped)

@apepkuss
Collaborator

A CUDA error was triggered. It seems like a memory issue. @hydai Could you please help with this issue? Thanks!

@hydai

hydai commented Jul 11, 2024

Is this model supported by llama.cpp?

@apepkuss
Collaborator

Yes.

@hydai

hydai commented Jul 11, 2024

Could you please try running this model with llama.cpp built with CUDA enabled? Since this is an internal error in the llama.cpp CUDA backend, I would like to know whether this is an upstream issue.
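
For reference, a minimal way to reproduce with upstream llama.cpp built against CUDA might look like the sketch below. The model path, the -ngl value, and the prompt are placeholders, and the -DGGML_CUDA=ON flag and the llama-cli binary name assume a mid-2024 llama.cpp revision:

# build llama.cpp with the CUDA backend
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# run the same GGUF file with all layers offloaded to the GPU
./build/bin/llama-cli \
  -m ./glm-4-9b-chat-Q5_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  -p "Plan me a two day trip in Paris"

If the same illegal-memory-access assertion shows up here, the problem is most likely in the upstream CUDA kernels rather than in the WasmEdge plugin.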
