feat: support StarCoder model architectures #3187
Conversation
Looks good so far - let us know if you hit any roadblocks!
The remaining part for now is from line 3580 to line 3718 in llama.cpp. It should not be very hard to figure out once I have set up a development environment to ensure the matrix shape arithmetic is correct...
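For anyone following along, here's a rough numpy sketch of the shape arithmetic in question, assuming StarCoder-style multi-query attention (one K/V head shared across all query heads). All sizes and names are illustrative only, not the actual llama.cpp code:

    import numpy as np

    n_embd, n_head, n_tok = 2048, 16, 8        # StarCoder-1B-like sizes (illustrative)
    head_dim = n_embd // n_head                # 128

    # c_attn projects to Q (n_embd wide) plus a single shared K and V head
    qkv = np.random.randn(n_tok, n_embd + 2 * head_dim)
    q = qkv[:, :n_embd].reshape(n_tok, n_head, head_dim)   # (T, H, D)
    k = qkv[:, n_embd:n_embd + head_dim]                   # (T, D) -- one head
    v = qkv[:, n_embd + head_dim:]                         # (T, D) -- one head

    # Every query head attends against the same K/V (causal mask omitted)
    scores = np.einsum("thd,sd->hts", q, k) / np.sqrt(head_dim)   # (H, T, T)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.einsum("hts,sd->thd", probs, v).reshape(n_tok, n_embd)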
OK, I think I got a version running on CPU:

> make main && ./bin/main -m ../models/starcoder-1b.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 0
system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = -1.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:
def dijkstra(graph, start):
"""
Returns the shortest path from `start` to all other nodes in `graph`.
The graph is represented as a dictionary of dictionaries. Each key represents a node and each value is another dictionary with keys 'to' and 'cost'.
"""
# Initialize the distances array to infinity
distances = [float('inf') for _ in range(len(graph))]
distances[start] = 0
# Initialize the previous array to None
previous = [None for _ in range(len(graph))]
# Loop through all nodes and find the shortest path
llama_print_timings: load time = 110.20 ms
llama_print_timings: sample time = 134.80 ms / 128 runs ( 1.05 ms per token, 949.55 tokens per second)
llama_print_timings: prompt eval time = 262.29 ms / 20 tokens ( 13.11 ms per token, 76.25 tokens per second)
llama_print_timings: eval time = 3485.94 ms / 127 runs ( 27.45 ms per token, 36.43 tokens per second)
llama_print_timings: total time = 3914.92 ms

But it's currently buggy on Metal:

> make main && ./bin/main -m ../models/starcoder-1b.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 1
system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = -1.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:
<|endoftext|> [end of text]
llama_print_timings: load time = 232.01 ms
llama_print_timings: sample time = 1.26 ms / 1 runs ( 1.26 ms per token, 791.14 tokens per second)
llama_print_timings: prompt eval time = 21.64 ms / 20 tokens ( 1.08 ms per token, 924.17 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 23.29 ms
ggml_metal_free: deallocating
Log end

Looking into it...
@wsxiaoys There was a bug in the soft_max Metal kernel. Can you give me access to push a fix?

$ git push tabbyml HEAD:support-starcoder
remote: Permission to TabbyML/llama.cpp.git denied to ggerganov.
fatal: unable to access 'https://github.com/TabbyML/llama.cpp/': The requested URL returned error: 403

Or I can push it to a branch in this repo? Either way works for me.

Edit: created a PR here - TabbyML#2
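For reference, the math the kernel computes is just a numerically stable softmax; the Metal bug was an out-of-bounds access in the parallel reduction, not the math itself. A minimal Python sketch of the intended computation:

    import numpy as np

    def softmax(x: np.ndarray) -> np.ndarray:
        # Subtract the row max before exponentiating so large logits don't overflow
        m = x.max(axis=-1, keepdims=True)
        e = np.exp(x - m)
        return e / e.sum(axis=-1, keepdims=True)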
Support starcoder fix
Thanks for the fix! Will clean up the implementation a bit, then send it out for review. Here are some benchmark numbers:
Follow-up PRs:
* feat: support starcoder mqa
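For context: the initial conversion expanded the single shared K/V head into per-head copies (MQA -> MHA), and this follow-up stores the MQA weights directly. A rough numpy illustration of what that expansion costs - names and sizes are illustrative, not the convert script's actual code:

    import numpy as np

    n_head, head_dim, n_embd = 16, 128, 2048
    k_mqa = np.random.randn(head_dim, n_embd)   # one shared K head

    # MQA -> MHA: replicate the shared head for every query head.
    # This multiplies the K/V weight (and KV cache) footprint by n_head.
    k_mha = np.tile(k_mqa, (n_head, 1))
    assert k_mha.shape == (n_head * head_dim, n_embd)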
PR ready for review now 🥇
gguf-py/gguf/gguf.py
@monatis Do we need to bump the gguf.py version after this change?
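Not authoritative, but for anyone wondering what that version gates: a GGUF file starts with the magic b"GGUF" followed by a little-endian uint32 format version, so you can check what a converted file was written with:

    import struct

    def gguf_version(path: str) -> int:
        # GGUF header: 4-byte magic b"GGUF", then a little-endian uint32 version
        with open(path, "rb") as f:
            magic, version = struct.unpack("<4sI", f.read(8))
        assert magic == b"GGUF", f"not a GGUF file: {magic!r}"
        return version

    # e.g. gguf_version("../models/starcoder-1b.gguf")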
* add placeholder of starcoder in gguf / llama.cpp
* support convert starcoder weights to gguf
* convert MQA to MHA
* fix ffn_down name
* add LLM_ARCH_STARCODER to llama.cpp
* set head_count_kv = 1
* load starcoder weight
* add max_position_embeddings
* set n_positions to max_position_embeddings
* properly load all starcoder params
* fix head count kv
* fix comments
* fix vram calculation for starcoder
* store mqa directly
* add input embeddings handling
* add TBD
* working in cpu, metal buggy
* cleanup useless code
* metal : fix out-of-bounds access in soft_max kernels
* llama : make starcoder graph build more consistent with others
* refactor: cleanup comments a bit
* add other starcoder models: 3B, 7B, 15B
* support-mqa-directly
* fix: remove max_position_embeddings, use n_train_ctx
* Update llama.cpp (Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>)
* Update llama.cpp (Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>)
* Apply suggestions from code review (Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>)
* fix: switch to space from tab

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
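As a rough sketch of the metadata the conversion ends up writing (method names are from gguf-py; the parameter values are illustrative for the 1B model, and the real logic lives in the convert script):

    import gguf

    w = gguf.GGUFWriter("starcoder-1b.gguf", "starcoder")
    w.add_context_length(8192)     # n_train_ctx, replacing max_position_embeddings
    w.add_embedding_length(2048)
    w.add_block_count(24)
    w.add_head_count(16)
    w.add_head_count_kv(1)         # multi-query attention: one shared K/V head
    # ... then add tensors and write the header, KV data, and tensors to disk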
#3076
Still in progress, but the model conversion / params loading part seems to be working.

Tabby has integrated llama.cpp and released v0.1.1 🎉. It now offers native support for Metal inference and the StarCoder model!