feat: support StarCoder model architectures #3187
Conversation
Looks good so far - let us know if you hit any roadblocks
The remaining part for now is lines 3580 to 3718 in llama.cpp. It shouldn't be hard to figure out once I've set up a development environment to make sure the matrix shape arithmetic is correct...
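Since the work turns on getting that arithmetic right, here is a rough numpy sketch of the StarCoder attention shapes. StarCoder uses multi-query attention, so the fused c_attn projection yields n_head query heads but only one shared K/V head. All sizes and names below are illustrative (roughly 1B-sized dims), not the actual llama.cpp code:

```python
# Illustrative multi-query attention shape arithmetic (not llama.cpp code).
import numpy as np

n_ctx, n_embd, n_head = 8, 2048, 16   # roughly StarCoder-1B sized
head_dim = n_embd // n_head

x = np.random.randn(n_ctx, n_embd).astype(np.float32)

# MQA: c_attn projects to n_embd (Q) + 2 * head_dim (one shared K and V head).
w_attn = np.random.randn(n_embd, n_embd + 2 * head_dim).astype(np.float32)
qkv = x @ w_attn                                     # (n_ctx, n_embd + 2*head_dim)
q = qkv[:, :n_embd].reshape(n_ctx, n_head, head_dim)
k = qkv[:, n_embd:n_embd + head_dim]                 # (n_ctx, head_dim), shared
v = qkv[:, n_embd + head_dim:]                       # (n_ctx, head_dim), shared

# Every query head attends against the same K/V head.
scores = np.einsum('qhd,kd->hqk', q, k) / np.sqrt(head_dim)  # (n_head, n_ctx, n_ctx)
print(scores.shape)
```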
OK, I think I got a version running on CPU:

```
$ make main && ./bin/main -m ../models/starcoder-1b.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 0

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = -1.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0

# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:

def dijkstra(graph, start):
    """
    Returns the shortest path from `start` to all other nodes in `graph`.
    The graph is represented as a dictionary of dictionaries. Each key represents a node and each value is another dictionary with keys 'to' and 'cost'.
    """
    # Initialize the distances array to infinity
    distances = [float('inf') for _ in range(len(graph))]
    distances[start] = 0
    # Initialize the previous array to None
    previous = [None for _ in range(len(graph))]
    # Loop through all nodes and find the shortest path

llama_print_timings: load time = 110.20 ms
llama_print_timings: sample time = 134.80 ms / 128 runs ( 1.05 ms per token, 949.55 tokens per second)
llama_print_timings: prompt eval time = 262.29 ms / 20 tokens ( 13.11 ms per token, 76.25 tokens per second)
llama_print_timings: eval time = 3485.94 ms / 127 runs ( 27.45 ms per token, 36.43 tokens per second)
llama_print_timings: total time = 3914.92 ms
```

But it's currently buggy in Metal:

```
$ make main && ./bin/main -m ../models/starcoder-1b.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 1
system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = -1.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:
<|endoftext|> [end of text]
llama_print_timings: load time = 232.01 ms
llama_print_timings: sample time = 1.26 ms / 1 runs ( 1.26 ms per token, 791.14 tokens per second)
llama_print_timings: prompt eval time = 21.64 ms / 20 tokens ( 1.08 ms per token, 924.17 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 23.29 ms
ggml_metal_free: deallocating
Log end
```

Looking into it...
@wsxiaoys There was a bug in the soft max Metal kernel. Can you give me access to push a fix?

```
$ git push tabbyml HEAD:support-starcoder
remote: Permission to TabbyML/llama.cpp.git denied to ggerganov.
fatal: unable to access 'https://github.com/TabbyML/llama.cpp/': The requested URL returned error: 403
```

Or I can push it to a branch in this repo? Either way works for me.

Edit: created a PR here - TabbyML#2 ("Support starcoder fix")
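The thread doesn't spell out the kernel bug beyond the commit message further down ("metal : fix out-of-bounds access in soft_max kernels"), but the classic failure mode for this kind of strided parallel reduction is an unguarded tail when the row length isn't a multiple of the thread count. A toy Python model of the guarded access pattern, purely illustrative and not the actual Metal code:

```python
# Toy model of a strided (GPU-style) reduction over one softmax row.
# Purely illustrative; the real fix is in llama.cpp's Metal soft_max kernel.
import math

def softmax_row(row, n_threads=32):
    n = len(row)
    partial_max = [-math.inf] * n_threads
    # Each "thread" t visits indices t, t + n_threads, ...; stopping the
    # stride before n is exactly the bounds guard that prevents reading
    # past the end of the row when n is not a multiple of n_threads.
    for t in range(n_threads):
        for i in range(t, n, n_threads):
            partial_max[t] = max(partial_max[t], row[i])
    m = max(partial_max)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

print(softmax_row([1.0, 2.0, 3.0]))
```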
Thanks for the fix! Will clean up the implementation a bit, then send it out for review. Here are some benchmark numbers.
Follow-up PRs:

* feat: support starcoder mqa (see the sketch below for what supporting MQA "directly" means)
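For readers skimming: supporting multi-query attention directly means keeping the checkpoint's single K/V head (head_count_kv = 1) and broadcasting it across the query heads at attention time, instead of duplicating it at conversion time. A minimal numpy illustration of the broadcast; shapes and names are mine, not the follow-up PR's code:

```python
# Minimal sketch of attention with head_count_kv = 1 via broadcasting.
import numpy as np

n_ctx, n_head, head_dim = 8, 16, 128

q = np.random.randn(n_head, n_ctx, head_dim).astype(np.float32)
k = np.random.randn(1, n_ctx, head_dim).astype(np.float32)  # one shared head

# matmul broadcasts the leading (head) dimension: 16 Q heads against 1 K head.
scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (n_head, n_ctx, n_ctx)
print(scores.shape)
```

This avoids materializing n_head copies of K/V in the weights and in the KV cache.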
PR ready for review now 🥇
@monatis Do we need to bump gguf.py version after this change?
* add placeholder of starcoder in gguf / llama.cpp
* support convert starcoder weights to gguf
* convert MQA to MHA
* fix ffn_down name
* add LLM_ARCH_STARCODER to llama.cpp
* set head_count_kv = 1
* load starcoder weight
* add max_position_embeddings
* set n_positions to max_position_embeddings
* properly load all starcoder params
* fix head count kv
* fix comments
* fix vram calculation for starcoder
* store mqa directly
* add input embeddings handling
* add TBD
* working in cpu, metal buggy
* cleanup useless code
* metal : fix out-of-bounds access in soft_max kernels
* llama : make starcoder graph build more consistent with others
* refactor: cleanup comments a bit
* add other starcoder models: 3B, 7B, 15B
* support-mqa-directly
* fix: remove max_position_embeddings, use n_train_ctx
* Update llama.cpp (Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>)
* Update llama.cpp (Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>)
* Apply suggestions from code review (Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>)
* fix: switch to space from tab

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
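The "convert MQA to MHA" step early in this list amounts to tiling the shared K/V slice of the fused c_attn weight once per query head before writing the tensors out. A hedged numpy sketch of that idea (tensor names and sizes assumed, not the actual convert script):

```python
# Hedged sketch of MQA -> MHA weight expansion at conversion time.
import numpy as np

n_embd, n_head = 2048, 16            # assumed 1B-sized dims
head_dim = n_embd // n_head

# Suppose the shared K projection has been sliced out of the fused c_attn.
k_mqa = np.random.randn(head_dim, n_embd).astype(np.float32)

# Repeat the single shared head once per query head.
k_mha = np.tile(k_mqa, (n_head, 1))  # (n_embd, n_embd)
print(k_mha.shape)
```

The duplicated heads waste weight and KV-cache memory, which is presumably why the later commits ("store mqa directly", "support-mqa-directly") keep head_count_kv = 1 instead.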
#3076
Still in progress, but the model conversion / params loading part seems to be working.

Tabby has integrated llama.cpp and released v0.1.1 🎉. It now offers native support for Metal inference and the StarCoder model!
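On the conversion side, a heavily hedged sketch of what writing the StarCoder hyperparameters with the gguf Python package could look like. Method names are from my memory of gguf.py at the time and the dims are assumed 1B values; the authoritative version is the convert script added in this PR:

```python
# Heavily hedged sketch: writing StarCoder metadata to GGUF.
# API names from memory of the gguf Python package; verify against the repo.
import gguf

writer = gguf.GGUFWriter("starcoder-1b.gguf", "starcoder")
writer.add_architecture()
writer.add_name("StarCoder")
writer.add_context_length(8192)   # n_train_ctx, per the commit list above
writer.add_embedding_length(2048)
writer.add_block_count(24)
writer.add_head_count(16)
writer.add_head_count_kv(1)       # MQA: one shared K/V head
writer.add_layer_norm_eps(1e-5)
# ... writer.add_tensor(...) for each converted weight ...

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```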