Make my experimental branch support Mixtral #4

kalomaze · 2023-12-15T12:32:20Z

nuff said


* Support attention_bias on LLaMA architecture
  QKVO bias, should fix InternLM (ggml-org#3133) and works for LLaMAfied Qwen models (ggml-org#3743 (comment)).
* check existence of qkvo bias while loading llama models
  Tested on LLaMA2, CUDA and CPU.
* Update llama.cpp


This commit updates the error message that is printed when the KV cache is not big enough to hold all the prompt and generated tokens. Specifically it removes the reference to n_parallel and replaces it with n_len.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Samplers sequence order w parameter
* Cleaned commented code
* Fixed formatting
* Rewrote with unordered_map
* Revert and rewrite, too many problems and safeguards would be needed
* Fixed code style
* Code style fixes according to review
* More readable samplers input string, fixed help
* Style fix in sampler_queue
* Formatting fixes
* Fixing whitespaces

)
* feat: Allow overriding GGUF metadata when loading model
* Fix the one time GCC is stricter than clang about something
* Step1
* Refactor... basically everything!
* Nuke obsolete GetArrayLen struct
* simplify std::string specialization
* Various cleanups
  Add informational output when overrides are applied
  Warn user when an override with the wrong type is specified
* Fix broken logic for parsing bool KV overrides
  Fix issue where overrides didn't apply when key missing in GGUF metadata
  Resolve merge changes
* llama : rearrange model params
* Update new GET_KEY call
  Add note that metadata KV overrides aren't reflected in initial metadata KV info dump
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

…l-org#4330)
* reserve space for codepoints
* improvement for the appended 0
* used precomputed token text for grammar sample
* reserve canidates_decoded
* reserve canidates_grammar
* remove candidates_decoded
* Revert "remove candidates_decoded"
  This reverts commit 3773328.
* changed decode_utf8 to take src by ref


* convert : support Mixtral as LLAMA arch
* convert : fix n_ff typo
* llama : model loading
* ggml : sync latest ggml_mul_mat_id
* llama : update graph to support MoE
* llama : fix cur -> cur_expert
* llama : first working version
* llama : fix expert weighting in the FFN
* ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)
* ggml : add n_as argument to ggml_mul_mat_id
* ggml : fix ggml_get_rows to take into account ne02 / ne11
* metal : add more general support for ggml_get_rows + tests
* llama : add basic support for offloading moe with CUDA
* metal : add/mul/div use general kernel when src1 not cont
* metal : reduce the kernel launches for ggml_mul_mat_id
* ggml : get_rows : support non-contiguos tensors with gaps, generalize up to 3D
* ggml : update get_rows f16 and q
* cuda : support non-contiguous src1 in get_rows
* llama : offload missing ffn_moe_silu
* metal : fix ggml_get_rows to work with non-cont src1
* metal : add indirect mat-vec kernels for all quantization types
* llama : do not quantize expert gating tensors
* llama : add n_expert and n_expert_used to hparams + change quants
* test-backend-ops : add moe test
* cuda : fix get_rows when ncols is odd
* convert : determine n_ctx correctly
* metal : fix ggml_mul_mat_id for F32
* test-backend-ops : make experts more evenly probable (test_moe)
* test-backend-ops : cleanup, add moe test for batches
* test-backend-ops : add cpy from f32 -> all types test
* test-backend-ops : fix dequantize block offset
* llama : fix hard-coded number of experts
* test-backend-ops : simplify and disable slow tests to avoid CI timeout
* test-backend-ops : disable MOE test with thread sanitizer
* cuda : fix mul_mat_id with multi gpu
* convert : use 1e6 rope_freq_base for mixtral
* convert : fix style
* convert : support safetensors format
* gguf-py : bump version
* metal : add cpy f16 -> f32 kernel
* metal : fix binary ops for ne10 % 4 != 0
* test-backend-ops : add one more sum_rows test
* ggml : do not use BLAS with ggml_mul_mat_id
* convert-hf : support for mixtral-instruct (ggml-org#4428)
  * convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct
  * convert : use sentencepiece tokenizer for Mixtral-instruct
  * convert : make flake8 happy
* metal : fix soft_max kernels
  ref: ggml-org/ggml@1914017
* metal : limit kernels to not use more than the allowed threads
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Radek Pilar <github@mrkva.eu>
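
For readers unfamiliar with the MoE terms in the commit above (`n_expert`, `n_expert_used`, "expert weighting in the FFN"), here is a rough sketch of Mixtral-style routing for a single token. It is plain Python, not the llama.cpp graph code. The function names `route_tokens` and `moe_ffn` are hypothetical, and the softmax, top-k, renormalize order is an assumption based on the commonly described Mixtral scheme.

```python
import math


def route_tokens(router_logits, n_expert_used):
    """Return (expert_index, weight) pairs for one token.

    Assumed scheme: softmax over all experts, keep the top
    n_expert_used, then renormalize the kept weights to sum to 1.
    """
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(v - peak) for v in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the n_expert_used most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = top[:n_expert_used]

    # Renormalize so the selected weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_ffn(x, experts, router_logits, n_expert_used=2):
    """Weighted sum of the selected experts' outputs for one token.

    `experts` is a list of callables (hypothetical stand-ins for the
    per-expert feed-forward networks), each mapping a vector to a vector.
    """
    out = [0.0] * len(x)
    for idx, weight in route_tokens(router_logits, n_expert_used):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out
```

Only the selected experts are evaluated for each token, which is why the commit list adds indirect (index-driven) kernels such as `ggml_mul_mat_id` and 2D `ggml_get_rows`.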


* .sh script V1
* koboldcpp.sh polish
* koboldcpp.sh dist generator
* Include html's in dist
* RWKV in Linux Dist
* Lower dependency requirements
* Eliminate wget dependency
* More distinct binary name
  I know its technically amd64, but I don't want to cause confusion among nvidia users.
* Use System OpenCL
  Unsure how this will behave in the pyinstaller build, but pocl ended up CPU only. With a bit of luck the pyinstaller uses the one from the actual system if compiled in a system without opencl, while conda now includes it for that specific system.
* Add cblas dependency
  Missing this causes compile failures on some system's
* ICD workaround
  Ideally we find a better solution, but conda forces ICD and needs this for the successful compile. However, pyinstaller then embeds the ICD causing it to be limited to the system it was compiled for. By temporarily removing the ICD pyinstaller can't find it and everything remains functional. Ideally we do this on a pyinstaller level, but I could not find any good options to do so yet.
* Fix & Nocuda
---------
Co-authored-by: root <root@DESKTOP-DQ1QRAG>


* sync : ggml (SD ops, tests, kernels)
  ggml-ci
* cuda : restore im2col
  ggml-ci
* metal : fix accuracy of dequantization kernels
  ggml-ci
* cuda : restore correct im2col
  ggml-ci
* metal : try to fix moe test by reducing expert size
  ggml-ci
* cuda : fix bin bcast when src1 and dst have different types
  ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>


* .sh script V1
* koboldcpp.sh polish
* koboldcpp.sh dist generator
* Include html's in dist
* RWKV in Linux Dist
* Lower dependency requirements
* Eliminate wget dependency
* More distinct binary name
  I know its technically amd64, but I don't want to cause confusion among nvidia users.
* Use System OpenCL
  Unsure how this will behave in the pyinstaller build, but pocl ended up CPU only. With a bit of luck the pyinstaller uses the one from the actual system if compiled in a system without opencl, while conda now includes it for that specific system.
* Add cblas dependency
  Missing this causes compile failures on some system's
* ICD workaround
  Ideally we find a better solution, but conda forces ICD and needs this for the successful compile. However, pyinstaller then embeds the ICD causing it to be limited to the system it was compiled for. By temporarily removing the ICD pyinstaller can't find it and everything remains functional. Ideally we do this on a pyinstaller level, but I could not find any good options to do so yet.
* Fix & Nocuda
* Automatically build Linux Binary
* Auto build on v tag
* Better on release
* Fix missing jobs:
* More distinct name
* I am to retro...
* Fix release upload
* Another upload attempt
* Another upload attempt
* Also rebuild on release edit
* Placebo commit to maybe fix CI
---------
Co-authored-by: root <root@DESKTOP-DQ1QRAG>


…3633)
* Add HFVocab into convert.py
* Update convert.py
* Update convert.py
* add bytes_to_unicode function
* change add_meta_vocab fucntion
* remove debug code
* remove byte_encoder
* Add newline between classes
* Check tokenizer.json when tokenizer.model is not exist.
* Move transformers dependency to local code
* Add error context with 'raise from'
* Add fast tokenizer option to BpeVocab
* Update convert.py
* Add VocabLoader and remove *Vocab class
* Add transformers dependency
* remove added tokens and check newline token to decide spm or bpe
* Update convert.py
* Add special token type
* Update convert.py
* Update convert.py
* Update convert.py
* Fix typo in convert.py
* Fix when params.n_vocab < tokenizer vocab size
* update vocab class
* change funtion name
* Remove unused variable/functions, add types to class variable and methods, delete blank liens
* fix flake8 warnings
* code style cleanup
* make mypy happy
* change exception
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>


ggerganov and others added 30 commits December 1, 2023 18:42

llama : fix integer overflow during quantization (ggml-org#4284)

880f579

happens with multi-threaded quantization of Qwen-72B

ggml-ci

llama : add Qwen support (ggml-org#4281)

37c746d

* enable qwen to llama.cpp
* llama : do not GPU split bias tensors
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

build : enable libstdc++ assertions for debug builds (ggml-org#4275)

511f52c

swift : fix token_to_piece implementation (ggml-org#4278)

b220222

* Fix token_to_piece implementation in Swift
* Fix errors

llama : support optional tensors (ggml-org#4283)

d5a1cbd

llama : avoid using "optional" keyword (ggml-org#4283)

5a7d312

token count includes ids

6570a20

llama : pad KV cache size (ggml-org#4280)

d7b800b

* llama : pad KV cache size to 32
* metal : try to improve batched decoding
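
One plausible reading of "pad KV cache size to 32" is that the number of KV cells used in a computation is rounded up to the next multiple of 32. The Python sketch below only illustrates that rounding. The helper name `pad_to_multiple` is hypothetical and this is not the llama.cpp implementation.

```python
def pad_to_multiple(n, multiple=32):
    """Round a non-negative count up to the nearest multiple of `multiple`."""
    if n < 0 or multiple <= 0:
        raise ValueError("n must be >= 0 and multiple must be > 0")
    # Integer arithmetic only: add (multiple - 1), then floor-divide.
    return ((n + multiple - 1) // multiple) * multiple
```

For example, 33 used cells would be padded to 64, while exactly 32 stays at 32.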

py : add grammar to oai like api (ggml-org#4294)

6949b50

server : fix OpenAI API stop field to be optional (ggml-org#4299)

33e171d

(cherry picked from commit mozilla-ai/llamafile@e8c92bc)

Revert "Revert "ggml : add ggml_soft_max_ext (ggml-org#4256)""

48544cd

This reverts commit a8e66ef.

ggml : fix soft max out-of-bounds access (ggml-org#4307)

adf3de4

ggml-ci

Merge branch 'master' into concedo_experimental

ac36aee

# Conflicts:
#   CMakeLists.txt
#   Makefile

ggml : reuse ggml_get_n_tasks() in ggml_graph_plan() (ggml-org#4308)

fbbc428

* ggml : fix soft max out-of-bounds access
  ggml-ci
* ggml : reuse ggml_get_n_tasks() in ggml_graph_plan()
  ggml-ci

Merge branch 'master' into concedo_experimental

8602f5a

grammar-parser : fix typo (ggml-org#4318)

4fa44e8

preceeding -> preceding

handle accidentally selecting a kcpps file as model instead

a5a5839

swift : fix prompt tokenization logic (ggml-org#4321)

5c9f90c

swift : fix concatenation method to avoid invalid UTF8 stringfication (ggml-org#4325)

d208995

swift : revert compiler checks for swift package (ggml-org#4332)

e4b76bb

improved exit logic

b6f952f

speculative : support --color (ggml-org#4343)

da5eaef

* speculative: add some colors
* minor : add braces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

common : fix compile warning

caa9249

very basic noscript mode

12002d8

noscript mode is done

c751152

kalomaze and others added 29 commits December 12, 2023 12:12

server : tweak default sampling parameters (ggml-org#4367)

fecac45

* Set a more typical Top P setting as the default
* Update temp max

do not display the "maybe" MMQ console output

4db9586

Merge branch 'master' into concedo_experimental

c2c238b

# Conflicts:
#   Makefile
#   tests/test-grad0.cpp
#   tests/test-quantize-perf.cpp

readme : update hot topics

113f994

common : add --version option to show build info in CLI (ggml-org#4433)

9fb13f9

Merge branch 'master' into concedo_experimental

e447af6

# Conflicts:
#   Makefile
#   README.md

update docs

2810151

build : detect host compiler and cuda compiler separately (ggml-org#4414)

70f806b

server : fix handling of characters that span multiple tokens when streaming (ggml-org#4446)

948ff13

Merge branch 'concedo' into concedo_experimental

ec2cf6c

removing existing yml files

8dd9756

Revert "lowvram var defaults"

0e31f53

This reverts commit 7a69152.

readme : update supported model list (ggml-org#4457)

0353a18

Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values

1ad8f0d

Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values

34b3dac

(cherry picked from commit 1ad8f0d)

Merge branch 'master' into concedo_experimental

c88fc19

# Conflicts:
#   CMakeLists.txt
#   Makefile
#   README.md

ggml : fix OpenCL broadcast requirement for ggml_mul (close ggml-org#4453)

55e87c3

do not cast to size_t, instead just use doubles

05f7db4

Merge branch 'pr_fix_buf_resize_type' into concedo_experimental

53bbd1e

Revert "Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values"

04bd895

This reverts commit 34b3dac.

fixed length exceeding max ctx

f0de495

Merge branch 'master' into concedo_experimental

aac7f0b

# Conflicts:
#   ggml.c

manual workflow for generating builds instead

ae3d829

Workflow Build from experimental branch

7798587

kalomaze merged commit 21c1421 into exp-dynatemp-minp-latest Dec 15, 2023


31 participants
