Can't Quantize gguf files: zsh: illegal hardware instruction on M1 MacBook Pro #3983

Closed
joseph777111 opened this issue Nov 8, 2023 · 8 comments
Labels
bug-unconfirmed macos Issues specific to macOS

joseph777111 commented Nov 8, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [Y] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [Y] I carefully followed the README.md.
  • [Y] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [Y] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Quantization completes, and I can run the large language models that I convert to GGUF on my M1 MacBook Pro.

Current Behavior

Quantization halts due to "zsh: illegal hardware instruction".

Environment and Context

OS: macOS Sonoma
System: 2020 M1 MacBook Pro 16GB RAM
Xcode: Version 15.0.1 (15A507)
Apple clang version 15.0.0 (clang-1500.0.40.1)
Make 3.81 (GNU)
Python 3.11.5
Homebrew 4.1.19
Anaconda3 (23.10.0)

llama.cpp $ git log | head -1 
commit 381efbf480959bb6d1e247a8b0c2328f22e350f8

$ uname -a 
Darwin Kernel Version 23.1.0: Mon Oct  9 21:28:12 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T8103 arm64

$ python -m pip list | egrep "torch|numpy|sentencepiece"
numpy                         1.24.4
numpydoc                      1.5.0
torch                         2.1.0
torchvision                   0.16.0

$ python3 --version
Python 3.11.5

$ make  --version | head -1
GNU Make 4.3

$ g++ --version
Apple clang version 15.0.0 (clang-1500.0.40.1)
Target: arm64-apple-darwin23.1.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

$ brew --version
Homebrew 4.1.19

$ conda --version
conda 23.10.0

$ file quantize
quantize: Mach-O 64-bit executable arm64

Failure Information (for bugs)

zsh: illegal hardware instruction
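
For anyone trying to confirm the failure mode: zsh prints this message when a process is killed by SIGILL. Below is a minimal Python sketch, not part of llama.cpp, that runs a command and reports which signal (if any) ended it. The helper names `describe_exit` and `run_and_describe` are made up for illustration, and it assumes a POSIX system, where `subprocess` reports a signal death as a negative return code.

```python
import signal
import subprocess
import sys
from typing import Sequence


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code -N means the child was
        # terminated by signal N.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_and_describe(cmd: Sequence[str]) -> str:
    """Run cmd, wait for it to finish, and describe how it ended."""
    result = subprocess.run(list(cmd), check=False)
    return describe_exit(result.returncode)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the command to run.
    print(run_and_describe(sys.argv[1:]))
```

Saved under a hypothetical name such as `check_exit.py`, it could be invoked as `python3 check_exit.py ../llama.cpp/quantize ggml-model-f16.gguf test.gguf 17`. If the crash really is an illegal instruction, I would expect it to report SIGILL rather than an ordinary exit status.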

Steps to Reproduce

  1. Download the Llama-2-13B-Chat files from Hugging Face.
  2. Convert Llama-2-13B-Chat to GGUF (F16).
  3. Attempt to quantize ggml-model-f16.gguf.
  4. quantize halts midway through the quantization process with the "zsh: illegal hardware instruction" error.

Failure Logs

$ ../llama.cpp/quantize ggml-model-f16.gguf test.gguf 17
main: build = 1493 (381efbf)
main: built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin23.1.0
main: quantizing 'ggml-model-f16.gguf' to 'test.gguf' as Q5_K
llama_model_loader: loaded meta data with 18 key-value pairs and 363 tensors from ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    6:            blk.0.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor    7:              blk.0.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    8:           blk.0.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    9:            blk.0.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   10:              blk.1.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   11:              blk.1.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   12:              blk.1.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   13:         blk.1.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   14:            blk.1.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   15:            blk.1.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   16:              blk.1.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   17:           blk.1.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   18:            blk.1.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   19:              blk.2.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   20:              blk.2.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   21:              blk.2.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   22:         blk.2.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   23:            blk.2.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   24:            blk.2.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   25:              blk.2.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   26:           blk.2.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   27:            blk.2.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   28:              blk.3.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   29:              blk.3.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   30:              blk.3.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   31:         blk.3.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   32:            blk.3.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   33:            blk.3.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   34:              blk.3.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   35:           blk.3.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   36:            blk.3.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   37:              blk.4.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   38:              blk.4.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   39:              blk.4.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   40:         blk.4.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   41:            blk.4.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   42:            blk.4.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   43:              blk.4.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   44:           blk.4.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   45:            blk.4.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   46:              blk.5.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   47:              blk.5.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   48:              blk.5.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   49:         blk.5.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   50:            blk.5.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   51:            blk.5.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   52:              blk.5.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   53:           blk.5.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   54:            blk.5.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   55:              blk.6.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   56:              blk.6.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   57:              blk.6.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   58:         blk.6.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   59:            blk.6.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   60:            blk.6.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   61:              blk.6.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   62:           blk.6.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   63:            blk.6.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   64:              blk.7.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   65:              blk.7.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   66:              blk.7.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   67:         blk.7.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   68:            blk.7.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   69:            blk.7.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   70:              blk.7.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   71:           blk.7.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   72:            blk.7.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   73:              blk.8.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   74:              blk.8.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   75:              blk.8.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   76:         blk.8.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   77:            blk.8.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   78:            blk.8.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   79:              blk.8.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   80:           blk.8.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   81:            blk.8.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   82:              blk.9.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   83:              blk.9.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   84:              blk.9.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   85:         blk.9.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   86:            blk.9.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   87:            blk.9.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   88:              blk.9.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   89:           blk.9.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   90:            blk.9.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   91:             blk.10.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   92:             blk.10.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   93:             blk.10.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   94:        blk.10.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   95:           blk.10.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   96:           blk.10.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   97:             blk.10.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   98:          blk.10.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   99:           blk.10.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  100:             blk.11.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  101:             blk.11.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  102:             blk.11.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  103:        blk.11.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  104:           blk.11.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  105:           blk.11.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  106:             blk.11.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  107:          blk.11.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  108:           blk.11.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  109:             blk.12.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  110:             blk.12.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  111:             blk.12.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  112:        blk.12.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  113:           blk.12.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  114:           blk.12.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  115:             blk.12.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  116:          blk.12.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  117:           blk.12.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  118:             blk.13.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  119:             blk.13.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  120:             blk.13.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  121:        blk.13.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  122:           blk.13.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  123:           blk.13.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  124:             blk.13.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  125:          blk.13.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  126:           blk.13.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  127:             blk.14.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  128:             blk.14.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  129:             blk.14.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  130:        blk.14.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  131:           blk.14.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  132:           blk.14.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  133:             blk.14.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  134:          blk.14.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  135:           blk.14.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  136:             blk.15.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  137:             blk.15.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  138:             blk.15.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  139:        blk.15.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  140:           blk.15.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  141:           blk.15.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  142:             blk.15.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  143:          blk.15.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  144:           blk.15.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  145:             blk.16.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  146:             blk.16.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  147:             blk.16.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  148:        blk.16.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  149:           blk.16.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  150:           blk.16.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  151:             blk.16.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  152:          blk.16.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  153:           blk.16.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  154:             blk.17.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  155:             blk.17.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  156:             blk.17.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  157:        blk.17.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  158:           blk.17.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  159:           blk.17.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  160:             blk.17.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  161:          blk.17.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  162:           blk.17.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  163:             blk.18.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  164:             blk.18.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  165:             blk.18.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  166:        blk.18.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  167:           blk.18.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  168:           blk.18.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  169:             blk.18.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  170:          blk.18.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  171:           blk.18.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  172:             blk.19.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  173:             blk.19.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  174:             blk.19.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  175:        blk.19.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  176:           blk.19.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  177:           blk.19.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  178:             blk.19.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  179:          blk.19.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  180:           blk.19.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  181:             blk.20.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  182:             blk.20.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  183:             blk.20.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  184:        blk.20.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  185:           blk.20.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  186:           blk.20.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  187:             blk.20.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  188:          blk.20.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  189:           blk.20.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  190:             blk.21.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  191:             blk.21.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  192:             blk.21.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  193:        blk.21.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  194:           blk.21.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  195:           blk.21.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  196:             blk.21.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  197:          blk.21.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  198:           blk.21.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  199:             blk.22.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  200:             blk.22.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  201:             blk.22.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  202:        blk.22.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  203:           blk.22.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  204:           blk.22.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  205:             blk.22.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  206:          blk.22.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  207:           blk.22.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  208:             blk.23.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  209:             blk.23.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  210:             blk.23.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  211:        blk.23.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  212:           blk.23.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  213:           blk.23.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  214:             blk.23.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  215:          blk.23.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  216:           blk.23.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  217:             blk.24.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  218:             blk.24.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  219:             blk.24.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  220:        blk.24.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  221:           blk.24.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  222:           blk.24.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  223:             blk.24.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  224:          blk.24.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  225:           blk.24.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  226:             blk.25.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  227:             blk.25.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  228:             blk.25.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  229:        blk.25.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  230:           blk.25.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  231:           blk.25.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  232:             blk.25.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  233:          blk.25.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  234:           blk.25.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  235:             blk.26.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  236:             blk.26.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  237:             blk.26.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  238:        blk.26.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  239:           blk.26.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  240:           blk.26.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  241:             blk.26.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  242:          blk.26.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  243:           blk.26.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  244:             blk.27.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  245:             blk.27.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  246:             blk.27.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  247:        blk.27.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  248:           blk.27.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  249:           blk.27.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  250:             blk.27.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  251:          blk.27.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  252:           blk.27.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  253:             blk.28.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  254:             blk.28.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  255:             blk.28.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  256:        blk.28.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  257:           blk.28.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  258:           blk.28.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  259:             blk.28.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  260:          blk.28.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  261:           blk.28.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  262:             blk.29.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  263:             blk.29.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  264:             blk.29.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  265:        blk.29.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  266:           blk.29.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  267:           blk.29.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  268:             blk.29.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  269:          blk.29.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  270:           blk.29.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  271:             blk.30.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  272:             blk.30.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  273:             blk.30.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  274:        blk.30.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  275:           blk.30.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  276:           blk.30.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  277:             blk.30.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  278:          blk.30.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  279:           blk.30.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  280:             blk.31.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  281:             blk.31.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  282:             blk.31.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  283:        blk.31.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  284:           blk.31.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  285:           blk.31.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  286:             blk.31.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  287:          blk.31.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  288:           blk.31.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  289:             blk.32.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  290:             blk.32.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  291:             blk.32.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  292:        blk.32.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  293:           blk.32.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  294:           blk.32.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  295:             blk.32.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  296:          blk.32.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  297:           blk.32.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  298:             blk.33.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  299:             blk.33.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  300:             blk.33.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  301:        blk.33.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  302:           blk.33.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  303:           blk.33.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  304:             blk.33.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  305:          blk.33.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  306:           blk.33.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  307:             blk.34.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  308:             blk.34.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  309:             blk.34.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  310:        blk.34.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  311:           blk.34.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  312:           blk.34.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  313:             blk.34.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  314:          blk.34.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  315:           blk.34.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  316:             blk.35.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  317:             blk.35.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  318:             blk.35.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  319:        blk.35.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  320:           blk.35.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  321:           blk.35.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  322:             blk.35.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  323:          blk.35.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  324:           blk.35.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  325:             blk.36.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  326:             blk.36.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  327:             blk.36.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  328:        blk.36.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  329:           blk.36.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  330:           blk.36.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  331:             blk.36.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  332:          blk.36.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  333:           blk.36.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  334:             blk.37.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  335:             blk.37.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  336:             blk.37.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  337:        blk.37.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  338:           blk.37.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  339:           blk.37.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  340:             blk.37.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  341:          blk.37.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  342:           blk.37.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  343:             blk.38.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  344:             blk.38.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  345:             blk.38.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  346:        blk.38.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  347:           blk.38.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  348:           blk.38.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  349:             blk.38.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  350:          blk.38.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  351:           blk.38.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  352:             blk.39.attn_q.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  353:             blk.39.attn_k.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  354:             blk.39.attn_v.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  355:        blk.39.attn_output.weight f16      [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  356:           blk.39.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  357:           blk.39.ffn_down.weight f16      [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  358:             blk.39.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  359:          blk.39.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  360:           blk.39.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  361:               output_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  362:                    output.weight f16      [  5120, 32000,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:  282 tensors
llama_model_quantize_internal: meta size = 745344 bytes

**[   1/ 363]                    token_embd.weight - [ 5120, 32000,     1,     1], type =    f16, quantizing to q5_K .. zsh: illegal hardware instruction  ../llama.cpp/quantize ggml-model-f16.gguf test.gguf 17**
@ggerganov
Owner

ggerganov commented Nov 8, 2023

This started occurring on both my M1 Pro and M2 Ultra after updating to Sonoma.
It only occurs with K-quants and with -O3. It works with -O2, so a temporary workaround is to quantize with a build compiled at -O2.

I tried to debug this, but adding prints in the quantization functions makes the issue disappear. If anybody has any ideas on how to fix this, please share.

@joseph777111
Author

joseph777111 commented Nov 8, 2023

Thanks, @ggerganov, I was beginning to think I was the only one experiencing this. It's nice to know that you are aware of it, and have been working on finding a solution. And, thank you for the temporary workaround, I'll use this for now.

@TortoiseHam
Contributor

@ggerganov, running LLDB on the code catches a little more info about the crash:

lldb ./quantize
breakpoint set --file llama.cpp --line 7647
run /Users/skynet/Development/llama.cpp/models/llama-2-70b-chat.gguf Q4_K

llama_model_quantize_internal: meta size = 766752 bytes
[   1/ 723]                    token_embd.weight - [ 8192, 32000,     1,     1], type =    f16, quantizing to q4_K .. Process 62282 stopped
* thread #2, stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0x1e00a2a3)
    frame #0: 0x00000001000827bc quantize`quantize_row_q4_K_reference at ggml-quants.c:1277:13 [opt]
   1274	    float sum_w = weights[0];
   1275	    float sum_x = sum_w * x[0];
   1276	    for (int i = 1; i < n; ++i) {
-> 1277	        if (x[i] < min) min = x[i];
   1278	        if (x[i] > max) max = x[i];
   1279	        float w = weights[i];
   1280	        sum_w += w;
  thread #3, stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0x1e00a2a3)
    frame #0: 0x00000001000827bc quantize`quantize_row_q4_K_reference at ggml-quants.c:1277:13 [opt]
   1274	    float sum_w = weights[0];
   1275	    float sum_x = sum_w * x[0];
   1276	    for (int i = 1; i < n; ++i) {
-> 1277	        if (x[i] < min) min = x[i];
   1278	        if (x[i] > max) max = x[i];
   1279	        float w = weights[i];
   1280	        sum_w += w;
  thread #4, stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0x1e00a2a3)
    frame #0: 0x00000001000827bc quantize`quantize_row_q4_K_reference at ggml-quants.c:1277:13 [opt]
   1274	    float sum_w = weights[0];
   1275	    float sum_x = sum_w * x[0];
   1276	    for (int i = 1; i < n; ++i) {
-> 1277	        if (x[i] < min) min = x[i];
   1278	        if (x[i] > max) max = x[i];
   1279	        float w = weights[i];
   1280	        sum_w += w;
  thread #5, stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0x1e00a2a3)
    frame #0: 0x00000001000827bc quantize`quantize_row_q4_K_reference at ggml-quants.c:1277:13 [opt]
   1274	    float sum_w = weights[0];
   1275	    float sum_x = sum_w * x[0];
   1276	    for (int i = 1; i < n; ++i) {
-> 1277	        if (x[i] < min) min = x[i];
   1278	        if (x[i] > max) max = x[i];
   1279	        float w = weights[i];
   1280	        sum_w += w;
Target 0: (quantize) stopped.

Whereas if I do the same thing with a build compiled at -O2 instead, I get:

llama_model_quantize_internal: meta size = 766752 bytes
[   1/ 723]                    token_embd.weight - [ 8192, 32000,     1,     1], type =    f16, quantizing to q4_K .. Process 63231 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x000000010003894c quantize`::llama_model_quantize(const char *, const char *, const llama_model_quantize_params *) at llama.cpp:7647:13 [opt]
   7644	                workers.clear();
   7645	            }
   7646	
-> 7647	            LLAMA_LOG_INFO("size = %8.2f MB -> %8.2f MB | hist: ", ggml_nbytes(tensor)/1024.0/1024.0, new_size/1024.0/1024.0);
   7648	            int64_t tot_count = 0;
   7649	            for (size_t i = 0; i < hist_cur.size(); i++) {
   7650	                hist_all[i] += hist_cur[i];
Target 0: (quantize) stopped.

(So the code makes it out of the workers loop with -O2 but dies within that loop with -O3.)
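
For readers without the source open, here is a minimal, self-contained sketch of the loop shape shown at the crash site (ggml-quants.c lines 1274-1280 in the listing above). It is not the actual ggml function: the struct and function names are hypothetical, the initialization of `min` and `max` to `x[0]` is an assumption (the listing does not show it), and the `sum_x` accumulation inside the loop is likewise assumed from the variable's initialization.

```c
#include <assert.h>

/* Per-block statistics; the type and field names are illustrative only. */
typedef struct {
    float min;
    float max;
    float sum_w;
    float sum_x;
} block_stats;

/* Scans n values with per-element weights, following the loop visible
 * in the lldb listing. Assumptions: min/max start at x[0], and sum_x
 * accumulates w * x[i]; neither line appears in the listing itself. */
block_stats scan_block(int n, const float *x, const float *weights) {
    assert(n >= 1 && x != 0 && weights != 0);
    block_stats s;
    s.min = x[0];
    s.max = x[0];
    s.sum_w = weights[0];
    s.sum_x = s.sum_w * x[0];
    for (int i = 1; i < n; ++i) {
        if (x[i] < s.min) s.min = x[i];   /* the line lldb flags (1277) */
        if (x[i] > s.max) s.max = x[i];
        float w = weights[i];
        s.sum_w += w;
        s.sum_x += w * x[i];              /* assumed continuation */
    }
    return s;
}
```

Nothing in this loop is unusual C: it is a plain compare-and-select over floats, which is why a crash with EXC_BAD_INSTRUCTION here points at the generated machine code rather than at the source logic.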

@TortoiseHam
Contributor

Some variable info at the crash site:

(lldb) frame variable min
(float) min = -0.00257873535
(lldb) frame variable max
(float) max = 0.00245666504
(lldb) frame variable i
(int) i = 5
(lldb) frame variable x
(const float *__restrict) x = 0x0000000280010000
(lldb) frame variable x[5]
(const float) x[5] = 0.00506591797
(lldb) frame variable n
(int) n = 32

Also, it got lost when posting here, but in the terminal lldb underlines the first "x" in line 1277 when it reports the EXC_BAD_INSTRUCTION error.

@TortoiseHam
Contributor

Another potentially relevant detail. When building with -O2:

(lldb) breakpoint set --file ggml-quants.c --line 1277
Breakpoint 1: where = quantize`make_qkx2_quants + 120 at ggml-quants.c:1277:13, address = 0x00000001000738a8

Whereas when building with -O3:

(lldb) breakpoint set --file ggml-quants.c --line 1277
Breakpoint 1: 3 locations.

So the -O3 optimization level is doing some kind of unrolling or other replication that causes the problematic line to show up in 3 different places rather than 1.
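
To illustrate how one source line can map to several code locations, here is a hand-written sketch of unrolling applied to a min/max scan. It is purely illustrative: the function names are hypothetical, and the thread does not show which transformation the compiler actually applied (inlining into several callers would have a similar effect on breakpoint counts).

```c
#include <assert.h>

/* Straightforward form: each comparison appears once. */
void minmax_rolled(int n, const float *x, float *min, float *max) {
    assert(n >= 1);
    float lo = x[0], hi = x[0];
    for (int i = 1; i < n; ++i) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    *min = lo;
    *max = hi;
}

/* Unrolled by two with a remainder loop: the same comparison now
 * exists in three places, so a breakpoint on the original source
 * line would resolve to more than one address. */
void minmax_unrolled2(int n, const float *x, float *min, float *max) {
    assert(n >= 1);
    float lo = x[0], hi = x[0];
    int i = 1;
    for (; i + 1 < n; i += 2) {
        if (x[i] < lo) lo = x[i];          /* copy 1 */
        if (x[i] > hi) hi = x[i];
        if (x[i + 1] < lo) lo = x[i + 1];  /* copy 2 */
        if (x[i + 1] > hi) hi = x[i + 1];
    }
    for (; i < n; ++i) {                   /* remainder */
        if (x[i] < lo) lo = x[i];          /* copy 3 */
        if (x[i] > hi) hi = x[i];
    }
    *min = lo;
    *max = hi;
}
```

Both functions return the same result for any input; only the number of generated copies of the comparison differs.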

@TortoiseHam
Contributor

A PR that allows me to run quantization again while keeping -O3: #4052

@cebtenzzre
Collaborator

As mentioned in #4052, I can reproduce this with clang but not with Homebrew GCC. Oddly enough, it seems to be a linker issue. Here's the code before linking:

$ otool -tvVj build/CMakeFiles/ggml.dir/ggml-quants.c.o -p _quantize_row_q4_K_reference | grep -m1 -C3 'fcsel\ts5, s20, s5, gt'
000000000000464c	1e234e83	fcsel	s3, s20, s3, mi
0000000000004650	1e252280	fcmp	s20, s5
0000000000004654	bd00a3f4	str	s20, [sp, #0xa0]
0000000000004658	1e25ce85	fcsel	s5, s20, s5, gt
000000000000465c	bd409ff4	ldr	s20, [sp, #0x9c]
0000000000004660	1e232280	fcmp	s20, s3
0000000000004664	1e234e83	fcsel	s3, s20, s3, mi
0000000000004668	1e252280	fcmp	s20, s5
000000000000466c	1e25ce85	fcsel	s5, s20, s5, gt
0000000000004670	1e2322a0	fcmp	s21, s3
0000000000004674	1e234ea3	fcsel	s3, s21, s3, mi
0000000000004678	1e2522a0	fcmp	s21, s5

And here's the code after linking:

$ otool -tvVj build/bin/quantize -p _quantize_row_q4_K_reference | grep -m1 -C3 'fcsel\ts5, s20, s5, gt' 
0000000100080b84	1e234e83	fcsel	s3, s20, s3, mi
0000000100080b88	1e252280	fcmp	s20, s5
0000000100080b8c	bd00a3f4	str	s20, [sp, #0xa0]
0000000100080b90	1e25ce85	fcsel	s5, s20, s5, gt
0000000100080b94	bd409ff4	ldr	s20, [sp, #0x9c]
0000000100080b98	1e232280	fcmp	s20, s3
0000000100080b9c	1e234e83	fcsel	s3, s20, s3, mi
0000000100080ba0	1e252280	fcmp	s20, s5
0000000100080ba4	1e25ce85	fcsel	s5, s20, s5, gt
0000000100080ba8	1e2322a0	fcmp	s21, s3
0000000100080bac	1e0062a3	.long	0x1e0062a3
0000000100080bb0	1e2522a0	fcmp	s21, s5

Note how this line:

0000000000004674	1e234ea3	fcsel	s3, s21, s3, mi

has turned into this:

0000000100080bac	1e0062a3	.long	0x1e0062a3

This is my linker:

$ ld -v
@(#)PROGRAM:ld  PROJECT:dyld-1015.7
BUILD 18:48:48 Aug 22 2023
configured to support archs: armv6 armv7 armv7s arm64 arm64e arm64_32 i386 x86_64 x86_64h armv6m armv7k armv7m armv7em
will use ld-classic for: armv6 armv7 armv7s arm64_32 i386 armv6m armv7k armv7m armv7em
LTO support using: LLVM version 15.0.0 (static support for 29, runtime is 29)
TAPI support using: Apple TAPI version 15.0.0 (tapi-1500.0.12.3)
Library search paths:
Framework search paths:
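
The before/after instruction words in the listings above can be compared field by field. The sketch below splits a 32-bit word using the bit layout of the A64 scalar `fcsel` encoding as I understand it from the Arm reference (fixed top byte 0x1E, bit 21 set, bits 11:10 set, then Rm, cond, Rn, Rd); treat that layout, and all names in the code, as assumptions to check against the Arm documentation.

```c
#include <stdint.h>

/* Fields of a candidate FCSEL word; names are illustrative. */
typedef struct {
    unsigned ftype; /* bits 23:22, 0 = single precision */
    unsigned bit21; /* expected to be 1 for FCSEL */
    unsigned rm;    /* bits 20:16 */
    unsigned cond;  /* bits 15:12, e.g. 4 = mi, 12 = gt */
    unsigned op2;   /* bits 11:10, expected to be 3 for FCSEL */
    unsigned rn;    /* bits 9:5 */
    unsigned rd;    /* bits 4:0 */
} fcsel_fields;

/* Extracts the fields without judging whether the word is valid. */
fcsel_fields split_fields(uint32_t w) {
    fcsel_fields f;
    f.ftype = (w >> 22) & 0x3u;
    f.bit21 = (w >> 21) & 0x1u;
    f.rm    = (w >> 16) & 0x1Fu;
    f.cond  = (w >> 12) & 0xFu;
    f.op2   = (w >> 10) & 0x3u;
    f.rn    = (w >> 5)  & 0x1Fu;
    f.rd    = w & 0x1Fu;
    return f;
}

/* Checks only the fixed bits assumed for FCSEL:
 * bits 31:24 == 0x1E, bit 21 == 1, bits 11:10 == 0b11. */
int looks_like_fcsel(uint32_t w) {
    return (w & 0xFF200C00u) == 0x1E200C00u;
}
```

Under that layout, 0x1e234ea3 splits into Rm=3, cond=4 (mi), Rn=21, Rd=3, matching `fcsel s3, s21, s3, mi`. The post-link word 0x1e0062a3 keeps Rn=21 and Rd=3, but Rm, bit 21, cond, and bits 11:10 have all changed (the two words differ by XOR 0x00232c00), so it no longer matches the fixed bits. The lldb subcode from the earlier comment, 0x1e00a2a3, also keeps Rn=21 and Rd=3, assuming that subcode is the faulting instruction word.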

@cebtenzzre
Collaborator

Fixed in #4052
