Bug: ggml-aarch64.c does not compile on Windows ARM64 with MSVC #8446
Comments
@Dibakar Could you suggest a fix?
MSVC does not support any asm for x64 or arm64.
FYI - with llama-bench for llama-2-7b-chat.Q4_0.gguf there was no measurable performance difference between the clang and MSVC (+my quick-fix) compiled variants.
For clang, to use the aarch64-optimized kernels for Q4_0 from ggml-aarch64.c, we need to first convert (requantize) Q4_0.gguf to Q4_0_4_8.gguf (or Q4_0_4_4.gguf) before running the CLI. I am not sure if this is what you wanted. Otherwise, using llama-2-7b-chat.Q4_0.gguf (Q4_0 type) will not touch the code from ggml-aarch64.c and performance will remain unchanged. Below is the command line to requantize a Q4_0.gguf first; this enables the Q4_0_4_8 quant type to use the Q4_0_4_8 assembly kernel from ggml-aarch64.c: `./llama-quantize --allow-requantize llama-2-7b-chat.Q4_0.gguf llama-2-7b-chat.Q4_0_4_8.gguf Q4_0_4_8`. The Q4_0_4_8 quant type is fine if the Windows hardware supports both ARM_NEON and ARM_MATMUL_INT8. Otherwise, the Q4_0_4_4 type should be used, which only needs ARM_NEON support.
@Dibakar I just wanted to compile with MSVC arm64, and your PR now breaks all the llama.cpp build scripts with MSVC arm64, because you did not exclude MSVC before inlining asm. I suggest updating your ggml-aarch64.c by appending `&& !defined(_MSC_VER)` to the conditionals in lines 502 and 1272. Thanks a lot for the clang optimization clarification; I will try it later to see if it improves Snapdragon X Windows performance.
@Dibakar - any progress on this (the bug is still "unconfirmed")? This (ggml-aarch64.c) breaks all llama.cpp Windows on Arm cmake builds with MSVC, and there is an easy fix, e.g. a conditional-compile guard excluding MSVC from all `__asm__` (see above).
It would be better to find a way to inline assembly code with MSVC instead of disabling it altogether
@ggerganov - re inline assembly: MSVC only supports inline assembly with 32-bit x86. For arm64, we would need to re-code all `__asm__` into intrinsics for MSVC. These intrinsics are beyond my capabilities, and I don't think they cover e.g. the `.inst` binary instruction codes in ggml-aarch64.c line 426... For MSVC builds it would need direct assembly-code sources.

I also could not yet get the Q4_0_4_4 and Q4_0_4_8 requantized llama-2 models to work on Snapdragon X with the optimized code, neither with Windows/clang nor Linux/gcc - both build OK. llama-cli with them is 10x faster but generates totally wrong tokens/gibberish or throws an error. Maybe my requantization was bad, or I made some other build error, or the Snapdragon X does not support some of the features the llama.cpp common cmake scripts (arm64-windows-llvm.cmake, ...) indicate. I also could not yet get it to work on macOS on my M2. So I'm still looking into first getting it to work reliably on macOS/M2, then on clang/Windows/SnapdragonX and gcc/WSL2+Ubuntu24.04/SnapdragonX. I will raise an issue if I cannot get it to run reliably on Snapdragon X after it works for me on the M2.

I strongly suggest for now to just disable the `__asm__` generation with MSVC on arm for ggml-aarch64.c and fall back on the slower code, or error with MSVC on arm. I still can't verify whether the optimized code works correctly on the Snapdragon X, but disabling it at least does not break the compile/cmake builds with clang/Windows or gcc/Linux.
Just wanted to mention that our CI didn't catch this issue because something changed in the build environment of the Windows Action Runners. On the other hand, the LLVM builds are perfectly fine; they generate proper output for the Q4_0 and Q4_0_4_8 models.
@AndreasKunar Before this ASM break I used to build with both MSVC and Clang. When building with Clang targeting armv8.7-a, it obliterates the performance of the MSVC build because, for certain models, it uses the I8MM CPU feature (i.e. MATMUL). Running Llama 2, Llama 3 and Gemma is so fast that I couldn't even believe it's running on a CPU! I think the MSVC build is ~6 tk/s and the Clang build (for armv8.7-a) is ~17 tk/s on the lower-size models running on a Snapdragon X Elite.
@hmartinez82 - I did a PR #8477 for this, which only impacts ggml-aarch64.c conditional compilation (it keeps the v8.7 target) and documentation. ggml-aarch64.c should never have compiled on MSVC on Snapdragon X.
Yep. Known issue. We added MSVC ARM64 builds for completeness.
What happened?
The `__asm__` directive in line 507 of ggml-aarch64.c is guarded by `defined(__ARM_NEON) && defined(__aarch64__)`, which is true for MSVC arm64 with the standard build scripts, but the directive does not compile with MSVC.
The other `__asm__` directives are fine, except for the one in line 1278; they appear not to be used.
Building with clang works; the problem is MSVC-specific.
Name and Version
llama.cpp version: 3378 (71c1121), Microsoft C/C++ Version 19.40.33812 for ARM64, Windows 11 Enterprise 24H2 Build 26100.1150
What operating system are you seeing the problem on?
Windows
Relevant log output