
Bug: ggml-aarch64.c does not compile on Windows ARM64 with MSVC #8446

Closed
AndreasKunar opened this issue Jul 12, 2024 · 12 comments · Fixed by #8531

Labels
bug-unconfirmed medium severity Used to report medium severity bugs in llama.cpp (e.g. Malfunctioning Features but still usable)

Comments

AndreasKunar (Contributor) commented Jul 12, 2024

What happened?

The asm directive in line 507 of ggml-aarch64.c is guarded by defined(__ARM_NEON) && defined(__aarch64__), which is true for MSVC arm64 with the standard build scripts, but the inline asm does not compile with MSVC.

The other asm directives are not a problem, except for the one in line 1278; they appear not to be used.

Building with clang works; the problem is MSVC-specific.
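
For illustration, the failing pattern looks roughly like this (a schematic only; the real kernel bodies are long hand-written aarch64 assembly, elided here):

    /* schematic of the guard around the inline-asm kernels in ggml-aarch64.c */
    #if defined(__ARM_NEON) && defined(__aarch64__)
        __asm__ __volatile__(
            "..."        /* hand-written aarch64 kernel (elided) */
            : /* outputs */
            : /* inputs */
            : /* clobbers */);
    #endif

MSVC's C compiler does not accept this GCC-style __asm__ syntax on arm64, so compilation stops here.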

Name and Version

llama.cpp version: 3378 (71c1121), Microsoft C/C++ Version 19.40.33812 for ARM64, Windows 11 Enterprise 24H2 Build 26100.1150

What operating system are you seeing the problem on?

Windows

Relevant log output

Build instructions which fail:
cmake -B build
cmake --build build --config Release --target llama-cli

The CMake build also fails with arm64-windows-msvc.cmake, but works with arm64-windows-llvm.cmake (compiling with clang instead of the MSVC frontend works).

The build worked a few days ago with older llama.cpp versions.
AndreasKunar added the bug-unconfirmed and medium severity labels on Jul 12, 2024
ggerganov (Owner) commented:
@Dibakar Could you suggest a fix?

AndreasKunar (Contributor, Author) commented Jul 12, 2024

> @Dibakar Could you suggest a fix?

MSVC does not support inline asm for x64 or arm64.
A quick fix for me was to append "&& !defined(_MSC_VER)" to the conditionals in lines 502 and 1272 of ggml-aarch64.c. Then it compiles and llama-cli seems to work. I will benchmark to see if there is a performance difference between clang and MSVC.
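
A minimal sketch of that guard, assuming the surrounding code already provides a non-asm fallback branch (which the successful llama-cli run suggests):

    /* exclude MSVC, which rejects GCC-style inline asm on arm64 */
    #if defined(__ARM_NEON) && defined(__aarch64__) && !defined(_MSC_VER)
        /* GCC/Clang inline-asm kernel */
    #else
        /* existing portable / intrinsics fallback path */
    #endif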

AndreasKunar (Contributor, Author) commented:

FYI, with llama-bench on llama-2-7b-chat.Q4_0.gguf there was no measurable performance difference between the clang-compiled and MSVC-compiled (plus my quick fix) variants.

Dibakar (Contributor) commented Jul 12, 2024

For clang, to use the aarch64 optimized kernels for Q4_0 from ggml-aarch64.c, we need to first convert (requantize) Q4_0.gguf to Q4_0_4_8.gguf (or Q4_0_4_4.gguf) before using the CLI. I am not sure if this is what you wanted. Otherwise, the use of llama-2-7b-chat.Q4_0.gguf (Q4_0 type) will not touch the code from ggml-aarch64.c and performance will remain unchanged.

Below is the command line to requantize a Q4_0.gguf first. This enables the Q4_0_4_8 quant type to use the Q4_0_4_8 assembly kernel from ggml-aarch64.c. The Q4_0_4_8 quant type is fine if the Windows hardware supports both ARM_NEON and ARM_MATMUL_INT8. Otherwise, the Q4_0_4_4 type should be used, which only needs ARM_NEON support. If you need to use Q4_0_4_4, then build using cmake -B build -DGGML_LLAMAFILE=OFF.

./llama-quantize --allow-requantize llama-2-7b-chat.Q4_0.gguf llama-2-7b-chat.Q4_0_4_8.gguf Q4_0_4_8
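
As an illustrative sketch of how the compile-time feature macros map to the quant types (the ACLE macro names below are standard, but whether a given toolchain defines them depends on the targeted architecture; this is not the actual ggml-aarch64.c code):

    /* __ARM_FEATURE_MATMUL_INT8 is the ACLE macro for i8mm (smmla),
       __ARM_NEON for Advanced SIMD */
    #if defined(__ARM_FEATURE_MATMUL_INT8)
        /* NEON + int8 matmul available: the Q4_0_4_8 kernels are usable */
    #elif defined(__ARM_NEON)
        /* NEON only: use the Q4_0_4_4 kernels */
    #endif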

AndreasKunar (Contributor, Author) commented:

> For clang, to use the aarch64 optimized kernels for Q4_0 from ggml-aarch64.c, we need to first convert (requantize) Q4_0.gguf to Q4_0_4_8.gguf (or Q4_0_4_4.gguf) before using the CLI. I am not sure if this is what you wanted. Otherwise, the use of llama-2-7b-chat.Q4_0.gguf (Q4_0 type) will not touch the code from ggml-aarch64.c and performance will remain unchanged.
>
> Below is the command line to requantize a Q4_0.gguf first. […]
>
> ./llama-quantize --allow-requantize llama-2-7b-chat.Q4_0.gguf llama-2-7b-chat.Q4_0_4_8.gguf Q4_0_4_8

@Dibakar I just wanted to compile with MSVC arm64, and your PR now breaks all the llama.cpp build scripts with MSVC arm64, because you did not exclude MSVC before inlining asm. I suggest updating your ggml-aarch64.c by appending "&& !defined(_MSC_VER)" to the conditionals in lines 502 and 1272.

Thanks a lot for the clang optimization clarification; I will try it later to see if it improves Snapdragon X Windows performance.

AndreasKunar (Contributor, Author) commented:

@Dibakar - any progress on this (the bug is still "unconfirmed")? ggml-aarch64.c breaks all llama.cpp Windows-on-Arm cmake builds with MSVC, and there is an easy fix, e.g. a conditional-compile guard excluding MSVC from all asm (see above).

ggerganov (Owner) commented:

> there is an easy fix, e.g. a conditional-compile guard excluding MSVC from all asm (see above).

It would be better to find a way to inline the assembly code with MSVC instead of disabling it altogether.

AndreasKunar (Contributor, Author) commented:

> > there is an easy fix, e.g. a conditional-compile guard excluding MSVC from all asm (see above).
>
> It would be better to find a way to inline the assembly code with MSVC instead of disabling it altogether.

@ggerganov - re inline assembly: MSVC only supports inline assembly with 32-bit x86. For arm64, we would need to re-code all asm into intrinsics for MSVC. These intrinsics are beyond my capabilities, and I don't think they cover e.g. the .inst binary instruction codes in ggml-aarch64.c line 426... For MSVC builds it would need direct assembly-code sources.
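
For what it's worth, parts of the asm could in principle be expressed through ACLE intrinsics; e.g. the smmla instruction is exposed as vmmlaq_s32 in arm_neon.h, although whether MSVC's arm64 compiler handles that intrinsic is unverified, and the raw .inst encodings have no intrinsic equivalent. A sketch only:

    #include <arm_neon.h>

    /* sketch: i8mm matrix multiply-accumulate via the ACLE intrinsic
       instead of inline asm; MSVC arm64 support for this is unverified */
    #if defined(__ARM_FEATURE_MATMUL_INT8)
    static inline int32x4_t mmla_i8(int32x4_t acc, int8x16_t a, int8x16_t b) {
        return vmmlaq_s32(acc, a, b);   /* SMMLA Vd.4S, Vn.16B, Vm.16B */
    }
    #endif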

I also could not yet get the Q4_0_4_4 and Q4_0_4_8 requantized llama-2 models to work on the Snapdragon X with the optimized code, neither with Windows/clang nor Linux/gcc; both build OK. llama-cli with them is 10x faster but generates totally wrong tokens/gibberish or throws an error. Maybe my requantization was bad, I made some other build error, or the Snapdragon X does not support some of the features the llama.cpp common cmake scripts (arm64-windows-llvm.cmake, ...) indicate. I also could not yet get it to work on macOS on my M2. So I'm still looking into first getting it to work reliably on macOS/M2, then on clang/Windows/Snapdragon X and gcc/WSL2+Ubuntu 24.04/Snapdragon X. I will raise an issue if I cannot get it to run reliably on the Snapdragon X after it works for me on the M2.

I strongly suggest for now to just disable the asm generation with MSVC on arm in ggml-aarch64.c and either fall back to the slower code or raise an error with MSVC on arm. I still can't verify whether the optimized code works correctly on the Snapdragon X, but at least it does not break the compile/cmake builds with clang/Windows or gcc/Linux.

max-krasnyansky (Collaborator) commented Jul 17, 2024

Just wanted to mention that our CI didn't catch this issue because something changed in the build environment of the Windows Action Runners.
The MSVC build job now picks up GCC:
https://github.com/ggerganov/llama.cpp/actions/runs/9973317856/job/27558398158
-- Check for working C compiler: C:/mingw64/bin/gcc.exe
That wasn't the case at the time of my original PR to enable optimized builds for the X-Elite.
I'm looking into this.

On the other hand, the LLVM builds are perfectly fine:
https://github.com/ggerganov/llama.cpp/actions/runs/9973317856/job/27558397938
There are some warnings about asm string literals, but they are mostly benign.

The LLVM builds generate proper output for the Q4_0 and Q4_0_4_8 models.
The X-Elite does support all required instructions for that, and the CMake presets/toolchain are correct.

hmartinez82 commented Jul 19, 2024

> > For clang, to use the aarch64 optimized kernels for Q4_0 from ggml-aarch64.c, we need to first convert (requantize) Q4_0.gguf to Q4_0_4_8.gguf (or Q4_0_4_4.gguf) before using the CLI. […]
>
> @Dibakar I just wanted to compile with MSVC arm64, and your PR now breaks all the llama.cpp build scripts with MSVC arm64, because you did not exclude MSVC before inlining asm. […]

@AndreasKunar Before this ASM break, I used to build with both MSVC and Clang. When building with Clang targeting armv8.7-a, it obliterates the performance of the MSVC build because, for certain models, it uses the I8MM CPU feature (i.e. MATMUL). Running llama 2, 3, and Gemma is so fast that I couldn't even believe it's running on a CPU! I think the MSVC build is ~6 tk/s and the Clang build (for armv8.7-a) is ~17 tk/s on the smaller models running on a Snapdragon X Elite.

AndreasKunar (Contributor, Author) commented Jul 19, 2024

> @AndreasKunar Before this ASM break, I used to build with both MSVC and Clang. When building with Clang targeting armv8.7-a, it obliterates the performance of the MSVC build because, for certain models, it uses the I8MM CPU feature (i.e. MATMUL). […] I think the MSVC build is ~6 tk/s and the Clang build (for armv8.7-a) is ~17 tk/s on the smaller models running on a Snapdragon X Elite.

@hmartinez82 - I did PR #8477 for this, which only touches the ggml-aarch64.c conditional compilation (it keeps v8.7) and the documentation. ggml-aarch64.c should never have compiled with MSVC on the Snapdragon X.

max-krasnyansky (Collaborator) commented:

> @AndreasKunar Before this ASM break, I used to build with both MSVC and Clang. When building with Clang targeting armv8.7-a, it obliterates the performance of the MSVC build […]

Yep. Known issue. We added MSVC ARM64 builds for completeness.
LLVM/Clang (either standalone or the one that comes with Visual Studio 2022) should be used for performance and is generally a better default these days.
