
ggml-cpu : split arch-specific implementations #13892


Merged
merged 56 commits into ggml-org:master on Jun 9, 2025

Conversation

xctan
Collaborator

@xctan xctan commented May 29, 2025

This PR reworks the arch-specific code organization, following the discussion in #13720. Key changes include splitting the former ggml-cpu-quants.c and ggml-cpu-aarch64.cpp into several more focused files. Additionally, aarch64-related naming and identifiers, originally specific to AArch64 but now applicable to other architectures, have been updated to use repack to improve clarity and avoid confusion.

@xctan
Collaborator Author

xctan commented May 29, 2025

I'll need a bit more time to get this PR rebased onto the latest master branch.

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) May 29, 2025
@xctan xctan marked this pull request as ready for review May 30, 2025 21:06
@xctan
Collaborator Author

xctan commented May 31, 2025

@ggerganov @slaren
Now that all CI checks have passed, could you please review this PR?

@ggerganov ggerganov self-requested a review June 4, 2025 12:19
Comment on lines 509 to 530

#define GGML_DO_PRAGMA_(x) _Pragma (#x)
#define GGML_DO_PRAGMA(x) GGML_DO_PRAGMA_(x)
#if defined(GGML_CPU_GENERIC) || defined(__HIPCC__)
// weak alias not working
# define GGML_WEAK_ALIAS(name, alias)
#elif defined(__GNUC__)
// GCC/Clang on *nix
# define GGML_WEAK_ALIAS(name, alias) GGML_DO_PRAGMA(weak name = alias) // NOLINT
#elif defined(_MSC_VER) && defined (_WIN64)
// MSVC
// Note: C name mangling varies across different calling conventions
// see https://learn.microsoft.com/en-us/cpp/build/reference/decorated-names?view=msvc-170
# define GGML_WEAK_ALIAS(name, alias) GGML_DO_PRAGMA(comment(linker, "/alternatename:" #name "=" #alias))
#else
# error "Unsupported compiler for GGML_WEAK_ALIAS"
#endif

#define GGML_CPU_NATIVE_IMPL(name) GGML_WEAK_ALIAS(name, name ## _generic)
Member

"Generic" implementations such as ggml_vec_dot_q4_0_q8_0_generic are always defined, regardless of the build configuration. And by making these definitions "weak", we allow them to be overridden by arch-specific implementations.

Is my understanding of this logic correct?

Collaborator Author

Yes, on supported targets. The linker will first attempt to resolve the arch-specific symbols that are implemented (like ggml_vec_dot_q4_0_q8_0). If an optimized version is not found, it automatically falls back to the _generic variant. I believe this approach offers improved maintainability at the cost of a slight increase in binary size due to these shadowed fallback implementations.
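
To make the mechanism concrete, here is a minimal, self-contained sketch of the weak-alias fallback for GCC/Clang on ELF targets. The function names and the file split are illustrative only, not the exact layout of this PR:

```c
// quants_generic.c - always compiled; provides the scalar fallback.
#include <stdio.h>

// Hypothetical kernel name used only for this illustration.
void my_vec_dot_generic(void) {
    printf("generic scalar path\n");
}

// Make my_vec_dot a weak alias of the generic implementation
// (this mirrors what GGML_WEAK_ALIAS expands to on GCC/Clang).
#pragma weak my_vec_dot = my_vec_dot_generic
void my_vec_dot(void);

// quants_arm.c (only built on AArch64) would provide a strong definition,
// which the linker prefers over the weak alias:
//     void my_vec_dot(void) { /* NEON implementation */ }

int main(void) {
    my_vec_dot();   // prints "generic scalar path" unless an arch object overrides it
    return 0;
}
```

On MSVC a similar effect is obtained through the /alternatename linker directive shown in the macro above.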

Comment on lines +667 to +687
#endif
for (; ib < nb; ++ib) {
int sumi0 = 0;
int sumi1 = 0;

for (int j = 0; j < qk/2; ++j) {
const int v0 = (x[ib].qs[j] & 0x0F) - 8;
const int v1 = (x[ib].qs[j] >> 4) - 8;

sumi0 += (v0 * y[ib].qs[j]);
sumi1 += (v1 * y[ib].qs[j + qk/2]);
}

int sumi = sumi0 + sumi1;
sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
}

*s = sumf;
}
Member

Currently, each arch re-implements the generic version. For a refactoring this large, that is probably the correct approach in order to minimize the risk of introducing bugs. But after we merge, should we consider ways to de-duplicate the scalar implementations by reusing the generic calls?

Collaborator Author

That should be feasible, though it's not as straightforward for tail elements like the ones in this function. Let's keep the deduplication work for another PR to make future bisecting easier.
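
For reference, one possible shape such a de-duplication could take, purely as a sketch under assumed names (block_q4_0 / block_q8_0 stand in for the structs used by ggml-cpu, and this is not necessarily how a follow-up PR would do it): factor the scalar tail into a helper that starts at an arbitrary block index, so both the generic kernel (starting at 0) and a SIMD kernel (starting after its vectorized blocks) can reuse it.

```c
// Sketch only: a shared scalar tail for the Q4_0 x Q8_0 dot product.
// block_q4_0 / block_q8_0 and GGML_FP16_TO_FP32 mirror the types/macros in
// ggml-cpu; the helper name is hypothetical.
static float vec_dot_q4_0_q8_0_tail(int ib, int nb, int qk,
                                    const block_q4_0 * x, const block_q8_0 * y) {
    float sumf = 0.0f;
    for (; ib < nb; ++ib) {
        int sumi0 = 0;
        int sumi1 = 0;
        for (int j = 0; j < qk/2; ++j) {
            const int v0 = (x[ib].qs[j] & 0x0F) - 8;   // low nibble
            const int v1 = (x[ib].qs[j] >>   4) - 8;   // high nibble
            sumi0 += v0 * y[ib].qs[j];
            sumi1 += v1 * y[ib].qs[j + qk/2];
        }
        sumf += (sumi0 + sumi1) * GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d);
    }
    return sumf;
}

// The generic kernel would call it with ib = 0, while an arch-specific kernel
// would call it only for the blocks left over after its vectorized main loop:
//     sumf += vec_dot_q4_0_q8_0_tail(ib, nb, qk, x, y);
```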

@github-actions github-actions bot added the Nvidia GPU and SYCL labels Jun 4, 2025
@ggerganov
Member

Noting that #12995, #13966 and #13996 will need to be reapplied - sorry about the conflict, I should have been more careful and waited before merging those. Let's first finish the review and then resolve the conflicts.

@ggerganov ggerganov requested a review from slaren June 6, 2025 14:38
@slaren slaren merged commit f470bc3 into ggml-org:master Jun 9, 2025
45 checks passed
@zhouwg
Contributor

zhouwg commented Jun 10, 2025

@xctan, nice to meet you and sorry to bother you. I'm a fan of yours. I know you are a highly skilled student at China's No. 1 university and have made many excellent contributions to llama.cpp and other very challenging open-source projects (especially in the RISC-V community).

Here I have a minor question:

  1. The performance of an fp32 4096x4096 addition via the default ggml backend on a Qualcomm Snapdragon 8 Elite based phone (e.g. Xiaomi 15) is 11 milliseconds after I rebased onto the latest llama.cpp sources:

     [Screenshot from 2025-06-10 10-39-09]

  2. The previous benchmark figure for the default ggml backend was about 7-8 milliseconds; now it is 11 milliseconds.

     The previous benchmark figure for ggml-hexagon (via cDSP) was about 15 milliseconds; now it is 6-7 milliseconds.

     [Screenshots from 2025-06-10 12-41-51 through 12-43-51]

  3. How to reproduce: please refer to ggml-hexagon v0.97 (v20250609), zhouwg/ggml-hexagon#46.

  4. Currently I don't know the reason (although 11 milliseconds is obviously good news for my candidate PR). I carefully checked this commit and I don't think it is the cause. Could you help figure out the reason if you have time?

Thanks very much.

@l15y

l15y commented Jun 10, 2025

(quoting @zhouwg's benchmark comment above in full)

Encountered a similar issue: the speed on an FT-2000 is roughly halved.

@ggerganov
Member

If you have noticed a regression, please open a new issue with repro steps.

@zhouwg
Contributor

zhouwg commented Jun 10, 2025

(quoting @zhouwg's comment and @l15y's reply above)

Hello, nice to meet you. (1) I'm not an AI expert and have made a stupid mistake before because of my limited AI knowledge. (2) I'm not familiar with the FT-2000 (China's self-developed desktop/server SoC) and don't have such a device. Should this new issue be filed by you or by me?

Thanks.

@l15y

l15y commented Jun 10, 2025

(quoting the full exchange above)

Hi.
Please feel free to file the new issue yourself; the generation performance of the FT-2000 is not important to me. I only ran the test casually, and it is an outdated chip that does not represent mainstream ARM performance.

@xctan
Collaborator Author

xctan commented Jun 10, 2025

Encountered a similar issue: the speed on an FT-2000 is roughly halved.

Can you provide the dynamic symbol table of your build (e.g. the output of objdump -T bin/libggml-cpu.so) when submitting your issue? Also, performance may vary between runs, and I haven't observed a steady, significant regression on my side.

@zhouwg
Contributor

zhouwg commented Jun 10, 2025

(quoting the full exchange above)

Thanks for your reply.

  1. As I said before, I don't think that weird problem is caused by this commit; I have carefully checked the code changes in it.
     [Screenshots from 2025-06-10 18-03-59 and 17-59-13]

  2. This very complicated commit comes from a highly skilled developer with a strong reputation (good at both system software development and AI tech).

  3. I don't want to make another stupid mistake, although the performance of the default ggml backend on aarch64 is important to my candidate PR.

@barracuda156
Contributor

This has broken the build on PowerPC: #14138

@zhouwg
Contributor

zhouwg commented Jun 12, 2025

  1. Firstly, I'd like to emphasize that I still don't think this commit introduces a performance issue (if there is a real performance regression in llama.cpp since June 1, 2025, I think the problem might be elsewhere).
  2. Secondly, I'd like to re-post an interesting comment from my forked llama.cpp project: case-study: NPU performance comparison of GGML_OP_ADD between HWACCEL_CDSP and HWACCEL_QNN, zhouwg/ggml-hexagon#34 (comment)
  • llama-bench data on the default branch of that project (ggml-backend.cpp v1.10 + ggmldsp v0.97 + other scripts and code changes in llama.cpp), with the default ggml backend on Snapdragon 8 Elite:
    [Screenshots from 2025-06-12 20-46-16 and 21-09-10]

  • llama-bench data on the branch self-build-backup-20250531 of that project (ggml-backend.cpp v1.10 + ggmldsp v0.97 + other scripts and code changes in llama.cpp, 100% the same as the default branch, backported by me manually and carefully), with the default ggml backend on Snapdragon 8 Elite:
    [Screenshots from 2025-06-12 21-48-27 and 22-26-55]

I hope this weird problem will simply disappear tomorrow.

@whr819987540, sorry to bother you; please take a look if you have time, thanks!

LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Jun 13, 2025
there is so much duplicate code in each cpu arch, i expect upstream will prune it eventually
arch detection has no fallback if none of the arches are found; by rights we should set GGML_CPU_GENERIC (a rough sketch of the idea is below)
i should be relaxing, it's the weekend
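
For illustration only, a preprocessor-style sketch of the fallback idea mentioned in the second note above; the real arch detection in ggml/koboldcpp is largely driven by the build system, and the macro list here is an assumption, not the project's actual code:

```c
// Hypothetical sketch: default to the generic CPU path when no known
// architecture macro matched (GGML_CPU_GENERIC is the real ggml macro;
// the arch macro list is illustrative, not exhaustive).
#if !defined(GGML_CPU_GENERIC) && \
    !defined(__x86_64__)    && !defined(__aarch64__) && \
    !defined(__riscv)       && !defined(__powerpc64__) && \
    !defined(__s390x__)
#   define GGML_CPU_GENERIC
#endif
```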
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jun 15, 2025
Labels
ggml: changes relating to the ggml tensor library for machine learning
Nvidia GPU: Issues specific to Nvidia GPUs
SYCL: https://en.wikipedia.org/wiki/SYCL - GPU programming language
6 participants