[SYCL] remove global variables #7710

Merged
merged 10 commits into from
Jun 15, 2024

airMeng
Collaborator

@airMeng airMeng commented Jun 3, 2024

Following #7566

Remaining

  • leave the async copy to the common backend
  • move the GEMM code out of ggml-sycl.cpp into separate files for maintainability.
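
For readers unfamiliar with the pattern named in the PR title, the following is a minimal sketch of replacing file-scope globals with a context object owned by each backend instance. It is an illustration only: the names `device_context`, `scale_on_device`, and `backend` are hypothetical, are not taken from ggml-sycl.cpp, and no SYCL APIs are used.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-backend state. Before a refactor of this kind, fields
// like these would typically live in file-scope globals shared by every
// function in the translation unit.
struct device_context {
    int device_id = 0;         // which device this context targets
    std::size_t launches = 0;  // bookkeeping that used to be global
};

// Operations receive the context explicitly instead of reading globals.
inline void scale_on_device(device_context & ctx, std::vector<float> & data, float factor) {
    for (float & v : data) {
        v *= factor;           // stand-in for a real device kernel
    }
    ctx.launches += 1;
}

// Each backend instance owns its own context, so two instances cannot
// silently share or overwrite each other's state.
struct backend {
    device_context ctx;

    explicit backend(int device_id) { ctx.device_id = device_id; }

    void scale(std::vector<float> & data, float factor) {
        scale_on_device(ctx, data, factor);
    }
};
```

With this shape, creating `backend a(0)` and `backend b(1)` gives two independent contexts, which is the property that matters for multi-device setups.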

@airMeng airMeng marked this pull request as draft June 3, 2024 08:07
@airMeng airMeng added the refactoring, Intel GPU, and Review Complexity : High labels Jun 3, 2024
@github-actions github-actions bot added the build, ggml, and SYCL labels Jun 3, 2024
@AidanBeltonS
Contributor

Thanks for the refactoring! I'm aware this is a draft PR.
However, as an FYI, I have tested this branch and see some failures that are not present on the tip. I'm reporting them in case they are helpful in your work.

test-backend-ops on ARC A770 GPU

  DUP(type=f32,ne=[10,10,10,1]): GGML_ASSERT: /builds/perseus-performance-libraries/llama_ci/llama.cpp/ggml-sycl.cpp:12066: src0->backend == GGML_BACKEND_TYPE_GPU
/usr/bin/bash: line 193:  9395 Aborted                 (core dumped) ./bin/test-backend-ops -b SYCL0

llama-bench

$ ./bin/llama-bench -m /models/llama-2-13b-chat.Q4_0.gguf
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 3 SYCL devices:
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: yes
found 3 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0|       [cuda:gpu:0]|                  NVIDIA A100-PCIE-40GB|    8.0|    108|    1024|   32| 42298M|            CUDA 12.2|
| 1|     [opencl:cpu:0]|     Intel Xeon Gold 6326 CPU @ 2.90GHz|    3.0|     32|    8192|   64| 67116M|2024.17.3.0.08_160000|
| 2|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     32|67108864|   64| 67116M|2024.17.3.0.08_160000|
GGML_ASSERT: /llama.cpp/ggml-sycl.cpp:13216: tensor->backend == GGML_BACKEND_TYPE_GPU
/usr/bin/bash: line 161:   230 Aborted                 (core dumped) ./bin/llama-bench -m /models/llama-2-13b-chat.Q4_0.gguf

@airMeng
Collaborator Author

airMeng commented Jun 3, 2024

@AidanBeltonS Thank you for the reminder! I am aware that the current interaction between SYCL and the common backend is not perfect; for now, you can review the rough design.

Contributor

github-actions bot commented Jun 3, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 542 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8624.57ms p(95)=21737.71ms fails=, finish reason: stop=483 truncated=59
  • Prompt processing (pp): avg=97.04tk/s p(95)=464.26tk/s
  • Token generation (tg): avg=32.97tk/s p(95)=48.81tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=sycl-remove-global-variables commit=f32f17a781e908df90e184df08183a936139fcbf

[Charts omitted: four time-series plots titled "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 542 iterations", one each for llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing, over the x-axis range 1717592545 --> 1717593165.]

@airMeng airMeng force-pushed the sycl-remove-global-variables branch from b49f1c0 to 8dfc5a7 Compare June 5, 2024 12:32
@airMeng airMeng marked this pull request as ready for review June 5, 2024 12:36
@airMeng airMeng force-pushed the sycl-remove-global-variables branch from cc8c48b to f32f17a Compare June 5, 2024 12:41
@airMeng
Collaborator Author

airMeng commented Jun 5, 2024

@AidanBeltonS @NeoZhangJianyu main.exe works now. However, there seems to be a huge performance regression, which I am working on. You can review the design part now.

@NeoZhangJianyu
Collaborator

@AidanBeltonS @NeoZhangJianyu main.exe works now. However, there seems to be a huge performance regression, which I am working on. You can review the design part now.

Yes.
I think this PR should only reorganize the code. It shouldn't affect performance.

Contributor

@AidanBeltonS AidanBeltonS left a comment

Thanks for the changes! This is a big but necessary update. I have some minor comments.

@airMeng airMeng force-pushed the sycl-remove-global-variables branch from f32f17a to d342abc Compare June 13, 2024 03:49
@NeoZhangJianyu
Collaborator

It still crashes with multiple GPUs:

@airMeng
Collaborator Author

airMeng commented Jun 14, 2024

@AidanBeltonS @NeoZhangJianyu This PR fixes SYCL, which has been broken since #7640 (comment), and I believe it solves #7777 and related issues. Please give it a try.

Known issue: multi-card support is still broken.


@AidanBeltonS
Contributor

@AidanBeltonS @NeoZhangJianyu This PR fixes SYCL, which has been broken since #7640 (comment), and I believe it solves #7777 and related issues. Please give it a try.

I have tested on the A100 GPU and can confirm this fixes #7777.

@airMeng airMeng merged commit 7b2f4a7 into master Jun 15, 2024
68 checks passed
@airMeng airMeng deleted the sycl-remove-global-variables branch June 15, 2024 06:05
@airMeng airMeng mentioned this pull request Jun 18, 2024
2 tasks