[SYCL] Add SYCL Backend Support for Intel GPUs #330

Merged 7 commits on Aug 10, 2024

@zhentaoyu
Contributor

zhentaoyu commented Aug 5, 2024

What Does This PR Do

#308
Hi, this PR adds initial SYCL backend support for Intel GPUs (Arc A770, etc.). We would be happy to hear feedback from the stable-diffusion.cpp community. If you have any comments or questions, please feel free to leave them in this PR.
cc @airMeng, @luoyu-intel, @hshen14

TODO or Issues

Test Results of SYCL

build command

cmake .. -DSD_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
doc link is here

test machine

| ID | Device Type | Name | Version | Max compute units | Max work group | Max sub group | Global mem size | Driver version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | [level_zero:gpu:0] | Intel Data Center GPU Max 1100 | 1.3 | 448 | 1024 | 32 | 51539M | 1.3.27191 |

text2img

  • sd v1-4
    ./build/bin/sd -m ../sd_models/sd-v1-4.ckpt -p "a lovely cat" -o "output_sycl_sd1-4.png"
    output: [INFO ] stable-diffusion.cpp:1328 - txt2img completed in 8.35s

image

  • sd v1-5
    ./build/bin/sd -m ../sd_models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" -o "output_sycl_sd1-5.png"
    output: [INFO ] stable-diffusion.cpp:1328 - txt2img completed in 8.44s

image

  • sd v2
    ./build/bin/sd -m ../sd_models/v2-1_768-nonema-pruned.safetensors -p "a lovely cat" -o "output_sycl_sd2.png"
    output: [INFO ] stable-diffusion.cpp:1328 - txt2img completed in 6.04s
    image

  • sd xl
    ./build/bin/sd -m ../sd_models/sdxl/sd_xl_base_1.0.safetensors --vae ../sd_models/sdxl/sdxl_vae.safetensors -H 512 -W 512 -p "a lovely cat" -o "output_sycl_sdxl.png" --seed 16
    output: [INFO ] stable-diffusion.cpp:1328 - txt2img completed in 12.46s
    image

  • sd v3
    ./build/bin/sd -m ../sd_models/sd3_medium_incl_clips_t5xxlfp16.safetensors -H 512 -W 512 -p "a lovely cat holding a sign says \"Stable Diffusion CPP\"" --cfg-scale 4.5 --sampling-method euler -v -o "output_sycl_sd3.png"
    output: [INFO ] stable-diffusion.cpp:1328 - txt2img completed in 13.25s
    image

quantization type

command: ./build/bin/sd -m ../sd_models/sd3_medium_incl_clips_t5xxlfp16.safetensors -H 512 -W 512 -p "a lovely cat holding a sign says \"Stable Diffusion CPP\"" --cfg-scale 4.5 --sampling-method euler -v -o "output_sycl_sd3_q8_0.png" --type q8_0

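| Type | f16 | q8_0 | q5_0 | q5_1 | q4_0 | q4_1 |
| --- | --- | --- | --- | --- | --- | --- |
| Time | 13.25s | 12.84s | 12.88s | 12.91s | 12.86s | 12.99s |
| Memory | 14829.53MB | 8030.57MB | 5310.99MB | 5764.25MB | 4404.46MB | 4857.73MB |
| Output | image | image | image | image | image | image |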
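
To make the trade-off in the figures above easier to read, they can be normalized against the f16 run. The snippet below is a minimal sketch in Python and is not part of this PR: the dictionary simply copies the times and memory figures reported above, and `relative_to_baseline` is a hypothetical helper name chosen for illustration.

```python
# Times (s) and memory (MB) copied from the quantization results in this PR.
RESULTS = {
    "f16":  (13.25, 14829.53),
    "q8_0": (12.84, 8030.57),
    "q5_0": (12.88, 5310.99),
    "q5_1": (12.91, 5764.25),
    "q4_0": (12.86, 4404.46),
    "q4_1": (12.99, 4857.73),
}


def relative_to_baseline(results, baseline="f16"):
    """Return {type: (time_ratio, memory_ratio)} relative to the baseline type."""
    base_time, base_mem = results[baseline]
    return {
        name: (seconds / base_time, memory_mb / base_mem)
        for name, (seconds, memory_mb) in results.items()
    }


if __name__ == "__main__":
    for name, (time_ratio, mem_ratio) in relative_to_baseline(RESULTS).items():
        print(f"{name:5s} time x{time_ratio:.2f}  memory x{mem_ratio:.2f}")
```

On these numbers, q4_0 needs roughly 30% of the f16 memory, while the end-to-end time stays within about 3% of the f16 run for every type.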

img2img

command: ./build/bin/sd --mode img2img -m ../sd_models/sd3_medium_incl_clips_t5xxlfp16.safetensors -p "cat with blue eyes" -i ./assets/f32.png --strength 0.4 -o "output_sycl_img2img.png"

output:

[INFO ] stable-diffusion.cpp:1172 - generating image: 1/1 - seed 42
  |==================================================| 9/9 - 2.37it/s
[INFO ] stable-diffusion.cpp:1203 - sampling completed, taking 4.21s
[INFO ] stable-diffusion.cpp:1211 - generating 1 latent images completed, taking 4.58s
[INFO ] stable-diffusion.cpp:1214 - decoding 1 latents
[INFO ] stable-diffusion.cpp:1224 - latent 1 decoded, taking 0.75s
[INFO ] stable-diffusion.cpp:1228 - decode_first_stage completed, taking 0.75s
[INFO ] stable-diffusion.cpp:1424 - img2img completed in 1.87s

image
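
When comparing runs, it can help to pull the reported durations out of the log rather than reading them by eye. The following is a rough Python sketch, not something shipped with this PR; the line shape and the two duration phrasings are assumptions inferred only from the sample lines in this thread, and `extract_durations` is a hypothetical helper name.

```python
import re

# Assumed line shape, inferred only from the samples in this thread:
#   [LEVEL] file.cpp:LINE - message
LOG_LINE = re.compile(
    r"^\[(?P<level>\w+)\s*\]\s+(?P<file>[\w.-]+):(?P<line>\d+)\s+-\s+(?P<msg>.*)$"
)
# Durations appear as "taking 4.21s" or "completed in 1.87s".
DURATION = re.compile(r"(?:taking|completed in)\s+(?P<sec>\d+(?:\.\d+)?)s\b")


def extract_durations(log_text):
    """Return (message, seconds) pairs for log lines that report a duration."""
    pairs = []
    for raw in log_text.splitlines():
        line_match = LOG_LINE.match(raw.strip())
        if line_match is None:
            continue  # progress bars and blank lines are skipped
        duration = DURATION.search(line_match["msg"])
        if duration is not None:
            pairs.append((line_match["msg"], float(duration["sec"])))
    return pairs


if __name__ == "__main__":
    sample = """\
[INFO ] stable-diffusion.cpp:1203 - sampling completed, taking 4.21s
  |==================================================| 9/9 - 2.37it/s
[INFO ] stable-diffusion.cpp:1424 - img2img completed in 1.87s
"""
    for message, seconds in extract_durations(sample):
        print(f"{seconds:6.2f}s  {message}")
```

Lines that do not match the assumed shape are ignored, so the sketch degrades quietly if the log format differs from these samples.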

LCM-Lora

command: ./build/bin/sd -m ../sd_models/v1-5-pruned-emaonly.safetensors -p "a lovely cat<lora:lcm-lora-sdv1-5:1>" --steps 4 --lora-model-dir ../sd_models/lora/ -v --cfg-scale 1 -o "output_sycl_lcm_lora.png"
output: [INFO ] stable-diffusion.cpp:1328 - txt2img completed in 4.08s
image

PhotoMaker

command: ./build/bin/sd -m ../sd_models/sdxl/sdxlUnstableDiffusers_v11.safetensors --vae ../sd_models/sdxl/sdxl_vae.safetensors --stacked-id-embd-dir ../sd_models/photo_maker/photomaker-v1.safetensors --input-id-images-dir ./assets/photomaker_examples/scarletthead_woman/ -p "a girl img, retro futurism, retro game art style but extremely beautiful, intricate details, masterpiece, best quality, space-themed, cosmic, celestial, stars, galaxies, nebulas, planets, science fiction, highly detailed" -n "realistic, photo-realistic, worst quality, greyscale, bad anatomy, bad hands, error, text" --cfg-scale 5.0 --sampling-method euler -H 640 -W 640 --style-ratio 15 -o "output_sycl_photomaker.png"

output: [INFO ] stable-diffusion.cpp:1328 - txt2img completed in 17.38s
image

Upscale

command: ./build/bin/sd -m ../sd_models/sd3_medium_incl_clips_t5xxlfp16.safetensors -H 640 -W 640 -p "a lovely cat holding a sign says \"Stable Diffusion CPP\"" --cfg-scale 4.5 --sampling-method euler --upscale-model ../sd_models/upscale/RealESRGAN_x4plus_anime_6B.pth -v -o "output_sycl_sd3_upscale.png" --seed 100

output:

[INFO ] stable-diffusion.cpp:1328 - txt2img completed in 17.99s
[DEBUG] upscaler.cpp:28   - Using SYCL backend
[INFO ] upscaler.cpp:36   - Upscaler weight type: f16
[INFO ] esrgan.hpp:156  - loading esrgan from '../sd_models/upscale/RealESRGAN_x4plus_anime_6B.pth'
[DEBUG] ggml_extend.hpp:988  - esrgan params backend buffer size =   8.53 MB(VRAM) (192 tensors)
[INFO ] model.cpp:740  - load ../sd_models/upscale/RealESRGAN_x4plus_anime_6B.pth using checkpoint format
[DEBUG] model.cpp:1258 - init from '../sd_models/upscale/RealESRGAN_x4plus_anime_6B.pth'
[DEBUG] model.cpp:1389 - loading tensors from ../sd_models/upscale/RealESRGAN_x4plus_anime_6B.pth
[INFO ] esrgan.hpp:175  - esrgan model loaded
[INFO ] upscaler.cpp:50   - upscaling from (640 x 640) to (2560 x 2560)
[DEBUG] upscaler.cpp:64   - upscale work buffer size: 150.00 MB
[DEBUG] ggml_extend.hpp:496  - tile work buffer size: 3.19 MB
[DEBUG] ggml_extend.hpp:939  - esrgan compute buffer size: 416.00 MB(VRAM)
[INFO ] ggml_extend.hpp:510  - processing 49 tiles
  |==================================================| 49/49 - 6.99it/s
[INFO ] upscaler.cpp:79   - input_image_tensor upscaled, taking 7.37s

image
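
As a quick consistency check on the upscaler log, the progress line can be turned into an estimated duration by dividing the completed tiles by the reported rate. This is only an illustrative Python sketch under the assumption that the rate is printed in it/s as in the lines above; `estimated_seconds` is a hypothetical helper, not a function from this repository.

```python
import re

# Matches the tail of a progress line such as "| 49/49 - 6.99it/s".
# Only the it/s form seen in this thread is handled.
PROGRESS = re.compile(
    r"(?P<done>\d+)/(?P<total>\d+)\s+-\s+(?P<rate>\d+(?:\.\d+)?)it/s"
)


def estimated_seconds(progress_line):
    """Estimate elapsed seconds as completed steps divided by the it/s rate."""
    match = PROGRESS.search(progress_line)
    if match is None:
        return None
    return int(match["done"]) / float(match["rate"])


if __name__ == "__main__":
    line = "  |==================================================| 49/49 - 6.99it/s"
    print(f"{estimated_seconds(line):.2f}s")
```

For the upscaler line above this gives about 7.0 s, close to the 7.37 s the log reports for the whole upscale step.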

TAESD

command: ./build/bin/sd -m ../sd_models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" --taesd ../sd_models/taesd/diffusion_pytorch_model.safetensors -o "output_sycl_taesd.png"
output: [INFO ] stable-diffusion.cpp:1328 - txt2img completed in 7.80s
image

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
@luoyu-intel

Do we have any reference performance data? For example, what is the latency on an A100 for each model?

@zhentaoyu
Contributor Author

> Do we have any reference performance data? For example, what is the latency on an A100 for each model?

I only have an A30 for now. I will add a table soon.

@Nuullll Nuullll left a comment

I happened to be working on this recently.
LGTM! The code change is basically identical to my local version :-P

FP16 performance is bad -- it's even worse than using fp32. We can address that later.

@zhentaoyu
Contributor Author

zhentaoyu commented Aug 5, 2024

Test commands are from the text2img section of the PR description.

| model | GPU Max 1100 (total sec) |
| --- | --- |
| sd1-4 | 8.35 |
| sd1-5 | 8.44 |
| sd2 | 6.04 |
| sd_xl | 12.46 |
| sd3 | 13.25 |

@Green-Sky
Contributor

Green-Sky commented Aug 5, 2024

Could you please add units to the performance data? Until they are stated, the numbers could be it/s, s/it, or total seconds.

@zhentaoyu
Contributor Author

> Could you please add units to the performance data? Until they are stated, the numbers could be it/s, s/it, or total seconds.

Sorry, added.

@airMeng

airMeng commented Aug 8, 2024

@leejet, is anything else needed before this can be merged?

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
@zhentaoyu zhentaoyu marked this pull request as ready for review August 9, 2024 02:02
@zhentaoyu
Contributor Author

This PR is ready for review now. There is still one issue: the SYCL backend cannot generate images with large H and W. I will try to fix it in a later PR (most of the work is on the ggml side).

@zhentaoyu
Contributor Author

@leejet, could you please take a look at this PR when you have free time? Thanks a lot.

@leejet
Owner

leejet commented Aug 10, 2024

Excellent work! Thank you for your contribution.

@leejet leejet merged commit 697d000 into leejet:master Aug 10, 2024
8 checks passed
@cheeseng

I tried to use my Intel GPU with this. The build succeeded, but when I tried to run it I got the following error:

[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: yes
can not find preferred GPU platform
Exception caught at file:/home/cheeseng/stable-diffusion.cpp/ggml/src/ggml-sycl.cpp, line:1859, func:operator()
terminate called after throwing an instance of 'std::runtime_error'
  what():  can not find preferred GPU platform
Aborted (core dumped)

Is there anything I missed? Here is my GPU info:

~$ lspci -k | grep -EA3 'VGA|3D|Display'
00:02.0 VGA compatible controller: Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630]
	DeviceName: Onboard - Video
	Subsystem: Tongfang Hongkong Limited CoffeeLake-H GT2 [UHD Graphics 630]
	Kernel driver in use: i915
--
01:00.0 VGA compatible controller: NVIDIA Corporation GP107M [GeForce GTX 1050 Mobile] (rev a1)
	Subsystem: Tongfang Hongkong Limited GP107M [GeForce GTX 1050 Mobile]
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

@zhentaoyu
Contributor Author

Hi, @cheeseng, does sycl-ls print anything?

@cheeseng

@zhentaoyu Sorry I just noticed this, here's my sycl-ls output:

~$ sycl-ls
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]

@SkutteOleg SkutteOleg mentioned this pull request Aug 20, 2024
@zhentaoyu
Contributor Author

Hi, @cheeseng, your sycl-ls output does not show any level_zero:gpu device, only an OpenCL CPU device. You can refer to this doc for more information.
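
The check described in this reply can be sketched in a few lines of Python. This is an illustration rather than part of the PR: it assumes that a usable device shows up in `sycl-ls` output with the substring `level_zero:gpu`, as in the `[level_zero:gpu:0]` ID from the PR description, and `level_zero_gpu_lines` is a hypothetical helper name.

```python
def level_zero_gpu_lines(sycl_ls_output):
    """Return the `sycl-ls` output lines that mention a Level Zero GPU device."""
    return [
        line.strip()
        for line in sycl_ls_output.splitlines()
        if "level_zero:gpu" in line
    ]


if __name__ == "__main__":
    # Output reported earlier in this thread: only an OpenCL CPU device.
    reported = (
        "[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) i7-8750H "
        "CPU @ 2.20GHz OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]"
    )
    print(level_zero_gpu_lines(reported))  # prints []
```

To run it against a live system, the output could be captured with `subprocess.run(["sycl-ls"], capture_output=True, text=True).stdout`; an empty result would match the situation reported here.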

SkutteOleg pushed a commit to SkutteOleg/stable-diffusion.cpp that referenced this pull request Aug 21, 2024
* update ggml and add SYCL CMake option

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* hacky CMakeLists.txt for updating ggml in cpu backend

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* rebase and clean code

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* add sycl in README

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* rebase ggml commit

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* refine README

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* update ggml for supporting sycl tsembd op

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

---------

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
SkutteOleg pushed a commit to SkutteOleg/stable-diffusion.cpp that referenced this pull request Aug 21, 2024
SkutteOleg pushed a commit to SkutteOleg/stable-diffusion.cpp that referenced this pull request Aug 24, 2024