Description
Feature Description
Hi community, following the discussion in #3965, we plan to contribute a native SYCL backend to llama.cpp.
Motivation
Intel Arc series GPUs provide substantial VRAM capacity and memory bandwidth, which the current OpenCL backend cannot fully utilize, especially for LLM inference. We expect a significant performance improvement from a native SYCL backend.
Possible Implementation
Native Kernels
We will implement the key GGML operators in SYCL, similar to the approach used for the Metal and Vulkan backends. The steps are as follows:
- add a new backend with host-to-device (h2d) and device-to-host (d2h) data transfers
- FP32 and FP16 GEMM via oneMKL with the DPC++ interface (see the first sketch after this list)
- native SYCL kernels for de-quantization (see the second sketch below)
- native SYCL kernels for the remaining operators
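A minimal sketch of the first two steps, not the actual backend code: it assumes the oneAPI Base Toolkit and a SYCL-visible GPU, and shows queue creation, h2d/d2h copies with USM, and an FP32 GEMM through oneMKL's DPC++ interface. All sizes and names here are illustrative only.

```cpp
#include <sycl/sycl.hpp>
#include <oneapi/mkl.hpp>
#include <vector>

int main() {
    sycl::queue q{sycl::gpu_selector_v};

    const int m = 4, n = 4, k = 4;
    std::vector<float> a(m * k, 1.0f), b(k * n, 1.0f), c(m * n, 0.0f);

    // Device allocations and h2d copies (USM).
    float *da = sycl::malloc_device<float>(m * k, q);
    float *db = sycl::malloc_device<float>(k * n, q);
    float *dc = sycl::malloc_device<float>(m * n, q);
    q.memcpy(da, a.data(), m * k * sizeof(float)).wait();
    q.memcpy(db, b.data(), k * n * sizeof(float)).wait();

    // C = alpha * A * B + beta * C via oneMKL's SYCL GEMM.
    oneapi::mkl::blas::column_major::gemm(
        q, oneapi::mkl::transpose::nontrans, oneapi::mkl::transpose::nontrans,
        m, n, k, 1.0f, da, m, db, k, 0.0f, dc, m).wait();

    // d2h copy of the result.
    q.memcpy(c.data(), dc, m * n * sizeof(float)).wait();

    sycl::free(da, q); sycl::free(db, q); sycl::free(dc, q);
}
```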
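And a sketch of a native SYCL de-quantization kernel over a toy 4-bit block format (one FP32 scale per 32 packed 4-bit values). The `block_q4_toy` layout is hypothetical and only illustrates the `parallel_for` structure; the real kernels would mirror GGML's quantization formats (e.g., Q4_0).

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>
#include <cstddef>

// Hypothetical block layout, for illustration only.
struct block_q4_toy {
    float   d;       // per-block scale
    uint8_t qs[16];  // 32 packed 4-bit values
};

void dequantize(sycl::queue &q, const block_q4_toy *blocks,
                float *out, size_t nblocks) {
    // One work-item per block; unpack both nibbles and apply the scale.
    q.parallel_for(sycl::range<1>{nblocks}, [=](sycl::id<1> i) {
        const block_q4_toy &b = blocks[i];
        for (int j = 0; j < 16; ++j) {
            const int lo = (b.qs[j] & 0x0F) - 8;  // low nibble
            const int hi = (b.qs[j] >> 4)   - 8;  // high nibble
            out[i * 32 + j * 2 + 0] = lo * b.d;
            out[i * 32 + j * 2 + 1] = hi * b.d;
        }
    }).wait();
}
```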
Note:
Since llama.cpp has been evolving rapidly and new features will likely land in the CUDA backend first, we plan to use SYCLomatic to help migrate code from CUDA to SYCL, roughly as illustrated below.
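A rough, hand-simplified illustration of the CUDA-to-SYCL mapping that SYCLomatic automates (actual tool output routes through its dpct helper headers; the kernel and names here are made up for illustration):

```cpp
// CUDA source:
//
//   __global__ void scale(float *x, float s, int n) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i < n) x[i] *= s;
//   }
//   scale<<<(n + 255) / 256, 256>>>(x, s, n);
//
// maps to a SYCL nd_range kernel:
#include <sycl/sycl.hpp>

void scale(sycl::queue &q, float *x, float s, int n) {
    const int block = 256;                    // blockDim.x
    const int grid  = (n + block - 1) / block; // gridDim.x
    q.parallel_for(
        sycl::nd_range<1>{sycl::range<1>(grid * block), sycl::range<1>(block)},
        [=](sycl::nd_item<1> it) {
            const int i = it.get_global_id(0);
            if (i < n) x[i] *= s;
        }).wait();
}
```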
As a next stage, we plan to introduce a template-based library such as XeTLA, as mentioned in #3965; in this proposal, we will focus on native SYCL support.
Summary
We have started working on native SYCL kernels and on enabling a SYCL backend in llama.cpp for Intel GPUs. Please feel free to drop a note. Thanks.