SYCL: Add non contiguous support in RMS_NORM and NORM kernels #13611
base: master
Conversation
It now seems to pass with ne[0] = 1920
Were you able to measure the impact on performance for this change? If it has one we may want to introduce different paths for contiguous and non-contiguous cases.
I'll try to check on my side for some relevant sizes at some point.
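For illustration, this is roughly what a split between the two cases could look like at dispatch time, using the real ggml_is_contiguous() helper; the two launch_* functions are hypothetical placeholders, not code from this PR:

```cpp
#include "ggml.h"

// Hypothetical placeholders for a contiguous fast path and a strided path.
void launch_norm_f32_contiguous(const ggml_tensor * src0, ggml_tensor * dst);
void launch_norm_f32_strided(const ggml_tensor * src0, ggml_tensor * dst);

// Pick the kernel based on the input layout.
static void norm_f32_dispatch(const ggml_tensor * src0, ggml_tensor * dst) {
    if (ggml_is_contiguous(src0)) {
        launch_norm_f32_contiguous(src0, dst);  // no per-element stride math needed
    } else {
        launch_norm_f32_strided(src0, dst);     // general strided indexing
    }
}
```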
Force-pushed from 4f9b1bc to d5d39b5
Doesn't seem much different from master:
[llama.cpp][master]$ build/bin/llama-bench -ngl 99 -m ~/Downloads/Weights/pythia-1.4b-q4_0.gguf
build: 92ecdcc (5423)
[llama.cpp][sycl/non_cont_norms]$ build/bin/llama-bench -ngl 99 -m ~/Downloads/Weights/pythia-1.4b-q4_0.gguf
build: this PR
(benchmark tables omitted)
I ran a couple of models and don't see any regressions there. It seems the changes don't affect the existing codepath (e2e); results table omitted.
ggml/src/ggml-sycl/norm.cpp
Outdated
if (block_size > WARP_SIZE) {
    int warp_id = tid / WARP_SIZE;
    int lane_id = tid % WARP_SIZE;
I know this was already in the existing code, but in case you'd like to change it to a more SYCL-like fashion, you can remove the if (block_size > WARP_SIZE) and do

auto sub_group = item_ct1.get_sub_group();
auto sg_id = sub_group.get_group_linear_id();
auto wi_in_sg = sub_group.get_local_linear_id(); // and this is the same as lane_id

You can refer to https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sub-group-class
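For reference, a sketch of what a two-stage work-group sum could look like with the sub-group API instead of manual warp_id/lane_id arithmetic (illustration only, not code from this PR; s_partials is assumed to be local memory sized to the number of sub-groups):

```cpp
#include <sycl/sycl.hpp>

template <typename Item>
float work_group_sum(Item item_ct1, float thread_val, float * s_partials) {
    auto sub_group = item_ct1.get_sub_group();

    // Stage 1: reduce within each sub-group.
    float v = sycl::reduce_over_group(sub_group, thread_val, sycl::plus<float>());

    // Stage 2: lane 0 of each sub-group publishes its partial, then the first
    // sub-group reduces those partials (assumes the number of sub-groups is
    // no larger than the sub-group size).
    const size_t sg_id    = sub_group.get_group_linear_id();
    const size_t wi_in_sg = sub_group.get_local_linear_id();
    if (wi_in_sg == 0) {
        s_partials[sg_id] = v;
    }
    sycl::group_barrier(item_ct1.get_group());

    float total = 0.f;
    if (sg_id == 0) {
        const size_t n_sg = sub_group.get_group_linear_range();
        float p = wi_in_sg < n_sg ? s_partials[wi_in_sg] : 0.f;
        total = sycl::reduce_over_group(sub_group, p, sycl::plus<float>());
    }
    return total;  // valid in sub-group 0; broadcast via local memory if every work-item needs it
}
```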
I don't think this will work with every ne[0] size:
NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000000): [NORM] inf mismatch: SYCL0=inf CPU=0.103443 FAIL
NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000000): OK
NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000000): [NORM] inf mismatch: SYCL0=inf CPU=1.217545 FAIL
NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000000): [NORM] inf mismatch: SYCL0=-inf CPU=-0.915058 FAIL
NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000001): [NORM] NMSE = 0.996613428 > 0.000000100 FAIL
NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000001): OK
NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000001): [NORM] NMSE = 0.996409533 > 0.000000100 FAIL
NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000001): [NORM] NMSE = 0.996592679 > 0.000000100 FAIL
NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000100): [NORM] NMSE = 0.966021524 > 0.000000100 FAIL
NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000100): OK
NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000100): [NORM] NMSE = 0.965700277 > 0.000000100 FAIL
NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000100): [NORM] NMSE = 0.965806282 > 0.000000100 FAIL
NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.100000): [NORM] NMSE = 0.282048698 > 0.000000100 FAIL
NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.100000): OK
NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.100000): [NORM] NMSE = 0.285614294 > 0.000000100 FAIL
NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.100000): [NORM] NMSE = 0.270144864 > 0.000000100 FAIL
RMS_NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000000): [RMS_NORM] inf mismatch: SYCL0=-inf CPU=-0.151891 FAIL
RMS_NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000000): OK
RMS_NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000000): [RMS_NORM] inf mismatch: SYCL0=inf CPU=1.350047 FAIL
RMS_NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000000): [RMS_NORM] inf mismatch: SYCL0=inf CPU=1.245914 FAIL
RMS_NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000001): [RMS_NORM] NMSE = 0.999657826 > 0.000000100 FAIL
RMS_NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000001): OK
RMS_NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000001): [RMS_NORM] NMSE = 0.999642137 > 0.000000100 FAIL
RMS_NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000001): [RMS_NORM] NMSE = 0.999654538 > 0.000000100 FAIL
RMS_NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000100): [RMS_NORM] NMSE = 0.996574387 > 0.000000100 FAIL
RMS_NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000100): OK
RMS_NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000100): [RMS_NORM] NMSE = 0.996428322 > 0.000000100 FAIL
RMS_NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000100): [RMS_NORM] NMSE = 0.996550262 > 0.000000100 FAIL
RMS_NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.100000): [RMS_NORM] NMSE = 0.892706561 > 0.000000100 FAIL
RMS_NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.100000): OK
RMS_NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.100000): [RMS_NORM] NMSE = 0.893457130 > 0.000000100 FAIL
RMS_NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.100000): [RMS_NORM] NMSE = 0.893824981 > 0.000000100 FAIL
Will try again tomorrow.
Unfortunately I had to keep the if condition. block_size here is the work group size taken from ggml_sycl_info().max_work_group_sizes[device].
It's fine either way, but just FYI if that's the case then that condition will always evaluate to true, as the max_wg_size on a device will always be greater than the sub-group size.
I wonder why removing the condition breaks all ne[0] = 64 inputs?🤔
I looked at some common sizes used with …
PR: (benchmark table omitted)
There is a bit of noise but overall there is some overhead. From what we've seen …
Btw @qnixsynapse, out of curiosity, are you aware of models needing this support, or are you working on closing some gaps in the SYCL backend regardless of whether this is used today?
ggml/src/ggml-sycl/norm.cpp
Outdated
const int nwarps = nthreads / WARP_SIZE;

x += sample*stride_sample + channel*stride_channel + row*stride_row;
I suspect that as we start adding support for non-contiguous layouts, this calculation will become more common. Do you think it's worth adding something similar to the following to common.hpp?
llama.cpp/ggml/src/ggml-sycl/binbcast.cpp, line 42 in a4090d1:
auto calculate_index = [](const std::array<int, 4> & dims, const std::array<int, 4> & strides, …
I think this would be useful. There is also similar code for the non-contiguous batched mul_mats.
Added in latest change.
@Rbiessy Of course, closing some gaps in the SYCL backend just in case. It is also possible to use a different path/kernel for non-contiguous inputs if it significantly affects performance.
Force-pushed from d5d39b5 to 37b9ba6
I think it's fine as it is. There is just one unresolved comment but otherwise looks good to me.
This reverts commit 43be2d6.
ggml-ci
Force-pushed from 3ad3415 to ea051b7
/* Helper for computing the linear offset into a 4-dimensional ggml_tensor given
   per-dimension sizes, strides, and indices */
template<int N>
static __dpct_inline__ size_t calculate_offset(const std::array<int, N> & strides, const std::array<int, N> & indices) {
Nit, but I don't think it makes sense to mark the function as static in a header; it's just going to be ignored.
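For context, a body along these lines would match the quoted signature (a sketch, not the PR's actual implementation, using plain inline in place of static __dpct_inline__ so the snippet stands alone):

```cpp
#include <array>
#include <cstddef>

// Linear offset as the dot product of per-dimension strides and indices.
template <int N>
inline size_t calculate_offset(const std::array<int, N> & strides, const std::array<int, N> & indices) {
    size_t offset = 0;
    for (int i = 0; i < N; i++) {
        offset += static_cast<size_t>(strides[i]) * static_cast<size_t>(indices[i]);
    }
    return offset;
}

// Example use for the row indexing quoted earlier (names from that snippet):
//   x += calculate_offset<3>({stride_sample, stride_channel, stride_row},
//                            {sample, channel, row});
```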
    item_ct1.barrier(sycl::access::fence_space::local_space);
    mean_var = 0.f;
-   size_t nreduce = nwarps / WARP_SIZE;
+   const size_t nreduce = (nwarps + WARP_SIZE - 1) / WARP_SIZE;
You can use ceil_div here.
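Assuming ceil_div has the usual rounding-up semantics, i.e. ceil_div(a, b) == (a + b - 1) / b, the changed line would become:

```cpp
const size_t nreduce = ceil_div(nwarps, WARP_SIZE);
```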
Added non-contiguous support in the RMS_NORM and NORM kernels. test-backend-ops seems to pass with this change.

Edit: restored the logic for handling multiple subgroups correctly, which was not tested by test-backend-ops.