llama : support GEGLU for jina-bert-v2 #14090
Conversation
ggml-ci
src/llama-graph.cpp (outdated)
```cpp
ggml_tensor * x0 = ggml_view_2d(ctx0, cur, split_point, cur->ne[1], cur->nb[1], 0);
ggml_tensor * x1 = ggml_view_2d(ctx0, cur, split_point, cur->ne[1], cur->nb[1], split_point * ggml_element_size(cur));
```
Removing the conts works because the ops will be offloaded back to the CPU if the tensor is not contiguous:
llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu
Lines 2972 to 2993 in 2bb0467
```cpp
switch (op->op) {
    case GGML_OP_UNARY:
        switch (ggml_get_unary_op(op)) {
            case GGML_UNARY_OP_ABS:
            case GGML_UNARY_OP_SGN:
            case GGML_UNARY_OP_NEG:
            case GGML_UNARY_OP_STEP:
            case GGML_UNARY_OP_GELU:
            case GGML_UNARY_OP_SILU:
            case GGML_UNARY_OP_RELU:
            case GGML_UNARY_OP_SIGMOID:
            case GGML_UNARY_OP_HARDSIGMOID:
            case GGML_UNARY_OP_HARDSWISH:
            case GGML_UNARY_OP_GELU_ERF:
            case GGML_UNARY_OP_GELU_QUICK:
            case GGML_UNARY_OP_TANH:
            case GGML_UNARY_OP_EXP:
                return ggml_is_contiguous(op->src[0]);
            default:
                return false;
        }
        break;
```
For now let's keep the conts and expand the TODO to reference this comment, so that we later update the backend implementations to support those ops on non-contiguous tensors.
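For illustration, a rough sketch of what that could look like in the GEGLU branch, keeping the conts and expanding the TODO (variable names follow the diff above; which half receives the activation is an assumption here, not taken from this PR):

```cpp
// TODO: drop these conts once the backend implementations support unary ops on
//       non-contiguous tensors (see the CUDA supports_op check referenced above).
ggml_tensor * x0 = ggml_cont(ctx0, ggml_view_2d(ctx0, cur, split_point, cur->ne[1], cur->nb[1], 0));
ggml_tensor * x1 = ggml_cont(ctx0, ggml_view_2d(ctx0, cur, split_point, cur->ne[1], cur->nb[1],
                                                split_point * ggml_element_size(cur)));

x0  = ggml_gelu(ctx0, x0);     // GEGLU: GELU on one half...
cur = ggml_mul(ctx0, x0, x1);  // ...gated by an element-wise product with the other half
```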
@ggerganov Just to make sure I'm on the right track here: I started looking into this yesterday, and rather than (or maybe in addition to) making these ops handle non-contiguous data, it seems much simpler (and more efficient) to add fused GEGLU/SWIGLU ops.
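For reference, this is the per-row computation such a fused op would perform, written as a plain scalar sketch rather than ggml API (the `gelu`/`silu` helpers and the choice of which half acts as the gate are assumptions for illustration):

```cpp
#include <cmath>
#include <cstdint>

// GELU (tanh approximation) and SiLU as plain scalar helpers.
inline float gelu(float x) { return 0.5f*x*(1.0f + std::tanh(0.7978845608f*(x + 0.044715f*x*x*x))); }
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Fused GEGLU over one row: the input holds both halves back to back, and the
// output is gelu(x0[i]) * x1[i] computed in a single pass, with no need to make
// contiguous copies of the halves first. Swap gelu() for silu() to get SWIGLU.
inline void geglu_row(const float * src, float * dst, int64_t split_point) {
    const float * x0 = src;                // first half: activated
    const float * x1 = src + split_point;  // second half: gate
    for (int64_t i = 0; i < split_point; ++i) {
        dst[i] = gelu(x0[i]) * x1[i];
    }
}
```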
Yes, good point.
* origin/master:
  llama : support GEGLU for jina-bert-v2 (ggml-org#14090)
  vulkan: force device 0 in CI (ggml-org#14106)
  Fixed spec timings to: accepted/tested instead of accepted/drafted (ggml-org#14104)
  sync : ggml
  ggml : fix weak alias win32 (whisper/0)
  Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (ggml-org#14099)
  rpc : nicer error messages for RPC server crash (ggml-org#14076)
  sync : ggml
  Add in-build ggml::ggml ALIAS library (ggml/1260)
  metal : use less stack memory in FA kernel (ggml-org#14088)
  kv-cache : fix shift and defrag logic (ggml-org#14081)
  llama : allow building all tests on windows when not using shared libs (ggml-org#13980)
Clean up conversion of `jina-bert-v2` models now that we have `LLM_FFN_GEGLU`, as we can easily support both old and newly converted models.

Removed `ggml_cont` in `LLM_FFN_SWIGLU` and `LLM_FFN_GEGLU`; they are unnecessary.

Edit: Tested without `ggml_cont` on CPU and CUDA with `jina-reranker-v1` (GEGLU) and `Phi-3.5-mini-instruct` (SWIGLU).
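A minimal sketch of how the activation is then applied directly to those views once the conts are gone (building on the two `ggml_view_2d` calls quoted earlier; which half is activated is an assumption for illustration):

```cpp
// With the ggml_cont calls removed, the activation runs directly on the half-views;
// backends that cannot handle non-contiguous inputs fall back to the CPU implementation.
x0  = ggml_gelu(ctx0, x0);     // LLM_FFN_GEGLU; LLM_FFN_SWIGLU would use ggml_silu here
cur = ggml_mul(ctx0, x0, x1);  // gate the activated half with the other half
```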