Using Accelerate for vector scale #193
I naively thought adding an "#if defined" guard at the top and setting the types correctly for vDSP_vsmul would solve the problem easily. But when I modify the code like the following, I get a segmentation fault. What do you think is the problem?
Try narrowing that down into a standalone C++ or Swift program. Does the fault still happen? Btw, I think you could vastly improve the softmax part by writing vectorized code that fuses the kernel calls. Calling into Accelerate this way makes it memory bound, with most of the time spent reading and writing everything from L1.
I don’t know why, but LLaMa.cpp is much slower than it should theoretically be. Going by @ggerganov’s CPU bandwidth figure (200 GB/s), the CPU cores should eat the entire 6.7B-q4.5 model in 16 ms. But for some reason the token latency is 43 ms. That’s a 2-3x speedup we could get by redesigning the code, not just an incremental one.
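For reference, the back-of-envelope math behind that 16 ms figure (assuming roughly 4 bits per weight; the exact number depends on the quantization overhead):

$$ \frac{6.7 \times 10^{9}\ \text{weights} \times 0.5\ \text{B/weight}}{200\ \text{GB/s}} \approx 17\ \text{ms per token} $$

That is the floor if every byte of the weights is streamed exactly once per token at full memory bandwidth.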
That number is Apple's claimed memory bandwidth for the M1 Pro, if I remember correctly. And regarding a single thread, it's no more than 40 GB/s.
A single thread has reached 100 GB/s in some benchmarks. When it's occupied with other work, or the code is improperly written, it can't utilize all of that. But then there are 8 cores in total to harness that bandwidth. On the GPU (M1 Max), I have achieved 378 GB/s out of 400 GB/s in a custom Metal blit command. It requires careful tuning: aligning the data structures to 64B boundaries (from what I can tell, LLaMa.cpp is not aligned), and going so far as to shuffle data around in threadgroup memory just so that whatever the kernel reads and writes is 64B-aligned.
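As a rough illustration, single-core bandwidth is easy to probe yourself; here is a minimal Swift sketch (buffer size and names are arbitrary, not a benchmark from this thread):

import Accelerate
import Foundation

// Sum a buffer much larger than the caches, so the time is dominated by
// streaming reads from DRAM; bytes moved / seconds ~ single-thread read bandwidth.
let count = 64 * 1024 * 1024 // 64M floats = 256 MB
let buffer = [Float](repeating: 1, count: count) // repeating: 1 forces the pages to be written
var sum: Float = 0
let start = Date()
buffer.withUnsafeBufferPointer { buf in
    vDSP_sve(buf.baseAddress!, 1, &sum, vDSP_Length(count))
}
let seconds = Date().timeIntervalSince(start)
print("~\(Double(count * MemoryLayout<Float>.stride) / seconds / 1e9) GB/s (checksum \(sum))")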
Shame! It should pass the scalar by pointer (&v); vDSP_vsmul expects a pointer to the scalar value. No segmentation fault anymore!
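For anyone hitting the same thing, a minimal standalone Swift check of the call (illustrative values; the real fix lives in ggml.c):

import Accelerate

var data: [Float] = [1, 2, 3, 4]
var scale: Float = 0.5
// Note the third argument: vDSP_vsmul takes the scalar by pointer.
// In C, passing the float value itself where a pointer is expected is what crashes.
data.withUnsafeMutableBufferPointer { buf in
    vDSP_vsmul(buf.baseAddress!, 1, &scale, buf.baseAddress!, 1, vDSP_Length(buf.count))
}
print(data) // [0.5, 1.0, 1.5, 2.0]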
I ran it several times, but GGML_SIMD tends to be faster than vDSP_vsmul (comparing the timing output of both versions).
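If anyone wants to reproduce this, a rough Swift harness along these lines works (sizes and iteration counts are arbitrary; the real GGML_SIMD path is C macros, so the manual SIMD4 loop below is only a stand-in):

import Accelerate
import Foundation

let n = 4096
var x = [Float](repeating: 1, count: n)
var scale: Float = 1.0001
let iters = 100_000

// vDSP path: one library call per scale.
var start = Date()
x.withUnsafeMutableBufferPointer { buf in
    for _ in 0..<iters {
        vDSP_vsmul(buf.baseAddress!, 1, &scale, buf.baseAddress!, 1, vDSP_Length(n))
    }
}
print("vDSP_vsmul: \(Date().timeIntervalSince(start)) s")

// Manual vector loop: SIMD4 stays within the 16B alignment Array guarantees.
start = Date()
x.withUnsafeMutableBufferPointer { buf in
    let v = UnsafeMutableRawPointer(buf.baseAddress!).assumingMemoryBound(to: SIMD4<Float>.self)
    for _ in 0..<iters {
        for j in 0..<n / 4 { v[j] *= scale }
    }
}
print("SIMD loop: \(Date().timeIntervalSince(start)) s")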
Try replacing some other Accelerate calls with vectorized code. Bonus points if you can fuse two elementwise operations of the softmax without writing the elements back to memory in between.

// NOTE: Softmax is expected to consume the most time, due to the latency of
// each function call and inability to keep the elements in registers.
// Consider writing vectorized Swift code for a fairer comparison to GPU.
// Pseudocode for softmax operation:
// (1) find maximum element in each row
// (2) subtract the maximum from all elements
// (3) apply the exponential operator to all elements
// (4) find the sum of each row
// (5) divide all elements by the sum
for i in 0..<UInt(NQ) {
// The elements to operate on.
let n = UInt(NKV)
let row = _QK + Int(i * n)
// (1)
var maxValue: Float = 0
vDSP_maxv(row, 1, &maxValue, n)
assert(maxValue != 0)
// (2)
maxValue = -maxValue
vDSP_vsadd(row, 1, &maxValue, row, 1, n)
// (3)
vvexpf(row, row, &NKV)
// (4)
var sumValue: Float = 0
vDSP_sve(row, 1, &sumValue, n)
// (5)
sumValue = simd_precise_recip(sumValue)
vDSP_vsmul(row, 1, &sumValue, row, 1, n)
}

Becomes:

for i in 0..<UInt(NQ) {
// The elements to operate on.
let n = UInt(NKV)
let row = _QK + Int(i * n)
// (1)
var maxValue: Float = 0
vDSP_maxv(row, 1, &maxValue, n)
assert(maxValue != 0)
// PSEUDOCODE STARTS
typealias Vector = SIMD16<Float> // Try multiple vector lengths.
var sumValueVec: Vector = .zero
for j in 0..<Int(n) / Vector.scalarCount { // TODO: Handle the last iteration carefully.
let offset = j * Vector.scalarCount
let pointer = UnsafeMutableRawPointer(row + offset).assumingMemoryBound(to: Vector.self)
// (2)
// (3)
let value = exp(pointer.pointee - maxValue)
pointer.pointee = value
// (4)
sumValueVec += value
}
var sumValue: Float = sumValueVec.sum()
// PSEUDOCODE ENDS
// (5)
sumValue = simd_precise_recip(sumValue)
vDSP_vsmul(row, 1, &sumValue, row, 1, n)
}
Why are you casting? It seems redundant.
* Add AVX2 version of ggml_vec_dot_q4_1
* Small optimisations to q4_1 dot product (@Const-me)
* Rearrange Q4_1 quantization to work for multipart models. (Fix ggerganov#152)
* Fix ggml_vec_mad_q4_1 too
* Fix non-vectorised q4_1 vec mul
We could use Accelerate to scale the vector here, similarly to how add and exp use Accelerate.

ggml/src/ggml.c, lines 3250 to 3277 in 2992df0
https://developer.apple.com/documentation/accelerate/1450020-vdsp_vsmul
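For what it's worth, a minimal sketch of the idea in Swift (the actual change would be C inside ggml_vec_scale_f32, behind the GGML_USE_ACCELERATE guard; vecScaleF32 here is just an illustrative name):

import Accelerate

// y[i] *= v for n elements, in place.
// vDSP_vsmul wants the scalar by pointer, hence the mutable local copy.
func vecScaleF32(_ y: UnsafeMutablePointer<Float>, _ v: Float, _ n: Int) {
    var v = v
    vDSP_vsmul(y, 1, &v, y, 1, vDSP_Length(n))
}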