Using Accelerate for vector scale #193

Closed
philipturner opened this issue May 24, 2023 · 9 comments · Fixed by #380
Labels
enhancement New feature or request good first issue Good for newcomers


@philipturner

We could use Accelerate to scale the vector here, similarly to how the add and exp operations already use Accelerate.

ggml/src/ggml.c

Lines 3250 to 3277 in 2992df0

inline static void ggml_vec_scale_f32(const int n, float * y, const float v) {
#if defined(GGML_SIMD)
const int np = (n & ~(GGML_F32_STEP - 1));
GGML_F32_VEC vx = GGML_F32_VEC_SET1(v);
GGML_F32_VEC ay[GGML_F32_ARR];
for (int i = 0; i < np; i += GGML_F32_STEP) {
for (int j = 0; j < GGML_F32_ARR; j++) {
ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);
ay[j] = GGML_F32_VEC_MUL(ay[j], vx);
GGML_F32_VEC_STORE(y + i + j*GGML_F32_EPR, ay[j]);
}
}
// leftovers
for (int i = np; i < n; ++i) {
y[i] *= v;
}
#else
// scalar
for (int i = 0; i < n; ++i) {
y[i] *= v;
}
#endif
}

https://developer.apple.com/documentation/accelerate/1450020-vdsp_vsmul
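A hedged sketch of what the proposed branch could look like (the `GGML_USE_ACCELERATE` guard mirrors the style of the existing Accelerate branches in ggml.c; this is a sketch, not the final patch — the fallback keeps the current scalar path so it compiles everywhere):

```c
#include <stddef.h>

#if defined(GGML_USE_ACCELERATE)
#include <Accelerate/Accelerate.h>
#endif

// Scale y[0..n) by v, delegating to vDSP_vsmul when Accelerate is available.
static void vec_scale_f32(const int n, float * y, const float v) {
#if defined(GGML_USE_ACCELERATE)
    // vDSP_vsmul(A, IA, B, C, IC, N) computes C[i*IC] = A[i*IA] * B[0].
    // B is a pointer to the scalar; no cast is needed since B is const float *.
    vDSP_vsmul(y, 1, &v, y, 1, n);
#else
    // scalar fallback, identical to the existing #else branch
    for (int i = 0; i < n; ++i) {
        y[i] *= v;
    }
#endif
}
```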

@ggerganov added the "enhancement: New feature or request" and "good first issue: Good for newcomers" labels on May 25, 2023
@jaeminSon
Contributor

I naively thought that adding an "#if defined" at the top and setting the type correctly for vDSP_vsmul would solve the problem easily. But when I modify the code as follows, I get a segmentation fault. What do you think the problem is?

inline static void ggml_vec_scale_f32(const int n, float * y, const float v) {
#if defined(GGML_USE_ACCELERATE)
    vDSP_vsmul(y, 1, y, (float*) &v, 1, n);
#elif defined(GGML_SIMD)
    // ... code below unchanged

@philipturner
Author

philipturner commented May 26, 2023

Try narrowing that into a standalone C++ or Swift program. Does the fault still happen?

Btw, I think you could vastly improve the softmax part by writing vectorized code that fuses the kernel calls. Calling into Accelerate this way makes it memory-bound, with most of the time spent reading and writing everything from L1.

@philipturner
Author

I don’t know why, but llama.cpp is much slower than it theoretically should be. Going by @ggerganov’s CPU bandwidth (200 GB/s), the CPU cores should be able to read the entire 6.7B-q4.5 model in 16 ms. But for some reason the token latency is 43 ms.

That’s a 2-3x speedup we could get by redesigning the code, not just an incremental improvement.

@ggerganov
Owner

ggerganov commented May 26, 2023

Going by @ggerganov’s CPU bandwidth (200 GB/s)

This number is Apple's claimed memory bandwidth for the M1 Pro, if I remember correctly.
I haven't been able to reproduce that speed; the best I've seen is ~80-90 GB/s: ggerganov/llama.cpp#34 (comment)

And a single thread gets no more than 40 GB/s.

@philipturner
Author

philipturner commented May 26, 2023

A single thread has reached 100 GB/s in some benchmarks. When it’s occupied with other work, or when the code is improperly written, it can’t utilize all of that. But then there are 8 cores in total to harness that bandwidth.

On the GPU (M1 Max), I have achieved 378 GB/s out of 400 GB/s in a custom Metal blit command. It requires careful tuning, such as aligning the data structures to 64 B boundaries. From what I can tell, llama.cpp’s data is not aligned.

https://github.com/philipturner/metal-usm/blob/23e9f324fd3e4ecb1078cf6b211bd25753de718b/BlitEncoderAlternative/MainFile.swift#L31-L60

Going so far as to shuffle data around in threadgroup memory, just so whatever it eats and spits out is 64B aligned:

https://github.com/philipturner/metal-usm/blob/23e9f324fd3e4ecb1078cf6b211bd25753de718b/BlitEncoderAlternative/Kernels.metal#L177-L200

@jaeminSon
Contributor

Try narrowing that into a standalone C++ or Swift program. Does the fault still happen?

Shame on me! It should be

vDSP_vsmul(y, 1, (float*) &v, y, 1, n);

No segmentation fault anymore!
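For reference, the parameter order is vDSP_vsmul(A, IA, B, C, IC, N): A is the strided input, B points to the scalar, and C is the strided output. With the arguments swapped, &v became the destination, so the call wrote n floats past a single stack float — hence the crash. A portable stand-in (a hypothetical helper, not part of Accelerate, mirroring the same argument order) makes the convention explicit:

```c
#include <stddef.h>

// Stand-in mirroring vDSP_vsmul's argument order:
// C[i*IC] = A[i*IA] * B[0], for i in [0, N).
static void vsmul_like(const float * A, long IA,
                       const float * B,
                       float * C, long IC, size_t N) {
    for (size_t i = 0; i < N; ++i) {
        C[i * IC] = A[i * IA] * B[0];
    }
}
```

In-place scaling then reads `vsmul_like(y, 1, &v, y, 1, n);`, matching the corrected call above.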

@jaeminSon
Contributor

I ran it several times, but the SIMD version tends to be faster.

hardware: MacBook Pro (Retina, 13-inch, Early 2015), 2.7 GHz dual core Intel Core i5, 8GB 1867 MHz DDR3, Intel Iris Graphics 6100 1536 MB
os: macOS Monterey (v12.6)
gpt-model: Cerebras-GPT-111M

output of GGML_SIMD

this is a tokenization test, but the user is getting the response.
I'm trying to test for a method called response:
         if (user == null)
             return response.getJSON().text()

        response.getJSON().text()

        response.getJSON().text()

This is the code that works. The exception is the method that I use to get the response.
I would appreciate any help, in any event.

A:

You're getting the response.
This is the tokenization test

It's probably the first time you've used it, but you're not exactly sure how to do it.
There are many things you can do to improve the way you are able to work with this code. It's the only way you can change a tokenization test for the model,

main: mem per token =  1712332 bytes
main:     load time =   715.21 ms
main:   sample time =    59.12 ms
main:  predict time =  9944.77 ms / 48.51 ms per token
main:    total time = 12506.17 ms

output using vDSP_vsmul

main: prompt: 'this is a tokenization test'
main: number of tokens in prompt = 6, first 8 tokens: 5661 318 257 11241 1634 1332 

this is a tokenization test with this method and this method has a user-defined tokenizer.
                                                                                                                                                                                         

main: mem per token =  1712332 bytes
main:     load time =   842.32 ms
main:   sample time =    61.45 ms
main:  predict time = 11836.78 ms / 57.74 ms per token
main:    total time = 15593.00 ms

@philipturner
Author

philipturner commented May 26, 2023

Try replacing some other Accelerate calls with vectorized code. Bonus points if you can fuse two elementwise operations of the softmax without writing the elements back to memory in between.

  // NOTE: Softmax is expected to consume the most time, due to the latency of
  // each function call and inability to keep the elements in registers.
  // Consider writing vectorized Swift code for a fairer comparison to GPU.
  
  // Pseudocode for softmax operation:
  // (1) find maximum element in each row
  // (2) subtract the maximum from all elements
  // (3) apply the exponential operator to all elements
  // (4) find the sum of each row
  // (5) divide all elements by the sum
  for i in 0..<UInt(NQ) {
    // The elements to operate on.
    let n = UInt(NKV)
    let row = _QK + Int(i * n)
    
    // (1)
    var maxValue: Float = 0
    vDSP_maxv(row, 1, &maxValue, n)
    assert(maxValue != 0)
    
    // (2)
    maxValue = -maxValue
    vDSP_vsadd(row, 1, &maxValue, row, 1, n)
    
    // (3)
    vvexpf(row, row, &NKV)
    
    // (4)
    var sumValue: Float = 0
    vDSP_sve(row, 1, &sumValue, n)
    
    // (5)
    sumValue = simd_precise_recip(sumValue)
    vDSP_vsmul(row, 1, &sumValue, row, 1, n)
  }

Becomes

  for i in 0..<UInt(NQ) {
    // The elements to operate on.
    let n = UInt(NKV)
    let row = _QK + Int(i * n)
    
    // (1)
    var maxValue: Float = 0
    vDSP_maxv(row, 1, &maxValue, n)
    assert(maxValue != 0)

    // PSEUDOCODE STARTS
    typealias Vector = SIMD16<Float> // Try multiple vector lengths.
    var sumValueVec: Vector = .zero
    for i in 0..<n / Vector.elementCount { // TODO: Handle the last iteration carefully.
       let i_amp = i * Vector.elementCount
       let pointer = (row + i_amp).reinterpret_cast(Vector.self)

       // (2)
       // (3)
       let value = exp(pointer.pointee - maxValue)
       pointer.pointee = value

       // (4)
       sumValueVec += value
    }
    var sumValue: Float = sumValueVec.sum()
    // PSEUDOCODE ENDS
    
    // (5)
    sumValue = simd_precise_recip(sumValue)
    vDSP_vsmul(row, 1, &sumValue, row, 1, n)
  }

@nullhook
Contributor

nullhook commented Jul 5, 2023

Try narrowing that into a standalone C++ or Swift program. Does the fault still happen?

Shame on me! It should be

vDSP_vsmul(y, 1, (float*) &v, y, 1, n);

No segmentation fault anymore!

Why are you casting? It seems redundant.

@philipturner philipturner reopened this Jul 14, 2023
CCLDArjun pushed a commit to CCLDArjun/ggml that referenced this issue Dec 18, 2023
* Add AVX2 version of ggml_vec_dot_q4_1

* Small optimisations to q4_1 dot product (@Const-me)

* Rearrange Q4_1 quantization to work for multipart models. (Fix ggerganov#152)

* Fix ggml_vec_mad_q4_1 too

* Fix non-vectorised q4_1 vec mul