Adding a simple program to measure speed of dot products #1041

Merged
merged 2 commits into from
Apr 18, 2023

ikawrakow
Contributor

@ikawrakow ikawrakow commented Apr 18, 2023

I was surprised by the belief that, for the dot product x * y where x holds quantized model weights and y contains floating-point values, it is faster to quantize y and perform the dot product on the quantized y values (accepting the associated loss in precision) than to compute x * y directly. So I had to try it myself. This PR adds a simple program that measures the time it takes to perform the dot product between vectors holding 1 << 18 values. I picked a relatively large vector size to avoid getting involved with the science of accurately measuring elapsed time for short-lasting operations.

Basically, we fill two vectors x and y with random values and quantize x into q. We then measure the time for

  1. Computing d = q * y directly
  2. Computing y' = quantize(y); d = q * y'

For 2. we use the vectorized (SIMD-ified) functions from ggml (or, if requested by a command line argument, the corresponding scalar functions from ggml).
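
To make the two variants concrete, here is a minimal pure-Python sketch. It is not the code from this PR or from ggml: the block size `QK = 32`, the scale `amax / 7`, the `+8` offset, and the function names `quantize_q4_0`, `dot_direct` and `dot_quantized` are all assumptions chosen to illustrate a simplified Q4_0-style scheme.

```python
QK = 32  # assumed number of values per quantization block


def quantize_q4_0(x):
    """Quantize x into blocks of (scale, 4-bit codes); simplified Q4_0-style scheme."""
    blocks = []
    for i in range(0, len(x), QK):
        chunk = x[i:i + QK]
        amax = max(abs(v) for v in chunk)
        d = amax / 7.0                  # scale: values map to integers in [-7, 7]
        inv = 1.0 / d if d else 0.0     # all-zero block: every code becomes 8
        q = [round(v * inv) + 8 for v in chunk]  # unsigned codes in [1, 15]
        blocks.append((d, q))
    return blocks


def dot_direct(blocks, y):
    """Variant 1: multiply the quantized weights with float y, without quantizing y."""
    total = 0.0
    for b, (d, q) in enumerate(blocks):
        ys = y[b * QK:(b + 1) * QK]
        total += d * sum((qi - 8) * yi for qi, yi in zip(q, ys))
    return total


def dot_quantized(blocks, y):
    """Variant 2: quantize y first, then take an integer dot product per block."""
    total = 0.0
    for (dx, qx), (dy, qy) in zip(blocks, quantize_q4_0(y)):
        isum = sum((a - 8) * (b - 8) for a, b in zip(qx, qy))
        total += dx * dy * isum
    return total
```

Variant 1 only carries the quantization error of x, while variant 2 adds the error from quantizing y and the cost of that extra quantization pass. Which one is faster in practice depends on how well each inner loop vectorizes.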

On my Mac, 1. is faster than 2 (~55 us vs ~75 us). On the x86_64 CPU that I have available (Ryzen 7950X), 1. is somewhat slower compared to the AVX2 implementation (~50 us vs ~35 us).

On both CPUs, the direct product 1. as implemented in the dot() function in this POC is much faster than the scalar version of 2. from ggml (~15X faster on the Ryzen 7950X and ~6X faster on the Mac).

I think that with some ARM_NEON or AVX2 magic one should be able to further speed up 1.

To use it, make -j and then e.g. ./vdot 100 to measure 100 dot products with the SIMD-ified ggml functions, or ./vdot 100 1 to measure the scalar ggml functions instead.
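
For readers who want to reproduce this kind of measurement elsewhere, a rough sketch of a timing loop that reports a mean, a spread and a maximum per call is given below. The helper name `time_stats` is hypothetical, and treating the spread as a population standard deviation is an assumption, not something confirmed to match what vdot prints.

```python
import math
import time


def time_stats(fn, n_iter):
    """Call fn n_iter times; return (mean, stddev, max) of the elapsed time in microseconds."""
    total = total_sq = max_t = 0.0
    for _ in range(n_iter):
        t0 = time.perf_counter()
        fn()
        t = (time.perf_counter() - t0) * 1e6  # seconds -> microseconds
        total += t
        total_sq += t * t
        max_t = max(max_t, t)
    mean = total / n_iter
    # Population variance; clamp at zero to guard against rounding error.
    var = max(total_sq / n_iter - mean * mean, 0.0)
    return mean, math.sqrt(var), max_t
```

For example, `time_stats(lambda: sum(range(1000)), 100)` returns the three numbers for 100 runs of a small summation. Reporting the maximum next to the mean helps spot outliers caused by scheduling or cache effects.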

Added a comparison for Q4_1 quantization. Here, the direct product 1. is faster than 2. for ARM_NEON and AVX2. On my Mac I get ~69 us for 1 and ~121 us for 2. On the Ryzen 7950X I measured ~60 us for 1. and ~62 us for 2. In any case, implemented as in this POC, the dot product of Q4_1 quantized values is only marginally slower (~20%) than Q4_0.
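
As an illustration of what the direct product looks like for a scale-plus-minimum format, here is a small sketch. The layout (blocks of `QK = 32`, scale `d = (max - min) / 15`, offset `m = min`) and the names `quantize_q4_1` and `dot_direct_q4_1` are assumptions describing a simplified Q4_1-style scheme, not the code in this PR.

```python
QK = 32  # assumed number of values per quantization block


def quantize_q4_1(x):
    """Quantize x into blocks of (scale, minimum, 4-bit codes); simplified Q4_1-style scheme."""
    blocks = []
    for i in range(0, len(x), QK):
        chunk = x[i:i + QK]
        lo, hi = min(chunk), max(chunk)
        d = (hi - lo) / 15.0            # 16 levels spread over [lo, hi]
        inv = 1.0 / d if d else 0.0     # constant block: every code becomes 0
        q = [min(15, round((v - lo) * inv)) for v in chunk]
        blocks.append((d, lo, q))
    return blocks


def dot_direct_q4_1(blocks, y):
    """Direct product: sum((d*q + m) * y) = d * sum(q*y) + m * sum(y) per block."""
    total = 0.0
    for b, (d, m, q) in enumerate(blocks):
        ys = y[b * QK:(b + 1) * QK]
        total += d * sum(qi * yi for qi, yi in zip(q, ys)) + m * sum(ys)
    return total
```

Compared with a scale-only format, each block needs one additional `m * sum(y)` term. That small amount of extra work is consistent with a modest slowdown, though it is not necessarily the only cause of the difference measured here.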

@ikawrakow ikawrakow requested a review from ggerganov April 18, 2023 14:41
@sw
Contributor

sw commented Apr 18, 2023

On my Core i3-8100 (AVX2):

$ ./vdot 100
<dot> = -74.2272, -73.9193
time = 128.407 +/- 4.38483 us. maxt = 150.281 us
timeq = 106.679 +/- 4.16175 us. maxt = 126.484 us

Please consider putting it in examples/benchmark instead of creating a new folder.

On my Mac, the direct Q4_1 product is marginally slower
(~69 vs ~55 us for Q4_0). The SIMD-ified ggml version
is now almost 2X slower (~121 us).

On a Ryzen 7950X CPU, the direct product for Q4_1 quantization
is faster than the AVX2 implementation (~60 vs ~62 us).
@ikawrakow
Contributor Author

@sw Thank you for the measurement. Yes, of course, I can move to examples. My thinking was that this is a POC, so it is better to have a folder for POCs for this (and possibly future POCs) before some of these POCs become "examples".

Member

@ggerganov ggerganov left a comment

Either adding poc or moving to examples is fine. Just leave it as it is if you haven't started moving it

I'll try to plug this approach into the actual mul mat routines and see how it performs.

P.S. Squash your commits on merge

@sw
Contributor

sw commented Apr 18, 2023

Yes, examples is maybe not a great name, but it already contains various bits and pieces like your program.

@sw sw merged commit 5ecff35 into master Apr 18, 2023
@sw sw deleted the poc-dot-product branch April 18, 2023 19:00