Adding a simple program to measure speed of dot products #1041
I was surprised by the belief that for the dot product `x * y`, where `x` holds quantized model weights and `y` contains floating point values, it is faster to quantize `y` and to perform the dot product using the quantized `y` values (accepting the associated loss in precision) than to just directly compute `x * y`. So, I had to try it myself.

This PR adds a simple program that measures the time it takes to perform the dot product between vectors holding `1 << 18` values. I picked a relatively large vector size to avoid getting involved in the science of accurately measuring elapsed time for short-lasting operations.

Basically, we fill two vectors `x` and `y` with random values and quantize `x` into `q`. We then measure the time for
1. `d = q * y` directly
2. `y' = quantize(y); d = q * y'`
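The two measured paths can be sketched in scalar form. The block layout below is a simplified stand-in for `ggml`'s `Q4_0` format (one float scale plus 32 4-bit values per block), and `quantize`, `dot_direct`, and `dot_quantized` are illustrative names, not the actual `ggml` API:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int QK = 32;  // values per quantization block

// Toy Q4_0-like block: a float scale plus 4-bit values stored with a +8 offset.
struct BlockQ4 {
    float   d;            // per-block scale
    uint8_t qs[QK / 2];   // two 4-bit values per byte
};

// Quantize n floats (n must be a multiple of QK) into 4-bit blocks.
static void quantize(const float* x, BlockQ4* out, int n) {
    for (int i = 0; i < n / QK; ++i) {
        const float* xb = x + i * QK;
        float amax = 0.f;
        for (int j = 0; j < QK; ++j) amax = std::max(amax, std::fabs(xb[j]));
        const float d  = amax / 7.f;
        const float id = d > 0.f ? 1.f / d : 0.f;
        out[i].d = d;
        for (int j = 0; j < QK / 2; ++j) {
            const int v0 = std::min(15, std::max(0, int(xb[2*j + 0] * id) + 8));
            const int v1 = std::min(15, std::max(0, int(xb[2*j + 1] * id) + 8));
            out[i].qs[j] = uint8_t(v0 | (v1 << 4));
        }
    }
}

// Path 1: multiply the quantized x directly with the float y.
static double dot_direct(const BlockQ4* q, const float* y, int n) {
    double sum = 0;
    for (int i = 0; i < n / QK; ++i) {
        const float* yb = y + i * QK;
        for (int j = 0; j < QK / 2; ++j) {
            const int v0 = (q[i].qs[j] & 0xf) - 8;
            const int v1 = (q[i].qs[j] >>  4) - 8;
            sum += double(q[i].d) * (v0 * yb[2*j] + v1 * yb[2*j + 1]);
        }
    }
    return sum;
}

// Path 2: quantize y first, then do an integer dot product per block.
static double dot_quantized(const BlockQ4* q, const float* y, BlockQ4* qy, int n) {
    quantize(y, qy, n);
    double sum = 0;
    for (int i = 0; i < n / QK; ++i) {
        int isum = 0;
        for (int j = 0; j < QK / 2; ++j) {
            isum += ((q[i].qs[j] & 0xf) - 8) * ((qy[i].qs[j] & 0xf) - 8)
                  + ((q[i].qs[j] >>  4) - 8) * ((qy[i].qs[j] >>  4) - 8);
        }
        sum += double(q[i].d) * qy[i].d * isum;
    }
    return sum;
}
```

Wrapping repeated calls to each path in `std::chrono::steady_clock` timestamps for vectors of `1 << 18` elements then gives per-dot-product timings of the kind quoted below.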
For 2. we use the vectorized (SIMD-ified) functions from `ggml` (or, if requested by a command line argument, the corresponding scalar functions from `ggml`).

On my Mac, 1. is faster than 2. (~55 us vs ~75 us). On the `x86_64` CPU that I have available (Ryzen 7950X), 1. is somewhat slower than the `AVX2` implementation of 2. (~50 us vs ~35 us).

On both CPUs, the direct product 1. as implemented in the `dot()` function in this POC is much faster than the scalar version of 2. from `ggml` (~15X faster on the Ryzen 7950X and ~6X faster on the Mac). I think that with some `ARM_NEON` or `AVX2` magic one should be able to further speed up 1.

To use it,
`make -j` and then e.g. `./vdot 100` to measure 100 dot products with the SIMD-ified `ggml` functions, or `./vdot 100 1` to measure the scalar `ggml` functions instead.

Added a comparison for `Q4_1` quantization. Here, the direct product 1. is faster than 2. for `ARM_NEON` and `AVX2`. On my Mac I get ~69 us for 1. and ~121 us for 2. On the Ryzen 7950X I measured ~60 us for 1. and ~62 us for 2. In any case, implemented as in this POC, the dot product of `Q4_1` quantized values is only marginally slower (~20%) than `Q4_0`.
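For `Q4_1`, each block carries an offset (minimum) in addition to the scale, and the direct product only needs one extra accumulator per block. A scalar sketch, again with an illustrative block layout and function name rather than `ggml`'s actual `block_q4_1`:

```cpp
#include <cassert>
#include <cstdint>

constexpr int QK1 = 32;  // values per quantization block

// Toy Q4_1-like block: scale d and offset m per 32 values.
struct BlockQ41 {
    float   d;             // per-block scale
    float   m;             // per-block offset (minimum)
    uint8_t qs[QK1 / 2];   // 4-bit values in [0, 15]
};

// Direct product of Q4_1-quantized x with float y. Per block:
//   sum_j (d*q_j + m) * y_j  =  d * sum_j q_j*y_j  +  m * sum_j y_j
// so the extra work relative to Q4_0 is one more accumulator (sum of y_j)
// and one extra multiply-add per block.
static double dot_q41_direct(const BlockQ41* q, const float* y, int n) {
    double sum = 0;
    for (int i = 0; i < n / QK1; ++i) {
        float sqy = 0;  // sum of q_j * y_j
        float sy  = 0;  // sum of y_j
        const float* yb = y + i * QK1;
        for (int j = 0; j < QK1 / 2; ++j) {
            const int v0 = q[i].qs[j] & 0xf;
            const int v1 = q[i].qs[j] >> 4;
            sqy += v0 * yb[2*j] + v1 * yb[2*j + 1];
            sy  += yb[2*j] + yb[2*j + 1];
        }
        sum += double(q[i].d) * sqy + double(q[i].m) * sy;
    }
    return sum;
}
```

The `m * sum(y_j)` term is the only addition relative to the `Q4_0` path, which fits the observation that the `Q4_1` direct product is only marginally slower.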