
Add quantize-stats command for testing quantization #728

Merged: 6 commits merged into ggerganov:master on Apr 7, 2023

Conversation

unbounded (Contributor)

Adds a quantize-stats binary that calculates some statistics over the errors introduced by quantization of a given model.
At the moment it shows mean square error and max error for layer weights, as well as a quantization error histogram. Should be useful for testing quantization improvements without having to do a full perplexity run.
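For context, this is roughly the kind of per-tensor accumulation involved; a minimal sketch only, with illustrative struct layout, bucket width, and names rather than the actual code:

```c
#include <math.h>
#include <stddef.h>
#include <stdio.h>

// Sketch of the statistics reported for one quantization format: mean squared
// error, max absolute error, and a histogram of absolute errors between the
// original f32 weights and their quantize->dequantize round-trip.
// Bucket width/count are illustrative, not necessarily the tool's exact values.
#define NUM_BUCKETS  30
#define BUCKET_WIDTH 0.001

struct error_stats {
    double sum_sq_error;
    double max_error;
    size_t num_samples;
    size_t histogram[NUM_BUCKETS];
};

static void update_error_stats(const float * orig, const float * reconstructed, size_t n, struct error_stats * st) {
    for (size_t i = 0; i < n; i++) {
        double err = fabs((double) orig[i] - (double) reconstructed[i]);
        st->sum_sq_error += err * err;
        if (err > st->max_error) {
            st->max_error = err;
        }
        size_t bucket = (size_t) (err / BUCKET_WIDTH);
        if (bucket >= NUM_BUCKETS) {
            bucket = NUM_BUCKETS - 1;  // last bucket is open-ended, e.g. [0.029, inf)
        }
        st->histogram[bucket]++;
    }
    st->num_samples += n;
}

static void print_error_stats(const char * name, const struct error_stats * st) {
    printf("%-50s: mse %.8f, maxerr %.8f\n", name, st->sum_sq_error / (double) st->num_samples, st->max_error);
}
```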

Needs some internal state from ggml and llama that should not be part of the public API, so I moved that out to internal headers. Not the prettiest solution, but it could be useful for other tests as well.

Simple example - a short summary of quantization format errors for all layers except the norm layers.

$ ./quantize-stats --model models/7B/ggml-model-f16.bin --exclude-layer norm
Loading model
llama_model_load: loading model from 'models/7B/ggml-model-f16.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 256
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 1
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 12853.45 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 14645.53 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from 'models/7B/ggml-model-f16.bin'
llama_model_load: model size = 12853.02 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB
note: source model is f16
testing 226 layers with max size 131072000, allocating 1572864000 bytes
q4_0                                              : mse 0.00000492, maxerr 0.14257812
q4_1                                              : mse 0.00000318, maxerr 0.12756348

main:    total time = 130642.79 ms

Another example - quicker test on a single layer, with detailed output:

$ ./quantize-stats -m models/7B/ggml-model-f16.bin --exclude-layer norm --include-layer 25 --type q4_0 --per-layer-stats --histogram
Loading model
llama_model_load: loading model from 'models/7B/ggml-model-f16.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 256
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 1
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 12853.45 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 14645.53 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from 'models/7B/ggml-model-f16.bin'
llama_model_load: model size = 12853.02 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB
note: source model is f16
testing 7 layers with max size 45088768, allocating 541065216 bytes
q4_0::layers.25.attention.wk.weight               : mse 0.00000593, maxerr 0.02381897
q4_0::layers.25.attention.wo.weight               : mse 0.00000501, maxerr 0.05418178
q4_0::layers.25.attention.wq.weight               : mse 0.00000580, maxerr 0.04802595
q4_0::layers.25.attention.wv.weight               : mse 0.00000513, maxerr 0.00928388
q4_0::layers.25.feed_forward.w1.weight            : mse 0.00000522, maxerr 0.02363586
q4_0::layers.25.feed_forward.w2.weight            : mse 0.00000476, maxerr 0.04512678
q4_0::layers.25.feed_forward.w3.weight            : mse 0.00000484, maxerr 0.02100045
q4_0                                              : mse 0.00000512, maxerr 0.05418178
Error distribution:
[0.000, 0.001):    59323206
[0.001, 0.002):    52871092
[0.002, 0.003):    49542959
[0.003, 0.004):    29974833
[0.004, 0.005):     8719625
[0.005, 0.006):     1609993
[0.006, 0.007):      257769
[0.007, 0.008):       46969
[0.008, 0.009):       12567
[0.009, 0.010):        5462
[0.010, 0.011):        2979
[0.011, 0.012):        2006
[0.012, 0.013):        1411
[0.013, 0.014):        1012
[0.014, 0.015):         781
[0.015, 0.016):         559
[0.016, 0.017):         382
[0.017, 0.018):         292
[0.018, 0.019):         243
[0.019, 0.020):         180
[0.020, 0.021):         129
[0.021, 0.022):         100
[0.022, 0.023):          79
[0.023, 0.024):          83
[0.024, 0.025):          66
[0.025, 0.026):          53
[0.026, 0.027):          39
[0.027, 0.028):          46
[0.028, 0.029):          40
[0.029, inf):         213

main:    total time =  3309.60 ms

Command that calculates some statistics over the errors introduced by
quantization, at the moment mean square error and max error for layer
weights. Should be useful for testing quantization improvements.

Needs some internal state from ggml and llama that should not be part of
the public API.
sw (Contributor) commented Apr 3, 2023

There's a relevant Python snippet by @prusnak here, but I'm not quite sure how he chose the scaling of the random samples.

Show some error percentiles, should be less noisy than just the max error.
prusnak (Collaborator) commented Apr 3, 2023

> I'm not quite sure how he chose the scaling of the random samples.

It was chosen at random, so we should disregard it anyway. This PR tests on real data, right?

prusnak (Collaborator) commented Apr 3, 2023

Does it make sense to also show a histogram of the bins? Or is showing the histogram of the error distribution enough for analysis?

sw (Contributor) commented Apr 3, 2023

I'm running out of memory here; is there an easy way this could be made to run on 16GB?

I should get around to buying some more...

Do you think this could be done more like quantize.cpp, or even as an optional mode of that existing example program, to do batch processing without loading the entire model at once?

sw (Contributor) commented Apr 3, 2023

Here's a quick & dirty hack to llama.cpp so that the statistics will be collected in the existing quantize.cpp: sw@cec7cb9

Simply omit the output file parameter:

$ ./quantize models/7B/ggml-model-f16.bin 2
usage: ./quantize model-f32.bin model-quant.bin type
Assuming you just want statistics
...
q4_0                                              : mse 0.00000492, maxerr 0.14257812

If the implementation could be made cleaner, I would prefer this, as it doesn't load the entire model in memory. That doesn't mean I'm against this PR as it is.

We might also take the opportunity to cut down on quantize.cpp's verbosity in normal operation. Most people would probably like to see a simple progress indicator.

prusnak (Collaborator) commented Apr 3, 2023

That's a good point. It didn't occur to me earlier that this PR loads the whole model into memory. If that's the case, the functionality should indeed be rolled into quantize.cpp, which does not do this.

unbounded (Contributor, Author)

Thank you for your comments.

I think it is a good idea to show some more statistics in quantize. I like having a separate utility for myself when testing quantization, because I can do a test run against a single layer in a few seconds without quantizing the entire model. But there is a question of whether that is widely useful.

I thought about showing the weight histogram, but I expect people would rather investigate the model in Python or similar. This should be focused on the needs specific to this project, like testing a new quantization implementation.

I don't think loading the entire model into memory should be much of an issue because of the mmap changes; it should evict the model from memory again as needed.
I see that llama preallocates some buffers for each model when it loads them. I can't set vocab_only because then it won't load the tensors, but maybe I could add a similar parameter that promises not to evaluate anything.
I did unnecessarily allocate enough scratch space to quantize a whole layer at once instead of quantizing smaller chunks; I will fix that.
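For illustration, processing each tensor in fixed-size chunks keeps scratch memory bounded by the chunk size rather than by the largest layer; a rough sketch along those lines, where quantize_chunk and dequantize_chunk are hypothetical stand-ins for the real ggml functions and error_stats reuses the earlier sketch:

```c
#include <stdlib.h>

// Hypothetical chunk-wise driver: round-trips one tensor through a
// quantize/dequantize pair CHUNK elements at a time, so scratch memory is
// bounded by CHUNK rather than by the largest layer.
typedef void (*quantize_chunk_fn)(const float * src, void * dst, int n);
typedef void (*dequantize_chunk_fn)(const void * src, float * dst, int n);

static void test_tensor_chunked(const float * data, size_t n_elements,
                                quantize_chunk_fn quantize_chunk,
                                dequantize_chunk_fn dequantize_chunk,
                                struct error_stats * stats) {
    enum { CHUNK = 32 * 1024 };                        // elements per chunk (illustrative)
    void  * quantized     = malloc(CHUNK);             // assumes <= 1 byte per element when quantized
    float * reconstructed = malloc(CHUNK * sizeof(float));

    for (size_t offset = 0; offset < n_elements; offset += CHUNK) {
        int n = (int) (n_elements - offset < CHUNK ? n_elements - offset : CHUNK);
        quantize_chunk(data + offset, quantized, n);
        dequantize_chunk(quantized, reconstructed, n);
        update_error_stats(data + offset, reconstructed, (size_t) n, stats);
    }

    free(quantized);
    free(reconstructed);
}
```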

Test quantization in smaller chunks instead of layer-at-a-time.
Show RMSE instead of MSE - keeps similar range to the other metrics.
Regex match on layer pattern.
sw (Contributor) commented Apr 4, 2023

@unbounded: with the latest changes I'm now able to run it with 16GB, thanks for that. The resident size still grows to over 12GB, though; it would be nice if this could be made smaller, but I don't consider it a deal-breaker.

You might add quantize-stats to the default: Makefile target so it gets built along with everything else.

This PR is somewhat incompatible with #729 because ggml_internal_get_quantize_fn gives you the SIMD-optimized quantize_row_q4_0, not quantize_row_q4_0_reference. We could add another member to the quantize_fns_t struct for that.
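For illustration, the idea amounts to carrying a second function pointer in the per-format table so tools can pick the scalar reference path; the typedef and member names below are a sketch, not the exact ggml definitions:

```c
// Sketch of extending the per-format function table with the scalar reference
// quantizer alongside the SIMD-optimized one. Names are illustrative only.
typedef void (*quantize_row_q_t)(const float * x, void * y, int k);
typedef void (*dequantize_row_q_t)(const void * x, float * y, int k);

typedef struct {
    dequantize_row_q_t dequantize_row_q;
    quantize_row_q_t   quantize_row_q;
    quantize_row_q_t   quantize_row_q_reference;  // proposed: bit-exact scalar path for testing
} quantize_fns_t;
```

quantize-stats could then compare a new SIMD implementation against the reference output by selecting quantize_row_q_reference.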

> I think it is a good idea to show some more statistics in quantize.

If we merge this PR, then I'd say we should afterwards move the verbose output from quantize to quantize-stats. Many people using quantize just care about the output files, not about having a lot of stuff printed.

Expose reference quantization implementation and add option to use it
for tests.
ggerganov (Owner) left a comment


Expose the internal stuff through the regular .h files.
Just add comments that this is for debugging / testing purposes and that one should not rely on these functions in their projects.

Move into main header with comment not to use, per PR feedback
unbounded (Contributor, Author)

Removed the internal header files. The mixing of C and C++ definitions makes it a bit messy but I can't really think of a good way to make it cleaner.

sw (Contributor) commented Apr 6, 2023

> The mixing of C and C++ definitions makes it a bit messy but I can't really think of a good way to make it cleaner.

restrict may not be necessary in the function typedefs such as dequantize_row_q_t. I think you can just have it in the function definition, or is that not valid C? It's only used so the compiler can optimize the code inside the function; it should not have any effect outside the function.
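For example, leaving restrict out of the typedef while keeping it on the definition is valid C, since parameter qualifiers are not part of a function's type; a small illustrative sketch:

```c
// A function pointer type without restrict ...
typedef void (*dequantize_row_fn)(const void * x, float * y, int k);

// ... can point at a definition that uses restrict internally. The qualifier
// only lets the compiler assume x and y don't alias inside the body; it does
// not change the function's type, so the assignment below is valid C.
static void dequantize_row_example(const void * restrict x, float * restrict y, int k) {
    const unsigned char * q = (const unsigned char *) x;
    for (int i = 0; i < k; i++) {
        y[i] = (float) q[i];  // placeholder "dequantization"
    }
}

static dequantize_row_fn example_fn = dequantize_row_example;
```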

ggerganov (Owner)

@unbounded
Merge this if you are ready. It's not very clean, but it is important to have this tool now while we try to improve the quantization.

@unbounded unbounded merged commit 62cfc54 into ggerganov:master Apr 7, 2023