Eval hf models using lm_eval #2179
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2179
Note: Links to docs will display an error until the docs builds have been completed.
❌ 7 New Failures as of commit e14c186 with merge base 7854249. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
torchao/_models/llama/eval_hf.py
Outdated
# print("Response:", output_text[0][len(prompt):])

def model_throughput(
nit: maybe you can add a memory measurement as well
Done
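For illustration, a minimal sketch of what a combined throughput-plus-memory measurement could look like, assuming a CUDA device; the function name mirrors `model_throughput` from the PR, but the signature and body here are hypothetical:

```python
import time

import torch


def model_throughput(model, tokenizer, prompt, max_new_tokens=128):
    # Hypothetical sketch; the actual eval_hf.py helper may differ.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    torch.cuda.reset_peak_memory_stats()  # start peak-memory tracking fresh
    torch.cuda.synchronize()
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    peak_mem_gb = torch.cuda.max_memory_allocated() / 1e9  # the memory measurement
    return new_tokens / elapsed, peak_mem_gb
```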
## Eval on Llama 3.1 8B and Llama 3.2 3B

We use lm-eval tasks for evaluating TorchAO Quantization APIs on HuggingFace models. The results are in the table below:
would specify what task this is for.
| Llama 3.1 8B | int8dq               | 60.01 | 78.82 | 7.45 | 8.03 |
| Llama 3.1 8B | float8wo             | 59.83 | 78.61 | 7.37 | 8.03 |
| Llama 3.1 8B | float8dq (PerRow)    | 59.86 | 78.57 | 7.41 | 8.04 |
| Llama 3.1 8B | float8dq (PerTensor) | 59.95 | 78.66 | 7.42 | 8.03 |
per tensor is more accurate than per row quantization?
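For context, a minimal sketch of how an lm-eval run over a Hugging Face model can be driven from Python; the model id and task names below are placeholders, while `HFLM` and `simple_evaluate` are the lm-eval harness APIs:

```python
import lm_eval
from lm_eval.models.huggingface import HFLM

# Wrap a (possibly quantized) HF model for the harness; id is a placeholder.
lm = HFLM(pretrained="meta-llama/Llama-3.1-8B", batch_size=8)

# Run the chosen tasks; results["results"] holds the per-task metrics.
results = lm_eval.simple_evaluate(model=lm, tasks=["hellaswag", "winogrande"])
print(results["results"])
```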
@@ -291,6 +292,15 @@ def string_to_config(
        else:
            granularity = PerTensor()
        return Float8DynamicActivationFloat8WeightConfig(granularity=granularity)
    if "gemlitewo" in quantization:
gemlite can be 4 bit or 8 bit, should probably specify that this is for 4 bit
https://github.com/pytorch/ao/blob/main/torchao/quantization/quant_api.py#L968
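One hedged way to make the bit width explicit, assuming a `gemlitewo-<bits>` naming convention for the quantization string; the suffix convention and the `GemliteUIntXWeightOnlyConfig` usage here are assumptions, not the PR's actual code:

```python
if "gemlitewo" in quantization:
    # Parse an optional "-<bits>" suffix, e.g. "gemlitewo-4" or "gemlitewo-8",
    # defaulting to 4 bit. Suffix convention and config name are assumptions.
    suffix = quantization.split("-")[-1]
    bit_width = int(suffix) if suffix.isdigit() else 4
    return GemliteUIntXWeightOnlyConfig(bit_width=bit_width)
```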
looks good aside from those small comments
Using the Hugging Face + torchao integration, we can directly quantize Hugging Face models with torchao quantization techniques. This PR adds a flow that quantizes a Hugging Face model through transformers' TorchAoConfig using torchao's quantization techniques, then evaluates the quantized model on different lm-eval tasks.
Eval results for Llama 3.1 8B and Llama 3.2 3B are added to torchao/_models/README.md.
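A minimal sketch of that end-to-end flow, assuming transformers' TorchAoConfig integration and the lm-eval harness; the model id, quantization type, and task are placeholders:

```python
import lm_eval
import torch
from lm_eval.models.huggingface import HFLM
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "meta-llama/Llama-3.1-8B"  # placeholder model id

# Quantize at load time by passing a torchao config through transformers.
quant_config = TorchAoConfig("int8_weight_only")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Evaluate the quantized model on lm-eval tasks.
results = lm_eval.simple_evaluate(
    model=HFLM(pretrained=model, tokenizer=tokenizer),
    tasks=["hellaswag"],  # placeholder task
)
print(results["results"])
```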