Eval hf models using lm_eval #2179
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2179
Note: Links to docs will display an error until the docs builds have been completed.
❌ 7 New Failures as of commit e14c186 with merge base 7854249. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
torchao/_models/llama/eval_hf.py
Outdated
# print("Response:", output_text[0][len(prompt):])

def model_throughput(
nit: maybe you can add a memory measurement as well
Done
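For illustration, a minimal sketch of what a combined throughput-plus-memory measurement could look like, assuming a CUDA device; the function name mirrors `model_throughput` from the PR, but the signature and body here are hypothetical:

```python
import time

import torch


def model_throughput(model, tokenizer, prompt, max_new_tokens=128):
    # Hypothetical sketch; the actual eval_hf.py helper may differ.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    torch.cuda.reset_peak_memory_stats()  # start peak-memory tracking fresh
    torch.cuda.synchronize()
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    peak_mem_gb = torch.cuda.max_memory_allocated() / 1e9  # the memory measurement
    return new_tokens / elapsed, peak_mem_gb
```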
## Eval on Llama 3.1 8B and Llama 3.2 3B

We use lm-eval tasks for evaluating TorchAO Quantization APIs on HuggingFace models. The results are in the table below:
would specify what task this is for.
| Llama 3.1 8B | int8dq               | 60.01 | 78.82 | 7.45 | 8.03 |
| Llama 3.1 8B | float8wo             | 59.83 | 78.61 | 7.37 | 8.03 |
| Llama 3.1 8B | float8dq (PerRow)    | 59.86 | 78.57 | 7.41 | 8.04 |
| Llama 3.1 8B | float8dq (PerTensor) | 59.95 | 78.66 | 7.42 | 8.03 |
per tensor is more accurate than per row quantization?
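For context, a minimal sketch of how an lm-eval run over a Hugging Face model can be driven from Python; the model id and task names below are placeholders, while `HFLM` and `simple_evaluate` are the lm-eval harness APIs:

```python
import lm_eval
from lm_eval.models.huggingface import HFLM

# Wrap a (possibly quantized) HF model for the harness; id is a placeholder.
lm = HFLM(pretrained="meta-llama/Llama-3.1-8B", batch_size=8)

# Run the chosen tasks; results["results"] holds the per-task metrics.
results = lm_eval.simple_evaluate(model=lm, tasks=["hellaswag", "winogrande"])
print(results["results"])
```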
@@ -291,6 +292,15 @@ def string_to_config(
        else:
            granularity = PerTensor()
        return Float8DynamicActivationFloat8WeightConfig(granularity=granularity)
    if "gemlitewo" in quantization:
gemlite can be 4 bit or 8 bit, should probably specify that this is for 4 bit
https://github.com/pytorch/ao/blob/main/torchao/quantization/quant_api.py#L968
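One hedged way to make the bit width explicit, assuming a `gemlitewo-<bits>` naming convention for the quantization string; the suffix convention and the `GemliteUIntXWeightOnlyConfig` usage here are assumptions, not the PR's actual code:

```python
if "gemlitewo" in quantization:
    # Parse an optional "-<bits>" suffix, e.g. "gemlitewo-4" or "gemlitewo-8",
    # defaulting to 4 bit. Suffix convention and config name are assumptions.
    suffix = quantization.split("-")[-1]
    bit_width = int(suffix) if suffix.isdigit() else 4
    return GemliteUIntXWeightOnlyConfig(bit_width=bit_width)
```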
looks good aside from those small comments
Using the Hugging Face + torchao integration, we can directly quantize Hugging Face models with torchao quantization techniques. This PR adds a flow that quantizes a Hugging Face model through transformers' TorchAoConfig using torchao's quantization techniques, then evaluates the quantized model on different lm-eval tasks.
Eval results for Llama 3.1 8B and Llama 3.2 3B are added to torchao/_models/README.md.
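A minimal sketch of that end-to-end flow, assuming transformers' TorchAoConfig integration and the lm-eval harness; the model id, quantization type, and task are placeholders:

```python
import lm_eval
import torch
from lm_eval.models.huggingface import HFLM
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "meta-llama/Llama-3.1-8B"  # placeholder model id

# Quantize at load time by passing a torchao config through transformers.
quant_config = TorchAoConfig("int8_weight_only")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Evaluate the quantized model on lm-eval tasks.
results = lm_eval.simple_evaluate(
    model=HFLM(pretrained=model, tokenizer=tokenizer),
    tasks=["hellaswag"],  # placeholder task
)
print(results["results"])
```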