PTQ for generate_v2 #1866

Open
wants to merge 16 commits into main

Conversation

joecummings
Contributor

@joecummings joecummings commented Oct 18, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

This PR adds post-training quantization support for generate_v2 via torchao. It has only been tested on text models, specifically Llama2.

Why did you change the way quantization APIs are called?
Good catch - instead of creating a Quantizer class and having that quantize the model, I opted to instantiate a quantization method and apply it with torchao's quantize_ API. I did this for two reasons:

  1. Simplifies our recipe and codebase.
  2. It is more consistent with the usage that torchao seems to be pushing. We want the UX to be the same whether someone is quantizing a model here or directly with the torchao APIs (see the sketch below).
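For illustration, a minimal sketch of the quantize_ flow, assuming torchao v0.6+ on a CUDA device; the toy module below is a stand-in for the instantiated torchtune model, not the recipe code:

import torch
import torch.nn as nn
from torchao.quantization.quant_api import quantize_, int4_weight_only

# Any nn.Module works here; the recipe passes the instantiated torchtune model.
model = nn.Sequential(nn.Linear(1024, 1024, bias=False)).to(torch.bfloat16).cuda()

# Instantiate a quantization method and apply it to the model in place.
quantize_(model, int4_weight_only(use_hqq=False))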

Does this work for vision models?
Technically, it runs, but we haven't fixed the torch.compile graph breaks in the Llama3.2V model, so it doesn't speed anything up. Therefore, I will not be including this in the default config for Llama3.2V.

Why is it actually slower for the entire first run?
My assumption is that compile is the culprit here. Once everything has run once, the model compilation is pulled from the compile cache and things are actually faster. Still, quantized generation like this is typically better for longer responses, where the benefit is really clear. cc @andrewor14 to confirm whether my intuition is correct here.

This DOES NOT work for PTQ of a QAT model. That will be added in a follow-up.

Changelog

  • Implement PTQ in generate_v2
  • Clean up some of the variables in generate_v2 to make them public
  • Add additional timing to split the first token from the rest of the tokens
  • Update llama2/generation_v2 to support quantization
  • Add a GPU test for quantized generation :)

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

All testing done with torchao v0.6.1 and torch 2.5.1

Recipe without PTQ:

(joe-torchtune-2) [jrcummings@devvm050.nha0 ~/projects/joe-torchtune (add-quantize-generate-v2)]$ tune run dev/generate_v2 --config llama2/generation_v2
Running InferenceRecipe with resolved config:

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-2-7b-chat-hf
  checkpoint_files:
  - pytorch_model-00001-of-00002.bin
  - pytorch_model-00002-of-00002.bin
  model_type: LLAMA2
  output_dir: ./
device: cuda
dtype: bf16
log_level: INFO
max_new_tokens: 500
model:
  _component_: torchtune.models.llama2.llama2_7b
prompt:
  system: You are a helpful and creative AI assistant.
  user: What is the capital of France?
seed: 1234
temperature: 0.6
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  max_seq_len: 2048
  path: /tmp/Llama-2-7b-chat-hf/tokenizer.model
top_k: 300

Model was initialized with precision torch.bfloat16.
Time to generate first token: 0.45 sec

 Oh, how delightful! *adjusts glasses* The capital of France is... *drumroll* Paris! 🇫🇷 Yes, the City of Light, the City of Love, the City of Art, and the City of Delicious Croissants. 🥐 Is there anything else I can help you with? 😊

Time for inference: 4.93 sec total, 17.04 tokens/sec
Bandwidth achieved: 235.60 GB/s
Max memory allocated: 13.95 GB

Recipe with PTQ (first run):

(joe-torchtune-2) [jrcummings@devvm050.nha0 ~/projects/joe-torchtune (add-quantize-generate-v2)]$ tune run dev/generate_v2 --config llama2/generation_v2
Running InferenceRecipe with resolved config:

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-2-7b-chat-hf
  checkpoint_files:
  - pytorch_model-00001-of-00002.bin
  - pytorch_model-00002-of-00002.bin
  model_type: LLAMA2
  output_dir: ./
device: cuda
dtype: bf16
log_level: INFO
max_new_tokens: 500
model:
  _component_: torchtune.models.llama2.llama2_7b
prompt:
  system: You are a helpful and creative AI assistant.
  user: What is the capital of France?
quantization_method:
  _component_: torchao.quantization.quant_api.int4_weight_only
  use_hqq: false
seed: 1234
temperature: 0.6
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  max_seq_len: 2048
  path: /tmp/Llama-2-7b-chat-hf/tokenizer.model
top_k: 300

Model was initialized with precision torch.bfloat16.
Compiling model layers with torch.compile...
Time to generate first token: 18.98 sec

 Ah, a question that is both simple and profound! *adjusts glasses* The capital of France, my dear human, is none other than the venerable city of Paris! 🇫🇷

But let me tell you more about this magnificent city, for it is a place of wonder and awe. Paris is home to some of the most iconic landmarks in the world, such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. The city is also renowned for its exquisite cuisine, its vibrant art scene, and its unparalleled fashion.

And did you know that Paris is the City of Light? *winks* It is here that some of the greatest minds in history have come to seek inspiration and knowledge. From the likes of Victor Hugo to Emile Zola, and from Claude Monet to Pierre-Auguste Renoir, the City of Paris has been the birthplace of countless artistic masterpieces.

So there you have it, my dear human! The capital of France is none other than the enchanting city of Paris, a place that will capture your heart and imagination like no other. 💖

Time for inference: 27.66 sec total, 9.84 tokens/sec
Bandwidth achieved: 136.00 GB/s
Max memory allocated: 13.95 GB

Recipe with PTQ (second run):

(joe-torchtune-2) [jrcummings@devvm050.nha0 ~/projects/joe-torchtune (add-quantize-generate-v2)]$ tune run dev/generate_v2 --config llama2/generation_v2
Running InferenceRecipe with resolved config:

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-2-7b-chat-hf
  checkpoint_files:
  - pytorch_model-00001-of-00002.bin
  - pytorch_model-00002-of-00002.bin
  model_type: LLAMA2
  output_dir: ./
device: cuda
dtype: bf16
log_level: INFO
max_new_tokens: 500
model:
  _component_: torchtune.models.llama2.llama2_7b
prompt:
  system: You are a helpful and creative AI assistant.
  user: What is the capital of France?
quantization_method:
  _component_: torchao.quantization.quant_api.int4_weight_only
  use_hqq: false
seed: 1234
temperature: 0.6
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  max_seq_len: 2048
  path: /tmp/Llama-2-7b-chat-hf/tokenizer.model
top_k: 300

Model was initialized with precision torch.bfloat16.
Compiling model layers with torch.compile...
Time to generate first token: 4.56 sec

 Ah, a question that is both simple and profound! *adjusts glasses* The capital of France, my dear human, is none other than the venerable city of Paris! 🇫🇷

But let me tell you more about this magnificent city, for it is a place of wonder and awe. Paris is home to some of the most iconic landmarks in the world, such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. The city is also renowned for its exquisite cuisine, its vibrant art scene, and its unparalleled fashion.

And did you know that Paris is the City of Light? *winks* It is here that some of the greatest minds in history have come to seek inspiration and knowledge. From the likes of Victor Hugo to Emile Zola, and from Claude Monet to Pierre-Auguste Renoir, the City of Paris has been the birthplace of countless artistic masterpieces.

So there you have it, my dear human! The capital of France is none other than the enchanting city of Paris, a place that will capture your heart and imagination like no other. 💖

Time for inference: 11.92 sec total, 22.82 tokens/sec
Bandwidth achieved: 315.49 GB/s
Max memory allocated: 13.95 GB

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

To-do

Fix failing GPU test. It's passing locally, so I'm not sure how to make it work on the remote runners:

(joe-torchtune-2) [jrcummings@devvm050.nha0 ~/projects/joe-torchtune (add-quantize-generate-v2)]$ python -m pytest tests/recipes/dev/test_generate_v2.py::TestGenerateV2::test_llama2_generate_with_quantization --with-integration
Expected artifacts for test run are:
small-ckpt-tune-03082024.pt
small-ckpt-meta-03082024.pt
small-ckpt-hf-03082024.pt
small-ckpt-tune-llama3-05052024.pt
small-ckpt-hf-reward-07122024.pt
tokenizer.model
tokenizer_llama3.model
File already exists locally: /tmp/test-artifacts/small-ckpt-tune-03082024.pt
File already exists locally: /tmp/test-artifacts/small-ckpt-meta-03082024.pt
File already exists locally: /tmp/test-artifacts/small-ckpt-hf-03082024.pt
File already exists locally: /tmp/test-artifacts/small-ckpt-tune-llama3-05052024.pt
File already exists locally: /tmp/test-artifacts/small-ckpt-hf-reward-07122024.pt
File already exists locally: /tmp/test-artifacts/tokenizer.model
File already exists locally: /tmp/test-artifacts/tokenizer_llama3.model
================================================================================================================ test session starts ================================================================================================================
platform linux -- Python 3.11.9, pytest-7.4.0, pluggy-1.5.0
rootdir: /home/jrcummings/projects/joe-torchtune
configfile: pyproject.toml
plugins: integration-0.2.3, mock-3.14.0, cov-5.0.0
collected 1 item

tests/recipes/dev/test_generate_v2.py .                                                                                                                                                                                                       [100%]

================================================================================================================ 1 passed in 42.70s =================================================================================================================


pytorch-bot bot commented Oct 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1866

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Cancelled Jobs

As of commit fc501ee with merge base 33b8143:

NEW FAILURE - The following job has failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 18, 2024
self._device = utils.get_device(device=cfg.device)
self._dtype = training.get_dtype(dtype=cfg.dtype, device=self._device)
self._logger = utils.get_logger(cfg.log_level)
self.device = utils.get_device(device=cfg.device)
Contributor Author

It's a public recipe, no need to be a "private" variable.

cc @pbontrager


# Quantize the model if specified
if cfg.get("quantization_method") is not None:
    from torchao.quantization.quant_api import quantize_
Contributor Author

Lazily import the torchao API.

    from torchao.quantization.quant_api import quantize_

    quantization_method = config.instantiate(cfg.quantization_method)
    compile_model(model)
Contributor Author

Compiling the model is necessary for quantization to be really worth it

Collaborator

I'm curious whether compiling the whole model results in greater speedups than compiling just the next-token-prediction fn, like gpt-fast does.

Contributor

We should actually compile after quantize_ to get the speedup.

Contributor Author

Interesting! I was following the pattern from AO's README where the model is compiled first:

model = torchao.autoquant(torch.compile(model, mode='max-autotune'))

Why should the model be compiled after quantization?

Contributor Author

@jerryzh168 Anecdotally, I don't see much difference in tok/sec (after first token) between putting compile first or second. Can you share some more details about which one is correct?

Contributor

Oh, right now quantize_ needs torch.compile to be applied after quantization: https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#full-affine-quantization-flow-example

But autoquant does compile first, before calling autoquant.
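To make the two orderings concrete, a hedged sketch (assuming torchao's quantize_, int4_weight_only, and autoquant APIs; these are alternatives, pick one flow or the other):

import torch
import torchao
from torchao.quantization.quant_api import quantize_, int4_weight_only

def quantize_then_compile(model):
    # quantize_ flow: quantize the weights in place first, then compile.
    quantize_(model, int4_weight_only())
    return torch.compile(model, mode="max-autotune")

def compile_then_autoquant(model):
    # autoquant flow: compile first, then autoquant picks kernels per layer.
    return torchao.autoquant(torch.compile(model, mode="max-autotune"))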

Contributor

mode='max-autotune' will take a long time to compile. Is it worth it? We don't do it for training.

It's interesting that AO's README says to put compile first. Do we also do it for QLoRA?

Contributor

I don't see much difference in tok/sec (after first token) between putting compile first or second.

I haven't actually tried calling quantize_ after compile; maybe it would have the same effect as well. Need to confirm.


# 6. Prefill step
generated_tokens = []
t0 = time.perf_counter()
logits = self.model(prompt, **batch)[:, -1]
token = sample(logits, temperature=cfg.temperature, top_k=cfg.top_k)
t1 = time.perf_counter()
Contributor Author

@joecummings joecummings Oct 18, 2024

Now that we might have a warmup (compile) run, we log this differently so the user can see how much quantization / compilation helps (a sketch of the split is below).
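A hypothetical sketch of that split; the prefill/decode helpers, prompt, and logger are illustrative placeholders, not the recipe's actual code:

import time

t0 = time.perf_counter()
first_token = prefill(prompt)         # assumed: one forward pass + sample
t1 = time.perf_counter()
rest_of_tokens = decode(first_token)  # assumed: autoregressive decode loop
t2 = time.perf_counter()

total = t2 - t0
logger.info(f"Time to generate first token: {t1 - t0:.2f} sec")
logger.info(
    f"Time for inference: {total:.2f} sec total, "
    f"{(1 + len(rest_of_tokens)) / total:.2f} tokens/sec"
)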

@@ -9,6 +9,10 @@
# Model arguments
model:
  _component_: torchtune.models.llama2.llama2_7b
# You can uncomment the following lines to enable quantization for faster inference and potentially lower VRAM
Contributor Author

Leave this commented out until the user wants to do something with it.

@joecummings joecummings linked an issue Oct 18, 2024 that may be closed by this pull request
@joecummings joecummings changed the title [WIP] Quantization for generate_v2 [WIP] PTQ for generate_v2 Oct 18, 2024
prompt = torch.tensor(
    model_inputs["tokens"], device=self._device
).unsqueeze(0)
prompt = torch.tensor(model_inputs["tokens"], device=self.device)[None, :]
Contributor Author

@joecummings joecummings Oct 18, 2024

I wanted this to fit on one line lol

@@ -18,6 +19,13 @@
CACHE_ARTIFACTS_SCRIPT_PATH = root + "/tests/cache_artifacts.sh"


def pytest_sessionfinish():
Contributor Author

Compile tries to log a bunch of stuff using atexit handlers. However, pytest closes these log streams before they finish, so it throws an I/O error.

This disables logging exceptions (a sketch of the idea is below). Not sure if it's the right way to do it.
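A minimal sketch of such a hook, assuming the fix boils down to flipping logging.raiseExceptions at session finish (my reading of the comment, not necessarily the exact diff):

import logging

def pytest_sessionfinish():
    # Stop the logging module from raising handler errors, e.g. torch.compile's
    # atexit logging writing to streams that pytest has already closed.
    logging.raiseExceptions = False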

@joecummings joecummings marked this pull request as ready for review October 26, 2024 14:45
@joecummings joecummings changed the title [WIP] PTQ for generate_v2 PTQ for generate_v2 Oct 26, 2024
# Generation arguments
prompt:
  system: You are a helpful and creative AI assistant.
  user: What is the capital of France?
max_new_tokens: 200
max_new_tokens: 500
Contributor Author

Allow longer generation to really see the benefit of quant + compile.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 30.43478% with 16 lines in your changes missing coverage. Please review.

Project coverage is 25.92%. Comparing base (23c8829) to head (0575b67).
Report is 1 commit behind head on main.

Files with missing lines               Patch %   Lines
tests/recipes/dev/test_generate_v2.py  25.00%    15 Missing ⚠️
tests/conftest.py                      66.66%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1866       +/-   ##
===========================================
- Coverage   70.44%   25.92%   -44.53%     
===========================================
  Files         308      308               
  Lines       16270    16292       +22     
===========================================
- Hits        11462     4224     -7238     
- Misses       4808    12068     +7260     


@joecummings
Contributor Author

@felipemello1 @ebsmothers Will this not pass on PyTorch 2.5 b/c of the issue with CUDNN? This test passes locally on PyTorch v2.5.1.

Do we know when the patch will be released?

@@ -9,6 +9,10 @@
# Model arguments
model:
  _component_: torchtune.models.llama2.llama2_7b
# You can uncomment the following lines to enable quantization for faster inference and potentially lower VRAM
# quantization_method:
# _component_: torchao.quantization.quant_api.int4_weight_only # int4_weight_only is a good balance of speed and memory
Collaborator

Dumb q: so the torchtune.training.quantization API is just for QAT... or are we not using it anymore?

Collaborator

I see you mentioned this in the PR description - if we're going to be using the torchao APIs instead it'd be good to follow up with an issue

@SalmanMohammadi
Collaborator

SalmanMohammadi commented Oct 28, 2024

This looks sensible overall, but I have a few outstanding questions:

  • What implications does this have for how we expose quantization APIs?
  • What is going on with compile?
  • Why is memory usage identical for non-PTQ and PTQ? I guess because we still peak when loading the weights in bf16, and we're measuring global max memory usage?
  • Why is it so slow? Even the second run of PTQ takes 12s vs 5s for non-PTQ - the bump in toks/s doesn't seem to offset whatever else is slowing it down
  • Noob q: given the above two points - max memory usage is identical and it takes longer... when would someone want to use this?

We probably don't need to answer all of these here but I think it'd help bring a lot of our quantization offerings in line if we can at least follow up on them.

@joecummings
Contributor Author

  • What implications does this have for how we expose quantization APIs?

I think the question is actually whether we want to support PTQ APIs outside of torchao. If we do, we may want to opt for an approach like Hugging Face's, wherein a config for a specific backend can be initialized. I'd argue that we probably don't want to, b/c torchao already supports general quant, HQQ, and GPTQ (although GPTQ is not available through the quantize_ API yet). I don't know if this is too short-sighted, though.

  • What is going on with compile?

Not sure I understand the question. It's always slow during the warmup run.

  • Why is memory usage identical for non-PTQ, and PTQ? I guess because we're still peaking when we load weights in bf16, and we're measuring global max memory usage?

Exactly.

  • Why is it so slow? Even the second run of PTQ takes 12s vs 5s for non-PTQ - the bump in toks/s doesn't seem to offset whatever else is slowing it down

Not sure what is so slow, but I've reached out to the AO team to see if this is normal.

  • Noob q: given the above two points - max memory usage is identical and it takes longer... when would someone want to use this?

An excellent question. I don't imagine anyone would want to use this recipe out of the box with quantization. However, it's a great playground for showing how easy it is to set up quantization with our models. The real benefit comes from serving this model somewhere so that you can compile + quant once and get continuous speed-ups for everything downstream. Also, if we end up having a super simple chat component, this would also demonstrate gains.

@andrewor14
Contributor

Are you seeing the slowdown for int4_weight_only specifically? That's surprising since we have an efficient tinygemm CUDA kernel for that, and the model size should actually be 1/4 of the original bf16 model size (unlike int8_dynamic_activation_int4_weight). Also cc @jerryzh168 @HDCharles, who did some benchmarking on this from the AO side
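As a rough back-of-envelope check of that 1/4 figure (my own arithmetic, ignoring group-wise scales and zero points):

params = 7e9                   # approx. Llama2-7B parameter count
bf16_gb = params * 2 / 1e9     # 2 bytes per weight in bf16    -> ~14.0 GB
int4_gb = params * 0.5 / 1e9   # 4 bits (0.5 bytes) per weight -> ~3.5 GB
print(int4_gb / bf16_gb)       # 0.25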

# You can uncomment the following lines to enable quantization for faster inference and potentially lower VRAM
# quantization_method:
# _component_: torchao.quantization.quant_api.int4_weight_only # int4_weight_only is a good balance of speed and memory
# use_hqq: False # Turn on for more accurate results
Contributor

@HDCharles is this true?

Contributor Author

Yeah, sorry, this was anecdotal.

@joecummings
Contributor Author

Are you seeing the slowdown for int4_weight_only specifically?

I tried both int4_weight_only and the dynamic-activation version, and both had initial slowdowns for the entire first run but ran faster afterwards.

@jerryzh168
Contributor

Are you seeing the slowdown for int4_weight_only specifically?

I tried both int4_weight_only and the dynamic-activation version, and both had initial slowdowns for the entire first run but ran faster afterwards.

Slower on the first run is expected, I feel, since compilation actually happens on the first run, when it sees the real inputs. Typically, when we benchmark, there are some warmup runs so compile actually runs, and then we benchmark the following runs.
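For context, the warmup-then-measure pattern described here might look like the sketch below (illustrative only, not torchao's or torchtune's benchmarking code; generate_fn stands in for a full generation call):

import time
import torch

def benchmark(generate_fn, n_warmup=1, n_iters=3):
    # Warmup iterations trigger torch.compile on the real input shapes.
    for _ in range(n_warmup):
        generate_fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n_iters):
        generate_fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / n_iters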

@joecummings
Contributor Author

Are you seeing the slowdown for int4_weight_only specifically?

I tried both int4_weight_only and the dynamic-activation version, and both had initial slowdowns for the entire first run but ran faster afterwards.

Slower on the first run is expected, I feel, since compilation actually happens on the first run, when it sees the real inputs. Typically, when we benchmark, there are some warmup runs so compile actually runs, and then we benchmark the following runs.

I know that compile happens at the first forward pass, but what I'm seeing is a slowdown for the entire first generation of outputs (see the logs in the PR description). Is this expected?

# You can uncomment the following lines to enable quantization for faster inference and potentially lower VRAM
# quantization_method:
# _component_: torchao.quantization.quant_api.int4_weight_only # int4_weight_only is a good balance of speed and memory
# use_hqq: False # Turn on to use Half-Quadratic Quantization
Contributor

What does it mean? Can you add whether it makes things faster / more accurate / use less memory?

Contributor Author

Contributor

Sorry, what I meant is that this should be made clear for the user in the comment :P

Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement quantized model inference for generate_v2
7 participants