
Conversation

hegemanjw4amd commented:

This PR adds a fully general Int4-AWQ dequantization function implemented in torch, along with environment flags for selecting between the torch and Triton codepaths.
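
For context, a minimal torch-only sketch of Int4-AWQ dequantization, assuming the usual AWQ layout (eight 4-bit values packed per int32 in interleaved nibble order, with group-wise scales and zero points). Names, shapes, and the helper itself are illustrative, not the PR's actual code:

```python
import torch

AWQ_PACK_FACTOR = 8
# Undoes AWQ's interleaved nibble packing order [0, 2, 4, 6, 1, 3, 5, 7].
AWQ_REVERSE_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]

def awq_dequantize_torch(qweight: torch.Tensor,   # (K, N // 8) int32, packed weights
                         scales: torch.Tensor,    # (K // group_size, N) fp16
                         qzeros: torch.Tensor,    # (K // group_size, N // 8) int32, packed zeros
                         group_size: int) -> torch.Tensor:
    shifts = torch.arange(0, 32, 4, device=qweight.device)

    # Unpack nibbles: (K, N // 8) -> (K, N // 8, 8) -> (K, N).
    iweights = ((qweight[:, :, None] >> shifts) & 0xF).view(qweight.shape[0], -1)
    izeros = ((qzeros[:, :, None] >> shifts) & 0xF).view(qzeros.shape[0], -1)

    # Restore the natural column order within each packed group of eight.
    order = torch.arange(iweights.shape[-1], device=qweight.device)
    order = order.view(-1, AWQ_PACK_FACTOR)[:, AWQ_REVERSE_ORDER].reshape(-1)
    iweights = iweights[:, order]
    izeros = izeros[:, order]

    # Expand per-group scales/zeros along K, then dequantize.
    scales = scales.repeat_interleave(group_size, dim=0)
    izeros = izeros.repeat_interleave(group_size, dim=0)
    return (iweights - izeros).to(scales.dtype) * scales
```

The Triton path expresses the same math as a kernel; the environment flags added here select between the two implementations.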

Testing: Two HuggingFace models quantized in Int4-AWQ format have been run successfully:
- Qwen2-7B-Instruct-AWQ (latency benchmarking; see the example command below)
- Phi-3-mini-4k-instruct-AWQ (input verification)

For the latter model, specific input prompts were supplied and the outputs examined as a sanity check for correctness.
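
For reference, latency benchmarking in a vLLM-based tree is typically invoked along the lines of `python benchmarks/benchmark_latency.py --model Qwen/Qwen2-7B-Instruct-AWQ --quantization awq`; this command is illustrative only, as the PR does not record the exact benchmark invocation.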

Unit testing is accomplished via tests/kernels/test_awq_triton.py.
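The suite can be run directly with `pytest tests/kernels/test_awq_triton.py`.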

Resolves: https://github.com/ROCm/FasterTransformer-Internal/issues/287

hegemanjw4amd force-pushed the hegeman/basic-sdpa-attention-int4-awq-interim branch 3 times, most recently from 0b78568 to dd9a148 on August 21, 2024 at 10:47.
Collaborator @shajrawi left a comment:


ship it

hegemanjw4amd force-pushed the hegeman/basic-sdpa-attention-int4-awq-interim branch from dd9a148 to d4332ec on August 21, 2024 at 16:15.
hegemanjw4amd merged commit 4e9830e into main on August 21, 2024; 13 checks passed.
gshtras deleted the hegeman/basic-sdpa-attention-int4-awq-interim branch on September 10, 2024.