Adds int4 quantization support to EinsumDense #21471
base: master
Conversation
Summary of Changes
Hello @JyotinderSingh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces comprehensive support for 4-bit integer (int4) quantization within the `EinsumDense` layer in Keras. This enhancement aims to improve memory efficiency for models by enabling the storage and computation of kernel weights in a reduced precision format. The changes span the entire lifecycle of the layer, from its initial build and quantization to its forward pass, integration with LoRA for fine-tuning, and proper serialization for saving and loading models. The update also includes significant refactoring and documentation improvements to ensure maintainability and clarity of the new quantization logic.
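To make the memory claim concrete, here is some back-of-envelope arithmetic (illustrative only, using a hypothetical 4096x4096 kernel; the actual savings measured for this PR are in the benchmark at the bottom):

```python
# Illustrative storage arithmetic for a single (4096, 4096) kernel.
d = 4096
float32_bytes = d * d * 4       # 67,108,864 bytes (64 MiB)
int8_bytes = d * d              # 16,777,216 bytes (16 MiB)
int4_packed_bytes = d * d // 2  # 8,388,608 bytes (8 MiB): two int4 per int8
print(float32_bytes // int4_packed_bytes)  # 8x smaller, plus small scale tensors
```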
Highlights
- Int4 Quantization Support: The `EinsumDense` layer now fully supports 4-bit integer (int4) quantization for its kernel weights. This includes the necessary build-time setup, variable management, and runtime operations for int4 data types (see the usage sketch after this list).
- LoRA Integration with Int4: Low-Rank Adaptation (LoRA) functionality has been enhanced to correctly integrate with `int4`-quantized kernels. This involves precise handling of packed kernel shapes when creating LoRA matrices and during the merging process, ensuring LoRA updates are applied in the full-precision domain before requantization.
- Comprehensive Quantization Lifecycle: The entire lifecycle of the `EinsumDense` layer has been updated to accommodate `int4` quantization. This covers the `build` process, the `quantize` method, the `_int4_call` forward pass, and serialization/deserialization (`save_own_variables`, `load_own_variables`, `_get_kernel_with_merged_lora`), ensuring consistent behavior across all stages.
- Code Clarity and Documentation: Key internal methods, such as `_get_kernel_with_merged_lora` in `dense.py` and several `_analyze_*` helper functions in `einsum_dense.py`, have been refactored and/or received detailed docstrings. This improves the readability and maintainability of the quantization and einsum analysis logic.
- Expanded Test Coverage: Existing unit tests for quantization and LoRA have been significantly expanded to include `int4`-specific test cases. This ensures the robustness and correctness of the new `int4` functionality and its seamless integration with other features.
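As a quick illustration of the workflow these highlights describe, here is a minimal usage sketch. The `"int4"` mode string follows this PR; the einsum equation and shapes are arbitrary examples, not taken from the PR:

```python
import numpy as np
from keras import layers

# An EinsumDense layer of the kind used for attention projections.
layer = layers.EinsumDense("abc,cde->abde", output_shape=(None, 8, 32))
layer.build((None, 16, 64))

# Quantize the kernel to int4 (packed two-per-byte into int8 storage),
# then enable LoRA; the adapter matrices use the unpacked kernel shape.
layer.quantize("int4")
layer.enable_lora(rank=4)

y = layer(np.random.rand(2, 16, 64).astype("float32"))
print(y.shape)  # (2, 16, 8, 32)
```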
Code Review
This pull request introduces int4 quantization support to the `EinsumDense` layer. The changes include modifications to the build process, forward pass, LoRA integration, and serialization logic. The tests have been updated to cover both int8 and int4 modes. A docstring inaccuracy was identified in the `_int4_call` method and should be addressed.
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

```
@@ Coverage Diff @@
## master    #21471      +/-   ##
==========================================
- Coverage   82.80%   77.23%   -5.58%
==========================================
  Files         565      565
  Lines       55505    55620     +115
  Branches     8662     8684      +22
==========================================
- Hits        45962    42957    -3005
- Misses       7429    10598    +3169
+ Partials     2114     2065      -49
```

View full report in Codecov by Sentry.
This PR adds `int4` quantization support to the `EinsumDense` layer, and integrates it with LoRA.

Description

- `int4` quantization for `EinsumDense`:
  - `_int4_build` method to create packed `int4` kernels (stored as `int8`) and associated scale variables.
  - `_int4_call` forward pass that unpacks the kernel on-the-fly and uses a custom gradient for backpropagation (see the pack/unpack sketch after this list).
- LoRA on `int4` layers: LoRA can be enabled on `int4`-quantized `EinsumDense` layers. The `enable_lora` method correctly infers the original, unpacked kernel shape, ensuring the LoRA matrices are created with the correct dimensions to operate in the full-precision space.
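To illustrate what "packed `int4` kernels (stored as `int8`)" and the on-the-fly unpacking mean, here is a NumPy sketch. The nibble layout (even-indexed value in the low nibble, odd-indexed in the high nibble) is an assumption for illustration and may differ from the PR's actual layout:

```python
import numpy as np

def pack_int4(values):
    # values: 1-D int8 array of signed int4 values in [-8, 7], even length.
    # Low nibble stores the even-indexed value, high nibble the odd-indexed
    # one (layout chosen for illustration; the PR's layout may differ).
    low = (values[0::2] & 0x0F).astype(np.uint8)
    high = (values[1::2] & 0x0F).astype(np.uint8)
    return ((high << 4) | low).view(np.int8)  # int8 storage, as in the PR

def unpack_int4(packed):
    # Inverse of pack_int4: recover the signed int4 values, sign-extending
    # each 4-bit two's-complement nibble back to int8.
    bits = packed.view(np.uint8)
    low = (bits & 0x0F).astype(np.int8)
    high = (bits >> 4).astype(np.int8)
    low = np.where(low > 7, low - 16, low).astype(np.int8)
    high = np.where(high > 7, high - 16, high).astype(np.int8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = low, high
    return out

w = np.array([-8, 7, 3, -1, 0, 5], dtype=np.int8)
assert (unpack_int4(pack_int4(w)) == w).all()
```

In the real layer the unpacked integer kernel is then multiplied by the per-output-channel scale to recover an approximation of the float kernel before the einsum.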
Implementation Choices

For `EinsumDense`, the `int4` packing is performed along the first kernel reduction axis (`_kernel_reduced_axes[0]`). This axis is chosen because it is analogous to the input feature dimension in a standard `Dense` layer and is therefore likely to be relatively large. We can consider making this customizable in the future.
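To make the shape bookkeeping concrete, here is a hypothetical helper (for illustration only, not the PR's actual code) showing how the stored kernel shape follows from packing along the first reduced axis:

```python
import math

def packed_kernel_shape(kernel_shape, pack_axis=0):
    # Two int4 values share one int8 byte along pack_axis, so that axis
    # shrinks to ceil(dim / 2); all other axes keep their size.
    # (Hypothetical helper for illustration only.)
    shape = list(kernel_shape)
    shape[pack_axis] = math.ceil(shape[pack_axis] / 2)
    return tuple(shape)

# For the equation "abc,cde->abde" the kernel is "cde" and the reduced
# axis c comes first, so a (64, 8, 32) kernel packs to (32, 8, 32).
print(packed_kernel_shape((64, 8, 32), pack_axis=0))  # (32, 8, 32)
```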
Benchmark

Text-Generation Micro-Benchmark with llama3: colab notebook
