The attention mechanism has been a groundbreaking innovation in deep learning and forms the backbone of Transformer models, which power state-of-the-art language models such as GPT-4 and LLaMA. However, there is a persistent off-by-one bug in the traditional attention mechanism that can make models harder to compress and deploy.
Introducing Quiet Attention, a small tweak to the traditional softmax function that allows an attention head to express 'no preference' and remain quiet. A slight adjustment to the denominator lets the output vector tend to zero when the head has nothing to add, rather than forcing it to assign attention weight somewhere.
The idea comes from a paper by Evan Miller, "Attention Is Off By One".
Here's the modified softmax, also referred to as "Softmax1" or the "Quiet Attention" formula:
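softmax_one(x)_i = exp(x_i) / (1 + Σ_j exp(x_j))

The only change from the standard softmax is the extra 1 in the denominator; this is exactly what the softmax_one implementation later in this README computes.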
The critical difference between Softmax1 and the traditional softmax lies in their behavior in the negative limit. When every entry of the input vector is significantly less than zero and the model would rather contribute nothing at all, softmax_one allows the output weights to shrink toward zero, whereas softmax still forces them to sum to one.
Softmax1 essentially provides an 'escape hatch' for an attention head that wants to remain quiet. The total output weight of Softmax1 varies with the input vector, whereas softmax always emits a total weight of exactly one. This can help the model, especially when dealing with noisy inputs where a head has nothing useful to attend to.
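As a quick illustration (a minimal sketch, not part of the repository's code), compare the total weight each function emits on a strongly negative input:

import torch
import torch.nn.functional as F
from softmax_one.softmax_one import softmax_one

# every logit is strongly negative, i.e. the head has nothing it wants to attend to
x = torch.full((5,), -10.0)

print(F.softmax(x, dim=0).sum())    # tensor(1.) - the standard softmax must hand out all of its weight
print(softmax_one(x, dim=0).sum())  # close to 0 - softmax_one is allowed to stay quiet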
Clone the repository:
git clone https://github.com/kyegomez/AttentionIsOFFByOne.git
cd AttentionIsOFFByOne
This repository contains unit tests that aim to cover the main scenarios and ensure the reliability of the implementation. You can run the tests using the following command:
python -m unittest test.py
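For reference, here is the kind of property such tests can check (a hypothetical sketch, not the repository's actual test.py); the second check relies on the fact that the formula above equals a standard softmax over the input with one extra zero logit appended:

import unittest
import torch
import torch.nn.functional as F
from softmax_one.softmax_one import softmax_one

class TestSoftmaxOne(unittest.TestCase):
    def test_total_weight_is_at_most_one(self):
        x = torch.randn(8, 16)
        total = softmax_one(x, dim=-1).sum(dim=-1)
        self.assertTrue(torch.all(total <= 1.0 + 1e-6))

    def test_matches_softmax_with_padded_zero_logit(self):
        x = torch.randn(4, 10)
        padded = torch.cat([x, torch.zeros(4, 1)], dim=-1)
        expected = F.softmax(padded, dim=-1)[:, :-1]
        torch.testing.assert_close(softmax_one(x, dim=-1), expected)

if __name__ == "__main__":
    unittest.main()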
A benchmarking suite is included to compare the performance of the softmax_one function with PyTorch's native softmax function. We provide metrics across different tensor sizes to understand how they perform under varying loads.
To run the benchmarks, use the following command:
python benchmark.py
You can find the results in the benchmarks/results/ directory. The results include execution time and memory usage for each function across a variety of tensor sizes.
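If you just want a quick ad-hoc comparison, a minimal timing loop looks roughly like this (a sketch, not the repository's benchmark.py):

import time
import torch
import torch.nn.functional as F
from softmax_one.softmax_one import softmax_one

for size in [(10, 10), (100, 100), (1000, 1000)]:
    x = torch.randn(*size)
    for name, fn in [("F.softmax", F.softmax), ("softmax_one", softmax_one)]:
        start = time.perf_counter()
        for _ in range(100):
            fn(x, dim=-1)
        elapsed = time.perf_counter() - start
        print(f"{name} on {size}: {elapsed:.6f} s")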
You can use the Softmax1 function just like you would use the traditional softmax function. Here's a simple example:
import torch
from softmax_one.softmax_one import softmax_one
x = torch.randn(5)
y = softmax_one(x, dim=0)
# Define the softmax_one function: a softmax with an extra one added to the denominator,
# which lets the output weights sum to less than one and fade toward zero when every logit is very negative
def softmax_one(x, dim=None, _stacklevel=3, dtype=None):
    if dim is None:
        dim = -1  # default to the last dimension
    # subtract max(x, 0) for numerical stability; clamping at zero keeps the result
    # equal to exp(x) / (1 + sum(exp(x))) instead of renormalizing the shifted values
    m = x.max(dim=dim, keepdim=True).values.clamp(min=0)
    # compute exponentials of the shifted input
    exp_x = torch.exp(x - m)
    # exp(-m) is the numerically stable form of the extra "+1" in the denominator
    return exp_x / (torch.exp(-m) + exp_x.sum(dim=dim, keepdim=True))
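To show where this slots into a model, here is a minimal scaled dot-product attention sketch with softmax_one in place of the usual softmax (an illustration, not code from this repository):

import math
import torch
from softmax_one.softmax_one import softmax_one

def quiet_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = softmax_one(scores, dim=-1)  # each row may sum to less than 1
    return weights @ v

q = k = v = torch.randn(2, 8, 64)
out = quiet_attention(q, k, v)  # shape (2, 8, 64)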
Contributions are welcome! Please submit a pull request or create an issue if you have any improvements or find any bugs.
This project is licensed under the MIT License - see the LICENSE file for details.
Note: it's really slow in plain Python, so I will implement it in CUDA.
INFO:root:Running benchmark for tensor size (10, 10)...
INFO:root:F.softmax time: 0.0022182464599609375 s
INFO:root:softmax_one time: 0.04441571235656738 s
INFO:root:Running benchmark for tensor size (100, 100)...
INFO:root:F.softmax time: 0.01704573631286621 s
INFO:root:softmax_one time: 0.07482171058654785 s
INFO:root:Running benchmark for tensor size (1000, 1000)...
INFO:root:F.softmax time: 0.060335397720336914 s
INFO:root:softmax_one time: 3.0616047382354736 s
INFO:root:Running benchmark for tensor size (10000, 10000)...
INFO:root:F.softmax time: 52.80402970314026 s
INFO:root:softmax_one time: 128.78072810173035 s
INFO:root:Chart display is off.
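In the meantime, one way to recover near-native speed without a custom kernel (an observation about the formula above, not something this repository currently ships) is to note that softmax_one is exactly a standard softmax over the input with one extra zero logit appended:

import torch
import torch.nn.functional as F

def softmax_one_via_padding(x, dim=-1):
    # hypothetical helper: append a zero logit along `dim`, run the fused native softmax,
    # then drop the padded slot; the surviving entries equal exp(x_i) / (1 + sum_j exp(x_j))
    dim = dim % x.dim()  # normalize negative dims
    pad_shape = list(x.shape)
    pad_shape[dim] = 1
    padded = torch.cat([x, x.new_zeros(pad_shape)], dim=dim)
    return F.softmax(padded, dim=dim).narrow(dim, 0, x.shape[dim])

The output should match softmax_one up to floating-point error while benefiting from PyTorch's optimized softmax kernel.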