Hello TDA team,
I am Bo Gao, the first author of LSSAR ("Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models", published at ICML 2026 / arXiv:2501.13428).
First of all, congratulations on the acceptance of your paper, TDA, at ACL 2026! I read your paper with great interest. However, I noticed some factual errors regarding LSSAR in your experimental comparison (Table 1 and Table 2) that I would like to clarify.
1. Table 1: "Exact 0: ✗" for LSSAR is incorrect
In Table 1, LSSAR is categorised as not producing exact zeros ("Exact 0: ✗"). This is factually incorrect.
LSSAR decomposes standard softmax attention and, in its second stage (Re-weighting), utilises a Shift-ReLU gate as the core sparsification mechanism:
$$g_{ij} = \mathrm{ReLU}(a_{ij} \cdot N - 1)^p$$
where $a_{ij}$ is the L1-normalised attention weight and $N$ is the sequence length (or patch count).
Because the Shift-ReLU gate applies the standard ReLU activation function to the centred attention weights $a_{ij} \cdot N - 1$, any attention weight that is less than or equal to the uniform distribution baseline (i.e., $a_{ij} \le 1/N$) is set to exactly 0.0. The remaining positive weights are then renormalised. Thus, LSSAR is specifically designed to produce exact floating-point zeros, not near-zero approximations.
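To make the exact-zero behaviour concrete, here is a minimal pure-Python sketch of the gate on a single attention row. It is an illustration rather than the code from our repository: the helper name `shift_relu_reweight` is hypothetical, the renormalisation is assumed to be a plain division by the sum of the surviving gate values, and $p > 0$ is assumed.

```python
def shift_relu_reweight(weights, p=1.0):
    """Apply a Shift-ReLU gate to one row of L1-normalised attention weights.

    Hypothetical helper for illustration only (assumes p > 0).
    """
    n = len(weights)
    # Gate: ReLU(a * N - 1) ** p, which is exactly 0.0 whenever a <= 1/N.
    gated = [max(a * n - 1.0, 0.0) ** p for a in weights]
    total = sum(gated)
    if total == 0.0:
        # Every weight was at or below 1/N (e.g. a uniform row); nothing survives.
        return gated
    # Assumed renormalisation: divide the surviving weights by their sum.
    return [g / total for g in gated]


row = [0.5, 0.3, 0.1, 0.1]  # N = 4, so the uniform baseline is 1/N = 0.25
out = shift_relu_reweight(row, p=1.0)
sparsity = sum(1 for w in out if w == 0.0) / len(out)
```

In this row the two weights of 0.1 fall below $1/N = 0.25$, so the gate maps them to exactly `0.0` and half of the entries are zero.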
The correct entry for LSSAR in Table 1 should be:
| Property | Correct Value |
| --- | --- |
| Exact 0 | ✓ |
| Negative | ✗ |
| No sum-to-1 | Partial (surviving tokens are renormalised) |
| Length-aware | ✓ |
2. Table 2: 0% Sparsity indicates an implementation issue
In Table 2, your benchmark results report LSSAR's sparsity as 0% (fully dense attention), with a validation loss of 3.1676, which is worse than standard Softmax (3.1196).
With a correct implementation of the Shift-ReLU gate, LSSAR typically achieves 50% to 90% attention sparsity depending on the hyperparameter $p$, while matching or exceeding Softmax in language modelling performance.
Reporting 0% sparsity suggests that your implementation of LSSAR either omitted the Shift-ReLU gate entirely (e.g., only replacing the exponential activation with Softplus) or had an implementation bug where the gate never activated, rendering it a dense Softplus attention baseline rather than LSSAR.
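A quick way to check for this failure mode is to count exact zeros before and after the gate. The sketch below is a simplified toy check under stated assumptions: the scores are made up, the helper names are hypothetical, $p = 1$, and any length-dependent scaling is left out. It is not our benchmark code.

```python
import math


def softplus(x):
    # Softplus activation: log(1 + exp(x)); strictly positive for finite x.
    return math.log1p(math.exp(x))


def l1_normalise(values):
    total = sum(values)
    return [v / total for v in values]


def exact_zero_fraction(weights):
    # Fraction of entries that are exactly 0.0 (not merely close to zero).
    return sum(1 for w in weights if w == 0.0) / len(weights)


scores = [2.0, 1.0, -1.0, -3.0]  # made-up attention logits for one query

# Softplus-only attention: every weight stays strictly positive (dense).
dense = l1_normalise([softplus(s) for s in scores])

# Adding the Shift-ReLU gate with p = 1 zeroes every weight at or below 1/N.
n = len(dense)
gated = [max(a * n - 1.0, 0.0) for a in dense]

dense_sparsity = exact_zero_fraction(dense)
gated_sparsity = exact_zero_fraction(gated)
```

Here `dense_sparsity` is 0.0 while `gated_sparsity` is 0.5, so a measured sparsity of 0% is what one would expect if the gate were missing or never applied.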
Request for Correction
Could you please:
- Correct Table 1 in future revisions/errata to show that LSSAR produces exact zeros (✓).
- Re-run LSSAR using a correct implementation (our official open-source repository is available at: https://github.com/iminfine/freeattn) or amend the paper/benchmark description to clarify this baseline's implementation details.
Thank you for your time and contribution to the community!
Best regards,
Bo Gao