Clarification on LSSAR implementation and properties in Table 1 and Table 2

Hello TDA team,

I am Bo Gao, the first author of LSSAR ("Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models", published at ICML 2026 / arXiv:2501.13428). 

First of all, congratulations on the acceptance of your paper, TDA, at ACL 2026! I read your paper with great interest. However, I noticed some factual errors regarding LSSAR in your experimental comparison (Table 1 and Table 2) that I would like to clarify.

### 1. Table 1: "Exact 0: ✗" for LSSAR is incorrect

In Table 1, LSSAR is categorised as not producing exact zeros ("Exact 0: ✗"). This is factually incorrect.

LSSAR decomposes standard softmax attention and, in its second stage (Re-weighting), utilises a **Shift-ReLU gate** as the core sparsification mechanism:

$$g_{ij} = \mathrm{ReLU}(a_{ij} \cdot N - 1)^p$$

where $a_{ij}$ is the L1-normalised attention weight and $N$ is the sequence length (or patch count). 

Because the Shift-ReLU gate applies the standard `ReLU` activation function to the scaled and shifted attention weights $a_{ij} \cdot N - 1$, any attention weight at or below the uniform-distribution baseline (i.e., $a_{ij} \le 1/N$) receives a gate value of **exactly 0.0**. The remaining positive weights are then renormalised. Thus, LSSAR is specifically designed to produce exact floating-point zeros, not near-zero approximations.
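To make the exact-zero behaviour concrete, here is a minimal pure-Python sketch of the gate formula above applied to a single attention row. It is an illustration only, not the code from our repository: the function name `shift_relu_gate`, the plain L1 renormalisation, the handling of an all-zero row, and the example values are assumptions chosen for this example, and $p > 0$ is assumed.

```python
def shift_relu_gate(a_row, p=1.0):
    """Apply g = ReLU(a * N - 1) ** p to one L1-normalised row, then renormalise.

    Hypothetical helper for illustration; assumes p > 0.
    """
    n = len(a_row)
    # ReLU clamps every weight at or below the uniform baseline 1/N to 0.0.
    gated = [max(a * n - 1.0, 0.0) ** p for a in a_row]
    total = sum(gated)
    if total == 0.0:
        # Assumption: leave an all-zero row untouched rather than divide by zero.
        return gated
    # Assumption: surviving weights are renormalised by their L1 sum.
    return [g / total for g in gated]


row = [0.5, 0.3, 0.1, 0.1]   # L1-normalised weights, N = 4, baseline 1/N = 0.25
out = shift_relu_gate(row)
zero_mask = [w == 0.0 for w in out]
print(out)
print(zero_mask)             # the two entries below 0.25 compare equal to 0.0
```

The two weights below $1/N = 0.25$ come out as floating-point `0.0` (the comparison `w == 0.0` holds), while the two surviving weights are renormalised to sum to 1.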

The correct entry for LSSAR in Table 1 should be:

| Property | Correct Value |
|---|:---:|
| Exact 0 | **✓** |
| Negative | ✗ |
| No sum-to-1 | Partial (surviving tokens are renormalised) |
| Length-aware | ✓ |

### 2. Table 2: 0% Sparsity indicates an implementation issue

In Table 2, your benchmark results report LSSAR's sparsity as **0%** (fully dense attention), with a validation loss of 3.1676, which is worse than standard Softmax (3.1196).

With a correct implementation of the Shift-ReLU gate, LSSAR typically achieves **50% to 90% attention sparsity** depending on the hyperparameter $p$, while matching or exceeding Softmax in language modelling performance. 

Reporting 0% sparsity suggests one of two things about your implementation of LSSAR: either it omitted the Shift-ReLU gate entirely (e.g., only replacing the exponential activation with Softplus), or it contained a bug that prevented the gate from ever activating. In both cases the benchmarked model would be a dense Softplus attention baseline rather than LSSAR.
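A quick diagnostic that separates the two cases is to count exact zeros in the attention weights with and without the gate. The sketch below is a simplified, pure-Python illustration under stated assumptions: the helper names, the toy scores, and the bare Softplus-plus-L1-normalisation scoring are hypothetical and leave out the other components of LSSAR. It is not the benchmark code from either paper.

```python
import math


def softplus(x):
    # Softplus is strictly positive, so on its own it never yields an exact zero.
    return math.log1p(math.exp(x))


def l1_normalise(row):
    total = sum(row)
    return [v / total for v in row]


def shift_relu_gate(a_row, p=1.0):
    # g = ReLU(a * N - 1) ** p, followed by renormalisation (assumes p > 0).
    n = len(a_row)
    gated = [max(a * n - 1.0, 0.0) ** p for a in a_row]
    total = sum(gated)
    return gated if total == 0.0 else [g / total for g in gated]


def sparsity(rows):
    # Fraction of attention weights that are exactly 0.0.
    flat = [w for row in rows for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)


# Toy pre-activation scores, two query rows over four keys (made-up values).
scores = [[2.0, 0.5, -1.0, -3.0],
          [0.0, 0.0, 0.0, 4.0]]

dense = [l1_normalise([softplus(s) for s in row]) for row in scores]
gated = [shift_relu_gate(row) for row in dense]

dense_sparsity = sparsity(dense)
gated_sparsity = sparsity(gated)
print(dense_sparsity, gated_sparsity)
```

On this toy input the Softplus-only weights contain no exact zeros, whereas the gated weights have 5 of 8 entries equal to `0.0`. A measured sparsity of exactly 0% is therefore what one would expect if the gate were missing or inactive.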

### Request for Correction

Could you please:
1. Correct Table 1 in future revisions/errata to show that LSSAR produces exact zeros (✓).
2. Re-run LSSAR using a correct implementation (our official open-source repository is available at: https://github.com/iminfine/freeattn) or amend the paper/benchmark description to clarify this baseline's implementation details.

Thank you for your time and contribution to the community!

Best regards,  
Bo Gao
