Clarification on LSSAR implementation and properties in Table 1 and Table 2 #2

@iminfine

Hello TDA team,

I am Bo Gao, the first author of LSSAR ("Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models", published at ICML 2026 / arXiv:2501.13428).

First of all, congratulations on the acceptance of your paper, TDA, at ACL 2026! I read your paper with great interest. However, I noticed some factual errors regarding LSSAR in your experimental comparison (Table 1 and Table 2) that I would like to clarify.

1. Table 1: "Exact 0: ✗" for LSSAR is incorrect

In Table 1, LSSAR is categorised as not producing exact zeros ("Exact 0: ✗"). This is factually incorrect.

LSSAR decomposes standard softmax attention and, in its second stage (Re-weighting), utilises a Shift-ReLU gate as the core sparsification mechanism:

$$g_{ij} = \mathrm{ReLU}(a_{ij} \cdot N - 1)^p$$

where $a_{ij}$ is the L1-normalised attention weight and $N$ is the sequence length (or patch count).

Because the Shift-ReLU gate applies the standard ReLU activation to the rescaled and shifted attention weights $a_{ij} \cdot N - 1$, any attention weight at or below the uniform-distribution baseline (i.e., $a_{ij} \le 1/N$) is set to exactly 0.0. The remaining positive weights are then renormalised. LSSAR is therefore specifically designed to produce exact floating-point zeros, not near-zero approximations.
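To illustrate the point, here is a minimal pure-Python sketch of the gate on a single attention row. It is not the official implementation: the function name `shift_relu_gate` is hypothetical, and the renormalisation step (dividing the gated values by their sum) is an assumption based on the description above; the repository code may differ in detail.

```python
def shift_relu_gate(weights, p=1.0):
    """Apply g = ReLU(a * N - 1) ** p to one L1-normalised row, then renormalise.

    Illustrative sketch only; renormalising by the sum of the gated
    values is an assumption, not taken from the official code.
    """
    n = len(weights)
    # Weights at or below the uniform baseline 1/N map to exactly 0.0.
    gated = [max(a * n - 1.0, 0.0) ** p for a in weights]
    total = sum(gated)
    if total == 0.0:
        # Every weight was at or below 1/N (e.g. a uniform row); keep the zeros.
        return gated
    return [g / total for g in gated]


row = [0.4, 0.3, 0.2, 0.1]  # L1-normalised, N = 4, baseline 1/N = 0.25
out = shift_relu_gate(row, p=1.0)
print(out)
```

With $N = 4$, the entries 0.2 and 0.1 fall below $1/N = 0.25$, so they come out as exact zeros rather than small positive values, while the two surviving entries are rescaled to sum to 1.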

The correct entry for LSSAR in Table 1 should be:

| Property | Correct Value |
| --- | --- |
| Exact 0 | ✓ |
| Negative | |
| No sum-to-1 | Partial (surviving tokens are renormalised) |
| Length-aware | |

2. Table 2: 0% Sparsity indicates an implementation issue

In Table 2, your benchmark results report LSSAR's sparsity as 0% (fully dense attention), with a validation loss of 3.1676, which is worse than standard Softmax (3.1196).

With a correct implementation of the Shift-ReLU gate, LSSAR typically achieves 50% to 90% attention sparsity depending on the hyperparameter $p$, while matching or exceeding Softmax in language modelling performance.

Reporting 0% sparsity suggests that your implementation of LSSAR either omitted the Shift-ReLU gate entirely (e.g., only replacing the exponential activation with Softplus) or had an implementation bug where the gate never activated, rendering it a dense Softplus attention baseline rather than LSSAR.
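A quick way to tell the two cases apart is to count exact zeros before and after the gate. The sketch below is only a diagnostic under stated assumptions: the helper names are hypothetical, the scores are toy values, and the first stage is modelled as Softplus followed by L1 normalisation, which is my shorthand for the description in this issue rather than the repository code.

```python
import math


def softplus(x):
    # Strictly positive for every finite x, so it never yields an exact zero here.
    return math.log1p(math.exp(x))


def l1_normalise(values):
    total = sum(values)
    return [v / total for v in values]


def shift_relu(weights, p=1.0):
    # g = ReLU(a * N - 1) ** p, applied to one L1-normalised row.
    n = len(weights)
    return [max(a * n - 1.0, 0.0) ** p for a in weights]


def sparsity(rows):
    # Fraction of entries that are exactly 0.0.
    entries = [w for row in rows for w in row]
    return sum(1 for w in entries if w == 0.0) / len(entries)


# Toy attention scores (two queries, four keys); values are made up.
scores = [[2.0, 0.5, -1.0, -3.0], [0.0, 1.0, 3.0, -2.0]]

dense = [l1_normalise([softplus(s) for s in row]) for row in scores]
gated = [shift_relu(row) for row in dense]

print("Softplus only:", sparsity(dense))
print("With Shift-ReLU gate:", sparsity(gated))
```

If the measured sparsity stays at 0 after the gate on real attention maps, the gate is most likely missing or never activating.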

Request for Correction

Could you please:

  1. Correct Table 1 in future revisions/errata to show that LSSAR produces exact zeros (✓).
  2. Re-run LSSAR using a correct implementation (our official open-source repository is available at: https://github.com/iminfine/freeattn) or amend the paper/benchmark description to clarify this baseline's implementation details.

Thank you for your time and contribution to the community!

Best regards,
Bo Gao
