Skip to content

Conversation

@joe-redpanda
Copy link
Contributor

@joe-redpanda joe-redpanda commented Dec 23, 2025

rm_stm instantiated its lso to -1, where invalid_lso is 0
Normally, invalid_lso gets converted into lso_unavailable in the fetch path, but -1 != 0, so direct consumer received an lso of -1, thus firing an error log.

This pr changes invalid lso to be -1 in parity with hwm, and then spot updates places where lso's are being instantiated to the magic number "-1" to instead use invalid_lso.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

Bug Fixes

  • guarantee invalid lso gets converted to a retryable error when relevant

@joe-redpanda joe-redpanda changed the title Dc fix err log rm_stm invalid lso fix Jan 5, 2026
@joe-redpanda joe-redpanda marked this pull request as ready for review January 5, 2026 19:42
Copilot AI review requested due to automatic review settings January 5, 2026 19:42
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an issue where rm_stm incorrectly initializes the last stable offset (LSO) to -1 instead of 0, which is the proper value for an invalid LSO. Since invalid LSO (0) is used throughout the codebase for error translation into retry-able consumer errors, this incorrect initialization prevents proper error handling.

Key Changes

  • Updated rm_stm::last_stable_offset() to initialize LSO with model::invalid_lso instead of -1
  • Updated source_partition_offsets::last_stable_offset default value to use model::offset_cast(model::invalid_lso) instead of -1

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/v/cluster/rm_stm.cc Changed local variable initialization from -1 to model::invalid_lso in last_stable_offset() function
src/v/kafka/client/direct_consumer/api_types.h Changed struct member default initialization from -1 to model::offset_cast(model::invalid_lso)

@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#78548
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
MasterTestSuite test_remote_partition_read_cached_index unit https://buildkite.com/redpanda/redpanda/builds/78548#019b8faf-6c04-4807-ba72-5c2e71ad3658 FAIL 0/1

@bharathv
Copy link
Contributor

bharathv commented Jan 5, 2026

link the jira?


auto synced_leader = _raft->is_leader() && _raft->term() == _insync_term;
model::offset lso{-1};
model::offset lso{model::invalid_lso};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

q: the new behavior is that it returns offset_not_available and that triggers a retry?

return error_code::offset_not_available;

If yes, I wonder (without this fix) if the translation here would throw an exception because it attempts translation before the start offset? How is it returning -1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the logs

TRACE 2025-12-23 01:45:37,297 [shard 1:fetc] tx - [{kafka/source-topic/0}] - rm_stm.cc:1322 - lso update in progress, last_known_lso: -1, last_applied: 0

from the code

    auto maybe_lso = _partition->last_stable_offset();
    if (maybe_lso == model::invalid_lso) {
        return error_code::offset_not_available;
    }
    return _translator->from_log_offset(maybe_lso);

and then translator is actually just a translator_state, code here

model::offset offset_translator_state::from_log_offset(model::offset o) const {
    const auto d = delta(o);
    return model::offset(o - d);
}

no checks on valid offset nor exceptions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, this is a +1 for trying to use name constants as much as possible as opposed to magic numbers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to be consistent. We track high watermark as the inclusive offset and add one when it is returned. Would it make sense to track LSO in the same way i.e. leave it as -1 and only add 1 when we need to retrieve it ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We track high watermark as the inclusive offset and add one when it is returned
Can you point me in the direction of where this occurs?

I poked around and it seems hwm and lso get the same treatment almost everywhere in code. Its possible that lso actually gets an increment on some fetch path, in which case, maybe invalid_lso should actually be -1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in which case, maybe invalid_lso should actually be -1

that makes sense.. hopefully nothing breaks :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests pass, looks good afaik

@joe-redpanda joe-redpanda requested a review from bharathv January 6, 2026 22:02
@joe-redpanda
Copy link
Contributor Author

link the jira?

no jira link, this was a failure flagged in private team core

Use named constants when possible.
@joe-redpanda joe-redpanda changed the title rm_stm invalid lso fix [[do not review]] rm_stm invalid lso fix Jan 8, 2026
@joe-redpanda
Copy link
Contributor Author

letting ci run its course

@joe-redpanda joe-redpanda changed the title [[do not review]] rm_stm invalid lso fix invalid lso fix Jan 9, 2026
@joe-redpanda joe-redpanda merged commit 0255b8b into redpanda-data:dev Jan 9, 2026
22 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v25.3.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v25.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-29112-v25.3.x-367 remotes/upstream/v25.3.x
git cherry-pick -x 621a4d90b2 faf1868050

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants