invalid lso fix #29112
Conversation
Pull request overview
This PR fixes an issue where rm_stm incorrectly initializes the last stable offset (LSO) to -1 instead of 0, which is the proper value for an invalid LSO. Since invalid LSO (0) is used throughout the codebase for error translation into retry-able consumer errors, this incorrect initialization prevents proper error handling.
Key Changes
- Updated `rm_stm::last_stable_offset()` to initialize the LSO with `model::invalid_lso` instead of -1
- Updated the `source_partition_offsets::last_stable_offset` default value to use `model::offset_cast(model::invalid_lso)` instead of -1 (see the sketch just below)
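A minimal sketch of the second change, trimmed down for illustration; the field name comes from the PR summary, while the member type and the rest of the struct are assumptions rather than the actual contents of `api_types.h`:

```cpp
// Hypothetical, trimmed-down sketch of the struct touched in
// src/v/kafka/client/direct_consumer/api_types.h; the real struct carries
// more fields and the member type here is an assumption.
struct source_partition_offsets {
    // Before: last_stable_offset{-1}
    // After: use the named sentinel so invalid-LSO comparisons stay consistent.
    kafka::offset last_stable_offset{model::offset_cast(model::invalid_lso)};
};
```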
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/v/cluster/rm_stm.cc | Changed local variable initialization from -1 to model::invalid_lso in last_stable_offset() function |
| src/v/kafka/client/direct_consumer/api_types.h | Changed struct member default initialization from -1 to model::offset_cast(model::invalid_lso) |
CI test results: test results on build #78548
link the jira?
  auto synced_leader = _raft->is_leader() && _raft->term() == _insync_term;
- model::offset lso{-1};
+ model::offset lso{model::invalid_lso};
q: the new behavior is that it returns offset_not_available and that triggers a retry?
return error_code::offset_not_available;
If yes, I wonder (without this fix) if the translation here would throw an exception because it attempts translation before the start offset? How is it returning -1?
from the logs
TRACE 2025-12-23 01:45:37,297 [shard 1:fetc] tx - [{kafka/source-topic/0}] - rm_stm.cc:1322 - lso update in progress, last_known_lso: -1, last_applied: 0
from the code
auto maybe_lso = _partition->last_stable_offset();
if (maybe_lso == model::invalid_lso) {
return error_code::offset_not_available;
}
return _translator->from_log_offset(maybe_lso);
and then the translator is actually just an offset_translator_state, code here:
model::offset offset_translator_state::from_log_offset(model::offset o) const {
const auto d = delta(o);
return model::offset(o - d);
}
no checks for a valid offset, and no exceptions
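As a minimal, standalone illustration (plain integers instead of Redpanda's offset types, not the actual implementation) of why a -1 slips through this path silently rather than throwing:

```cpp
#include <cstdint>
#include <iostream>

// Simplified stand-in for offset_translator_state::from_log_offset: it only
// subtracts a delta and never validates the input offset.
int64_t from_log_offset(int64_t log_offset, int64_t delta) {
    return log_offset - delta;
}

int main() {
    // A stale LSO of -1 fails the `maybe_lso == invalid_lso` check (which
    // expects 0), gets translated anyway, and comes back as -1 with no error.
    std::cout << from_log_offset(-1, 0) << "\n"; // prints -1, no exception
}
```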
IMO, this is a +1 for using named constants as much as possible instead of magic numbers
Let's try to be consistent. We track the high watermark as the inclusive offset and add one when it is returned. Would it make sense to track the LSO in the same way, i.e. leave it as -1 and only add 1 when we need to retrieve it?
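A rough sketch of that convention (names are illustrative, not the actual rm_stm members): track the offset inclusively so the empty state is naturally -1, and add one only when handing the value out:

```cpp
#include <cstdint>

// Illustrative only: inclusive tracking makes "nothing applied yet" equal -1,
// and the exclusive "next offset" view is produced at read time by adding one.
struct inclusive_offset_tracker {
    int64_t last = -1; // -1 == empty, mirroring the high-watermark convention

    void advance_to(int64_t o) { last = o; }

    // What a caller (e.g. the fetch path) would see.
    int64_t next() const { return last + 1; } // -1 + 1 == 0 when empty
};
```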
> We track high watermark as the inclusive offset and add one when it is returned

Can you point me in the direction of where this occurs?
I poked around and it seems hwm and lso get the same treatment almost everywhere in the code. It's possible that lso actually gets an increment on some fetch path, in which case maybe invalid_lso should actually be -1.
> in which case, maybe invalid_lso should actually be -1

that makes sense.. hopefully nothing breaks :)
tests pass, looks good afaik
no jira link, this was a failure flagged in private team core
ecfa185 to 4fab444 (Compare)
4fab444 to c5f985a (Compare)
Use named constants when possible.
c5f985a to faf1868 (Compare)
letting ci run its course
/backport v25.3.x
Failed to create a backport PR to v25.3.x branch. I tried:
rm_stm instantiated its lso to -1, whereas invalid_lso is 0.
Normally, invalid_lso gets converted into lso_unavailable in the fetch path, but -1 != 0, so the direct consumer received an lso of -1, thus firing an error log.
This PR changes invalid_lso to be -1, in parity with the hwm, and then spot-updates the places where lsos are being instantiated to the magic number "-1" to instead use invalid_lso.
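A rough sketch of that direction; the exact location and spelling of the constant in the real tree may differ, and the call site shown is illustrative:

```cpp
// Sketch only: define the sentinel once as -1 (parity with the hwm convention)...
namespace model {
inline constexpr offset invalid_lso{-1};
} // namespace model

// ...and replace the scattered magic numbers at the call sites:
// before: model::offset lso{-1};
// after:
model::offset lso{model::invalid_lso};
```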
Backports Required
Release Notes
Bug Fixes