-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Make Timestamps Returned by Snapshot APIs Consistent #43148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* We don't have to calculate the start and end times form the shards for the status API, we have the start time available from the CS or the `SnapshotInfo` in the repo and can either take the end time form the `SnapshotInfo` or take the most recent time from the shard stats for in progress snapshots * Closes elastic#43074
Pinging @elastic/es-distributed |
Jenkins run elasticsearch-ci/2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for taking care of this. I left some minor comments
I agree that SnapshotStats would benefit from a clean up
@@ -46,6 +46,11 @@ | |||
SnapshotStats() { | |||
} | |||
|
|||
SnapshotStats(long startTime, long time) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need another constructor just for this, maybe reuse the existing one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I guess I hated putting all those 0
but it's shorted this way for sure and horrible regardless :D
this.snapshot = Objects.requireNonNull(snapshot); | ||
this.state = Objects.requireNonNull(state); | ||
this.shards = Objects.requireNonNull(shards); | ||
this.includeGlobalState = includeGlobalState; | ||
shardsStats = new SnapshotShardsStats(shards); | ||
updateShardStats(); | ||
updateShardStats(startTime, time); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we check that they are not < 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I don't think we need to I think :) We get these numbers from org.elasticsearch.index.snapshots.IndexShardSnapshotStatus
which does all the checks internally so we should be good here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
builder.add(new SnapshotStatus(new Snapshot(repositoryName, snapshotId), state, | ||
Collections.unmodifiableList(shardStatusBuilder), snapshotInfo.includeGlobalState())); | ||
Collections.unmodifiableList(shardStatusBuilder), snapshotInfo.includeGlobalState(), | ||
startTime, snapshotInfo.endTime() - startTime)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that SnapshotInfo guarantees that endTime > startTime, maybe we should be defensive here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well no it doesn't but this isn't a new issue is it? It just shows up in two places now instead of in one. If you have skewed clocks and a master failover during snapshotting the absolute timestamp stored in the CS won't work out. All that happens now is that we see the same issue in two places instead of one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
import static org.hamcrest.Matchers.equalTo; | ||
import static org.hamcrest.Matchers.greaterThan; | ||
|
||
@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is needed, the default scope SUITE should be fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess for a single test it doesn't matter anyway, but in general I'm not a big fan of the suit scope when running tests that so heavily depend on the exact state of the cluster like here -> that's why I went with test scope. Not sure if that makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't think there is a reason here to use TEST and not the default, unless you noticed test failures with the SUITE test scope and if so we need to investigate them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine with me :) Removed it.
After merging |
@tlrx I have to admit, your initial instincts about the possibility of negative timestamps were correct after all. I had to push 6223466 after merging
Can you take another look please? :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I left a comment about test scope which I think is important.
this.snapshot = Objects.requireNonNull(snapshot); | ||
this.state = Objects.requireNonNull(state); | ||
this.shards = Objects.requireNonNull(shards); | ||
this.includeGlobalState = includeGlobalState; | ||
shardsStats = new SnapshotShardsStats(shards); | ||
updateShardStats(); | ||
updateShardStats(startTime, time); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
import static org.hamcrest.Matchers.equalTo; | ||
import static org.hamcrest.Matchers.greaterThan; | ||
|
||
@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't think there is a reason here to use TEST and not the default, unless you noticed test failures with the SUITE test scope and if so we need to investigate them.
builder.add(new SnapshotStatus(new Snapshot(repositoryName, snapshotId), state, | ||
Collections.unmodifiableList(shardStatusBuilder), snapshotInfo.includeGlobalState())); | ||
Collections.unmodifiableList(shardStatusBuilder), snapshotInfo.includeGlobalState(), | ||
startTime, snapshotInfo.endTime() - startTime)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
Jenkins run elasticsearch-ci/1 |
As a result of elastic#43148 neitehr of these apply: * First comment about relative time doesn't apply anymore since we use absolute time here again * The overall time in all snapshot APIs is now consistent and contains cleanup the time since it's not just calculated as the maximum of all shard upload times
As a result of #43148 neitehr of these apply: * First comment about relative time doesn't apply anymore since we use absolute time here again * The overall time in all snapshot APIs is now consistent and contains cleanup the time since it's not just calculated as the maximum of all shard upload times
* We don't have to calculate the start and end times form the shards for the status API, we have the start time available from the CS or the `SnapshotInfo` in the repo and can either take the end time form the `SnapshotInfo` or take the most recent time from the shard stats for in progress snapshots * Closes elastic#43074
* Since we're changing the way the snapshot status message is serialized in elastic#43148 we need to turn of BwC tests here
* We don't have to calculate the start and end times form the shards for the status API, we have the start time available from the CS or the `SnapshotInfo` in the repo and can either take the end time form the `SnapshotInfo` or take the most recent time from the shard stats for in progress snapshots * Closes #43074
* elastic#43148 has been backported to `7.4` -> we can reenable these tests after adjusting the version in teh serialization logic
* #43148 has been backported to `7.4` -> we can reenable these tests after adjusting the version in teh serialization logic
SnapshotInfo
in the repo and can either take the end time form theSnapshotInfo
ortake the most recent time from the shard stats for in progress snapshots
Admittedly the weird mutable data structures,
SnapshotStats
in particular becomes a little more magical than it already is with this change. I'd try to clean this up in a follow-up.