Skip to content

Make Timestamps Returned by Snapshot APIs Consistent #43148

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jul 5, 2019

Conversation

original-brownbear
Copy link
Contributor

  • We don't have to calculate the start and end times form the shards for the status API, we have the start time available from the CS or the SnapshotInfo in the repo and can either take the end time form the SnapshotInfo or
    take the most recent time from the shard stats for in progress snapshots
  • Had to add serialization of these time values since we can't compute them from the shards anymore
    • Fall back to computing time values from shards in BwC scenario
  • Started a new IT class for this since the existing ones are huge already and we have a few more API inconsistencies with snapshots to fix and add tests for
  • Closes Snapshot status API returns different timing information than snapshot info API #43074

Admittedly the weird mutable data structures, SnapshotStats in particular becomes a little more magical than it already is with this change. I'd try to clean this up in a follow-up.

* We don't have to calculate the start and end times form the shards for the status API, we have the start time available from the CS or the `SnapshotInfo` in the repo and can either take the end time form the `SnapshotInfo` or
take the most recent time from the shard stats for in progress snapshots
* Closes elastic#43074
@original-brownbear original-brownbear added >bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.3.0 labels Jun 12, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@original-brownbear original-brownbear changed the title Make Timestamps returned by Snapshot APIs Consistent Make Timestamps Returned by Snapshot APIs Consistent Jun 12, 2019
@original-brownbear
Copy link
Contributor Author

Jenkins run elasticsearch-ci/2

Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for taking care of this. I left some minor comments

I agree that SnapshotStats would benefit from a clean up

@@ -46,6 +46,11 @@
SnapshotStats() {
}

SnapshotStats(long startTime, long time) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need another constructor just for this, maybe reuse the existing one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I guess I hated putting all those 0 but it's shorted this way for sure and horrible regardless :D

this.snapshot = Objects.requireNonNull(snapshot);
this.state = Objects.requireNonNull(state);
this.shards = Objects.requireNonNull(shards);
this.includeGlobalState = includeGlobalState;
shardsStats = new SnapshotShardsStats(shards);
updateShardStats();
updateShardStats(startTime, time);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check that they are not < 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I don't think we need to I think :) We get these numbers from org.elasticsearch.index.snapshots.IndexShardSnapshotStatus which does all the checks internally so we should be good here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

builder.add(new SnapshotStatus(new Snapshot(repositoryName, snapshotId), state,
Collections.unmodifiableList(shardStatusBuilder), snapshotInfo.includeGlobalState()));
Collections.unmodifiableList(shardStatusBuilder), snapshotInfo.includeGlobalState(),
startTime, snapshotInfo.endTime() - startTime));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that SnapshotInfo guarantees that endTime > startTime, maybe we should be defensive here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well no it doesn't but this isn't a new issue is it? It just shows up in two places now instead of in one. If you have skewed clocks and a master failover during snapshotting the absolute timestamp stored in the CS won't work out. All that happens now is that we see the same issue in two places instead of one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.greaterThan;

@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed, the default scope SUITE should be fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess for a single test it doesn't matter anyway, but in general I'm not a big fan of the suit scope when running tests that so heavily depend on the exact state of the cluster like here -> that's why I went with test scope. Not sure if that makes sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't think there is a reason here to use TEST and not the default, unless you noticed test failures with the SUITE test scope and if so we need to investigate them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me :) Removed it.

@original-brownbear
Copy link
Contributor Author

After merging master into this branch (even after resolving the compile issue) this is actually failing tests. Looking into it, something is off now with the timestamps somewhere.

@original-brownbear
Copy link
Contributor Author

@tlrx I have to admit, your initial instincts about the possibility of negative timestamps were correct after all. I had to push 6223466 after merging master. There were two issues here:

  1. The code I changed from using relative to absolute time needed absolute time. Otherwise, it will not work in a BwC scenario (because in those we will still use the time stamp on the shards to calculate the end and start times) so I moved all the logic in the repo and snapshot handling that was relevant to the same absolute time method on the thread pool to get things consistent here
  2. The logic of calculating the start and end time needs to be disabled when using timestamps from SnapshotInfo. This only worked fine when all the code was using System.currentTimeMillis but stopped working now with cached time. Since all the nodes in the test cluster use different thread pools that was simulating clock skew, which lead to the calculation not working out anymore (since due to skew a shard could look like it started before the snapshot itself or finished after).

Can you take another look please? :)

@original-brownbear original-brownbear requested a review from tlrx July 4, 2019 14:54
Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I left a comment about test scope which I think is important.

this.snapshot = Objects.requireNonNull(snapshot);
this.state = Objects.requireNonNull(state);
this.shards = Objects.requireNonNull(shards);
this.includeGlobalState = includeGlobalState;
shardsStats = new SnapshotShardsStats(shards);
updateShardStats();
updateShardStats(startTime, time);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.greaterThan;

@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't think there is a reason here to use TEST and not the default, unless you noticed test failures with the SUITE test scope and if so we need to investigate them.

builder.add(new SnapshotStatus(new Snapshot(repositoryName, snapshotId), state,
Collections.unmodifiableList(shardStatusBuilder), snapshotInfo.includeGlobalState()));
Collections.unmodifiableList(shardStatusBuilder), snapshotInfo.includeGlobalState(),
startTime, snapshotInfo.endTime() - startTime));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@original-brownbear
Copy link
Contributor Author

Jenkins run elasticsearch-ci/1

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jul 5, 2019
As a result of elastic#43148 neitehr of these apply:
* First comment about relative time doesn't apply anymore since we use absolute time here again
* The overall time in all snapshot APIs is now consistent and contains cleanup the time since it's not just calculated as the maximum of all shard upload times
original-brownbear added a commit that referenced this pull request Jul 8, 2019
As a result of #43148 neitehr of these apply:
* First comment about relative time doesn't apply anymore since we use absolute time here again
* The overall time in all snapshot APIs is now consistent and contains cleanup the time since it's not just calculated as the maximum of all shard upload times
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jul 12, 2019
* We don't have to calculate the start and end times form the shards for the status API, we have the start time available from the CS or the `SnapshotInfo` in the repo and can either take the end time form the `SnapshotInfo` or
take the most recent time from the shard stats for in progress snapshots
* Closes elastic#43074
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jul 12, 2019
* Since we're changing the way the snapshot status message is serialized
in elastic#43148 we need to turn of BwC tests here
original-brownbear added a commit that referenced this pull request Jul 12, 2019
* We don't have to calculate the start and end times form the shards for the status API, we have the start time available from the CS or the `SnapshotInfo` in the repo and can either take the end time form the `SnapshotInfo` or
take the most recent time from the shard stats for in progress snapshots
* Closes #43074
original-brownbear added a commit that referenced this pull request Jul 12, 2019
* Since we're changing the way the snapshot status message is serialized
in #43148 we need to turn of BwC tests here
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jul 12, 2019
* elastic#43148 has been backported to `7.4` -> we can reenable these tests
after adjusting the version in teh serialization logic
original-brownbear added a commit that referenced this pull request Jul 12, 2019
* #43148 has been backported to `7.4` -> we can reenable these tests
after adjusting the version in teh serialization logic
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v7.4.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Snapshot status API returns different timing information than snapshot info API
5 participants