Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

detect unsynchronized contract table and retry #10617

Merged
merged 42 commits into from
Sep 28, 2021

Conversation

S11001001
Copy link
Contributor

@S11001001 S11001001 commented Aug 18, 2021

After the websocket DB query, we check the offset table to ensure that it's consistent with what we just wrote to it in the fetch-and-persist phase. If it isn't, we try again. (We can always use the offset that we just wrote in our final response body, because if it doesn't match, we're going to use the new offset and re-fetch anyway.)

We try 10 times, and if we still don't get consistent results, we fail the WS request. I haven't explored more complex backoff strategies here; the focus of this PR is getting a minimal update that is least likely to livelock (and thus exhaust all retries).

We could also use the fetchAndPersistBracket function for the non-WS queries. I'm not sure that's necessary, though.

Fixes #10334. See that issue for a longer explanation of the possible race conditions this PR addresses.

CHANGELOG_BEGIN
- [JSON API] Under rare conditions, a multi-template query backed by database
  could have an ACS portion that doesn't match its transaction stream, if
  updated concurrently.  This conditions is now checked and accounted for.
  See `issue #10617 <https://github.com/digital-asset/daml/pull/10617>`__.
CHANGELOG_END
  • get rid of XTag/XX unless I need it for a different minimumViableOffsets approach

Pull Request Checklist

  • Read and understand the contribution guidelines
  • Include appropriate tests
  • Set a descriptive title and thorough description
  • Add a reference to the issue this PR will solve, if appropriate
  • Include changelog additions in one or more commit message bodies between the CHANGELOG_BEGIN and CHANGELOG_END tags
  • Normal production system change, include purpose of change in description

NOTE: CI is not automatically run on non-members pull-requests for security
reasons. The reviewer will have to comment with /AzurePipelines run to
trigger the build.

- Writing only the latest-read, as before, would imply unsynced offsets
  that are actually well-synced.  This puts the DB in a more uniform
  state, i.e. it should actually reflect the single value that the
  fetchAndPersist loop tries to catch everything up to.
- Treating every unsynced offset as a lag would make us needlessly retry
  perfectly synchronized query results.
@S11001001 S11001001 added the component/json-api HTTP JSON API label Aug 18, 2021
@S11001001 S11001001 added this to the HTTP JSON API Maintenance milestone Aug 18, 2021
@S11001001 S11001001 self-assigned this Aug 18, 2021
@stefanobaghino-da stefanobaghino-da added the team/ledger-clients Related to the Ledger Clients team's components. label Aug 31, 2021
CHANGELOG_BEGIN
- [JSON API] Under rare conditions, a multi-template query backed by database
  could have an ACS portion that doesn't match its transaction stream, if
  updated concurrently.  This conditions is now checked and accounted for.
  See `issue #10617 <https://github.com/digital-asset/daml/pull/10617>`__.
CHANGELOG_END
Copy link
Contributor

@stefanobaghino-da stefanobaghino-da left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will this impact performance? Is there some visible effect on existing performance tests?

val maxLocalOff = partyOffs
.collect {
case (unsyncedParty, unsyncedOff)
if queriedParty(unsyncedParty) || unsyncedOff > expectedOffset =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what should be the behaviour for an unsyncedParty which is not a queriedParty and has unsyncedOff < expectedOffset ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can safely ignore it. This is the test "not require update for lagging, but unqueried, party".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking of it as an optimization, but in reality it's a must-have. If an unqueried party is lagging, we don't have the token to update it, so if we decided to update based on that, we'd get stuck in the loop and exhaust retries.

val singleton = Set(surrogateTpId)
// if a queried party is not in the map, we can safely assume that
// it is exactly at expectedOffset
val caughtUp = maxLocalOff <= expectedOffset ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again slightly confused on how maxLocalOff being < expectedOffset is considered as caughtUp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the invariant that every queriedParty is already at least expectedOffset, but it may not be in the map. Therefore, if the condition you mention holds, then maxLocalOff must have come from an unqueried party, which is fine as stated in the prior comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put another way, we care only that the queriedParty is caught up.

Copy link
Contributor

@akshayshirahatti-da akshayshirahatti-da left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

functionally looks fine to me, thanks for explaining 👍

}

"report desync, but no updates, if consistent" in {
mvo(3, Map(0 -> everyParty(5), 1 -> everyParty(5))) should ===(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC expectedOffset should come from one of the queried parties right ? so is this scenario ever possible ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expectedOffset is only "what we just wrote"; it never comes from reading. So this can happen if all these template-IDs/parties got updated from 3 to 5 after our own fetch.

@S11001001
Copy link
Contributor Author

How will this impact performance? Is there some visible effect on existing performance tests?

@stefanobaghino-da There's no visible effect on AsyncQueryConstantAcs between 9641fd5 and 5da2571. The ContractDao benchmarks don't use ContractsFetch so have no code changes.

A couple things on concurrent updates:

  1. If we rerun, we don't rollback; we read forward as if we were just starting the request. So successive retries won't redo any fetching that was already done in previous tries.
  2. If you want to follow the common (no concurrent update) case while thinking about the code, it's as follows:
    1. Queries#unsyncedOffsets returns a map whose keys are all the template IDs and values are empty maps; the query it does has no rows.
    2. In minimumViableOffsets,
      1. partyOffs always has only unqueried parties;
      2. maxLocalOff is always expectedOffset;
      3. caughtUp is always true by the first branch;
      4. lagging is always Some, but minimumViableOffsets will return None.
    3. In the Lagginess append,
      1. offA === offB;
      2. nc, ncA, and ncB are empty;
      3. c accumulates all the surrogate template IDs we're querying.

@mergify mergify bot merged commit b4d0031 into main Sep 28, 2021
@mergify mergify bot deleted the 10334-detect-contract-table-desync branch September 28, 2021 20:47
azure-pipelines bot pushed a commit that referenced this pull request Sep 29, 2021
This PR has been created by a script, which is not very smart
and does not have all the context. Please do double-check that
the version prefix is correct before merging.

@stefanobaghino-da is in charge of this release.

Commit log:
```
b4d0031 detect unsynchronized contract table and retry (#10617)
3d779cf [Mutable state cache] Fix initialization offset (#11024)
5458aa8 Switch on NonUnitStatements warning in daml-lf/transaction (#11048)
9641fd5 auth middleware: no print secret (#11050)
a885f52 [DPP-417] [DPP-595] Add error code version switching mechanism (#11035)
12e0c72 fix blackduck logic (#11049)
eb87b34 kvutils: Add the logging context for ledger state operations. (#11030)
b7daa5f Address remaining dependabot alerts (#11045)
5e43f8c es: drop jobs-* indices (#10857)
03203b7 Define encoding/decoding for module imports (#11036)
57a1597 Setting timeoutToleranceMillis to 10 minutes to prevent flakiness (#11043)
df59f3f Fix Navigator dependabot alerts (#11044)
6bf45a3 Upgrade Navigator to Webpack 5 (#11040)
80e217e [DPP-622] Add conformance tests that verifies TLSv1.0 and TLSv1 are disabled. (#10983)
626e1fb Small lf value.cids regression fix (found by canton unit tests) (#11032)
a4629a4 KV: Ignore daml_lf_1.proto when checking for KV protobuf compatibility (#11021)
ee9be65 kvutils: Add metadata to `Err` [KVL-1032] (#10992)
0d3ae6e interface methods: Haskell Typechecker (#11028)
e36eb46 Resolve `set-value` to 4.0.1 and above (#11029)
4075624 interface methods: Haskell AST (#11018)
7c1fd50 Bump grunt-browserify (#11026)
5f3f582 Upgrade webpack-dev-server in Navigator (#11025)
91be1e1 Drop matchdep dependency from docs build (#11023)
fe9aeff Increase es disk size (#11019)
eac7963 LF: Refactor ProtoTest.scala (#11020)
e79a30a For the client binding propagate the full original completion [KVL-1112] (#10879)
59ad995 fix buf check (#11014)
abc3e66 Increase the tolarance for handover of control in HaCoordinatorSpec (#10997)
19b2bf4 LF: Cosmetic clean-up in the Speedy Compiler (#11015)
cb0e41f LF: Add interface support to the Preprocessor (#11013)
c33297c Remove `transactionNormalization` flag. (#11010)
f5d2135 Check protobuf compatibility of `main` and PR commits w.r.t. previous stable release [KVL-1109] (#10950)
bf8b75d interface methods: Add protobuf definitions. (#11005)
88e1430 Make LargeTransactionTest use ValueEnricher, so it can work with normalized transactions coming out of the engine. (#11003)
a043926 rotate release duty after 1.17.0-snapshot.20210921.7889.0.1b473c2b (#10972)
35666ca Add MINIMAL pragma for Additive type class (#11001)
8de162b [DPP-586] Upgrade to netty 4.1.67.Final and netty-tcnative-boringssl-static 2.0.40.Final (#10956)
871d03b release 1.18.0-snapshot.20210922.7908.0.ced4a272 (#10998)
721575e [JSON-API] Postgres perf job (#10986)
f2d9f07 Release RC2 for SDK 1.17.0 (#10996)
```
Changelog:
```
- [JSON API] Under rare conditions, a multi-template query backed by database
  could have an ACS portion that doesn't match its transaction stream, if
  updated concurrently.  This conditions is now checked and accounted for.
  See `issue #10617 <https://github.com/digital-asset/daml/pull/10617>`__.
- The OAuth2 Middleware now obfuscates its Client Secret when logging
  its config.
- [Integration Kit] We have added ``loggingContext`` as an implicit
  parameter to more _kvutils_ trait methods. Implementors may need to do
  the same in their trait implementations. This can make it easier to
  log with the appropriate context.
java-client-bindings - the original full completion is included in the `CompletionResponse` when available
Daml on SQL, Integration Kit, Sandbox: Drop support for TLS 1.0 and 1.1 in Ledger API.
```

CHANGELOG_BEGIN
CHANGELOG_END
stefanobaghino-da pushed a commit that referenced this pull request Sep 29, 2021
This PR has been created by a script, which is not very smart
and does not have all the context. Please do double-check that
the version prefix is correct before merging.

@stefanobaghino-da is in charge of this release.

Commit log:
```
b4d0031 detect unsynchronized contract table and retry (#10617)
3d779cf [Mutable state cache] Fix initialization offset (#11024)
5458aa8 Switch on NonUnitStatements warning in daml-lf/transaction (#11048)
9641fd5 auth middleware: no print secret (#11050)
a885f52 [DPP-417] [DPP-595] Add error code version switching mechanism (#11035)
12e0c72 fix blackduck logic (#11049)
eb87b34 kvutils: Add the logging context for ledger state operations. (#11030)
b7daa5f Address remaining dependabot alerts (#11045)
5e43f8c es: drop jobs-* indices (#10857)
03203b7 Define encoding/decoding for module imports (#11036)
57a1597 Setting timeoutToleranceMillis to 10 minutes to prevent flakiness (#11043)
df59f3f Fix Navigator dependabot alerts (#11044)
6bf45a3 Upgrade Navigator to Webpack 5 (#11040)
80e217e [DPP-622] Add conformance tests that verifies TLSv1.0 and TLSv1 are disabled. (#10983)
626e1fb Small lf value.cids regression fix (found by canton unit tests) (#11032)
a4629a4 KV: Ignore daml_lf_1.proto when checking for KV protobuf compatibility (#11021)
ee9be65 kvutils: Add metadata to `Err` [KVL-1032] (#10992)
0d3ae6e interface methods: Haskell Typechecker (#11028)
e36eb46 Resolve `set-value` to 4.0.1 and above (#11029)
4075624 interface methods: Haskell AST (#11018)
7c1fd50 Bump grunt-browserify (#11026)
5f3f582 Upgrade webpack-dev-server in Navigator (#11025)
91be1e1 Drop matchdep dependency from docs build (#11023)
fe9aeff Increase es disk size (#11019)
eac7963 LF: Refactor ProtoTest.scala (#11020)
e79a30a For the client binding propagate the full original completion [KVL-1112] (#10879)
59ad995 fix buf check (#11014)
abc3e66 Increase the tolarance for handover of control in HaCoordinatorSpec (#10997)
19b2bf4 LF: Cosmetic clean-up in the Speedy Compiler (#11015)
cb0e41f LF: Add interface support to the Preprocessor (#11013)
c33297c Remove `transactionNormalization` flag. (#11010)
f5d2135 Check protobuf compatibility of `main` and PR commits w.r.t. previous stable release [KVL-1109] (#10950)
bf8b75d interface methods: Add protobuf definitions. (#11005)
88e1430 Make LargeTransactionTest use ValueEnricher, so it can work with normalized transactions coming out of the engine. (#11003)
a043926 rotate release duty after 1.17.0-snapshot.20210921.7889.0.1b473c2b (#10972)
35666ca Add MINIMAL pragma for Additive type class (#11001)
8de162b [DPP-586] Upgrade to netty 4.1.67.Final and netty-tcnative-boringssl-static 2.0.40.Final (#10956)
871d03b release 1.18.0-snapshot.20210922.7908.0.ced4a272 (#10998)
721575e [JSON-API] Postgres perf job (#10986)
f2d9f07 Release RC2 for SDK 1.17.0 (#10996)
```
Changelog:
```
- [JSON API] Under rare conditions, a multi-template query backed by database
  could have an ACS portion that doesn't match its transaction stream, if
  updated concurrently.  This conditions is now checked and accounted for.
  See `issue #10617 <https://github.com/digital-asset/daml/pull/10617>`__.
- The OAuth2 Middleware now obfuscates its Client Secret when logging
  its config.
- [Integration Kit] We have added ``loggingContext`` as an implicit
  parameter to more _kvutils_ trait methods. Implementors may need to do
  the same in their trait implementations. This can make it easier to
  log with the appropriate context.
java-client-bindings - the original full completion is included in the `CompletionResponse` when available
Daml on SQL, Integration Kit, Sandbox: Drop support for TLS 1.0 and 1.1 in Ledger API.
```

CHANGELOG_BEGIN
CHANGELOG_END

Co-authored-by: Azure Pipelines DAML Build <support@digitalasset.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/json-api HTTP JSON API team/ledger-clients Related to the Ledger Clients team's components.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

handle concurrent-update race condition for DB-backed websocket queries
3 participants