Skip to content

Jepsen: multiple conflicting appends sometimes succeed #2765

@snaury

Description

@snaury

I rewrote jepsen test to merge multiple micro ops into a single query (does UPSERT from SELECT with LEFT JOIN and GROUP BY) and started to get inconsistent read failures. It appears that multiple transactions claim the same list index and successfully overwrite it, even though only one transaction should have succeeded.

A good example of such failure:

  • We append to key 192
  • One transaction appends 630, another appends 637, both succeeds
  • Other transactions see the list ending in 630, but then it changes to 637

History has this info on two appends:

# append 630
{:index 300389, :time 240680658687, :type :ok, :process 1547, :f :txn,
    :value [
        [:append 192 630]
        [:append 3 52]],
    :modified-txn [
        [:batch #{192 3} #jepsen.ydb.append.OperationBatch{:ops [[:append 192 630] [:append 3 52]], :reads [], :writes [[192 630] [3 52]], :readmap {}}]],
    :debug-info [
        [
            {:tablet 72075186226226475, :gen 25, :read-version "v1710410837721/18446744073709551615", :op "read", :node 50003, :read-id 2, :snapshot-repeatable true, :lock-tx-id 844430958736899}
            {:tablet 72075186226227048, :gen 2, :read-version "v1710410837721/18446744073709551615", :op "read", :node 50008, :read-id 1, :snapshot-repeatable true, :lock-tx-id 844430958736899}]
        [
            {:tablet 72075186226227048, :node 50008, :gen 2, :op "kqp-tx", :tx-id 281481004377038, :step 1710410838273, :version "v1710410838273/281481004377038"}
            {:tablet 72075186226226475, :node 50003, :gen 25, :op "kqp-tx", :tx-id 281481004377038, :step 1710410838273, :version "v1710410838273/281481004377038"}]]}

# append 637
{:index 300437, :time 240689511468, :type :ok, :process 1159, :f :txn,
    :value [
        [:append 192 637]
        [:append 196 45]
        [:append 196 46]],
    :modified-txn [
        [:batch #{192 196} #jepsen.ydb.append.OperationBatch{:ops [[:append 192 637] [:append 196 45]], :reads [], :writes [[192 637] [196 45]], :readmap {}}]
        [:commit nil [:append 196 46]]],
    :debug-info [
        [
            {:tablet 72075186226227048, :gen 2, :read-version "v1710410838260/18446744073709551615", :op "read", :node 50008, :read-id 1, :snapshot-repeatable true, :lock-tx-id 562955991686003}
            {:tablet 72075186226227048, :gen 2, :read-version "v1710410838260/18446744073709551615", :op "read", :node 50008, :read-id 1, :snapshot-repeatable true, :lock-tx-id 562955991686003}]
        [
            {:tablet 72075186226227048, :gen 2, :op "kqp-tx", :node 50008, :snapshot "v1710410838260/18446744073709551615", :tx-id 562955991686035, :snapshot-repeatable true, :lock-tx-id 562955991686035}
            {:tablet 72075186226227048, :gen 2, :read-version "v1710410838260/18446744073709551615", :op "read", :node 50008, :read-id 0, :snapshot-repeatable true, :lock-tx-id 562955991686035}
            {:tablet 72075186226227048, :node 50008, :gen 2, :op "kqp-tx", :tx-id 562955991686134, :version "v1710410838283/18446744073709551615"}]]}

What I could infer from this:

  • Tx with :append 192 630 committed first at v1710410838273/281481004377038, it was a pretty straight forward batch, single query with a separate commit, which applied deferred effects directly at commit time
  • Tx with :append 192 637 was part of a batch query (without commit), it performed reads at v1710410838260/18446744073709551615 with :lock-tx-id 562955991686003, deferred effects should have been buffered in memory
  • Then the second tx performed [:append 196 46] with fused commit. Since it append reads from the table it needed to flush deferred effects, and you can see it happened with :tx-id 562955991686035 and :lock-tx-id 562955991686035. Notice LockTxId is different (not 562955991686003) and matches TxId. This means AcquireLocksTxId was 0, and this can only happen when transaction has absolutely no locks in the list.
  • Since lock 562955991686003 (apparently) wasn't in memory we committed without checking it, and it successfully committed write of value 637 to the same index as 630.
  • All cases I've seen so far had :read-id that was not zero, apparently read was restarted for some reason, but current debug-info is only passed on success.
  • Currently debug-info doesn't log any info about acquired and validated locks, so it's unclear what the read actually returned (it's unclear whether the lock was successfully taken, for example, or whether it was initially broken).

I started seeing this after switching to complex query, so join and group by might be involved somehow.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions