Skip to content

Conversation

ericm-db
Copy link
Contributor

@ericm-db ericm-db commented Apr 2, 2025

What changes were proposed in this pull request?

This PR introduces state machine validation to RocksDBStateStore to enforce proper usage patterns. It adds:

  1. Explicit state machine transitions between UPDATING, COMMITTED, and ABORTED states
  2. Validation logic to ensure operations are only executed in appropriate states
  3. Automatic cleanup via task completion listeners
  4. Error handling improvements for state store maintenance
  5. Refactoring of unload operation to ensure it runs only on maintenance threads

The implementation ensures that:

  • No updates can be made after a store has been committed or aborted
  • Metrics can only be accessed after commit/abort
  • All operations validate the state before execution
  • Task thread cleanup properly triggers maintenance for unloaded providers

Why are the changes needed?

RocksDBStateStore has implicit usage requirements that weren't being enforced. This could lead to incorrect usage patterns, potential data corruption, and hard-to-debug issues.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests have been added and existing tests modified to validate state transitions and error cases. The PR includes a new test case specifically for SPARK-51596 (unloading only occurs on maintenance thread but occurs promptly).

Was this patch authored or co-authored using generative AI tooling?

No

@ericm-db ericm-db changed the title [WIP] State machine hardening [SPARK-51745] Enforce State Machine for RocksDBStateStore and RocksDBStateStoreProvider Apr 8, 2025
@ericm-db ericm-db changed the title [SPARK-51745] Enforce State Machine for RocksDBStateStore and RocksDBStateStoreProvider [SPARK-51745] Enforce State Machine for RocksDBStateStore Apr 9, 2025
ericm-db and others added 14 commits April 8, 2025 17:27
* [WIP] provider state machine

* undoing changes

* moving tests, removing unnecessary changes

* moving tests back

* more unnecessary changes

* adding log

* this is so dumb

* Okay this might be the way to go

* let's see this

* this is less dumb

* removing thread local

* adding store.commit()

* timer suite

* taskCompletionListener -> taskFailureListener

* adding assertion that version == readStore.version

* fixing more tests

* returning same store
@ericm-db ericm-db force-pushed the state-machine-hardening branch from a573cb4 to 253d70b Compare May 6, 2025 15:45
@ericm-db ericm-db closed this Jul 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants