Skip to content

[SPARK-51745] Enforce State Machine for RocksDBStateStore #50497

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 93 commits into
base: master
Choose a base branch
from

Conversation

ericm-db
Copy link
Contributor

@ericm-db ericm-db commented Apr 2, 2025

What changes were proposed in this pull request?

This PR introduces state machine validation to RocksDBStateStore to enforce proper usage patterns. It adds:

  1. Explicit state machine transitions between UPDATING, COMMITTED, and ABORTED states
  2. Validation logic to ensure operations are only executed in appropriate states
  3. Automatic cleanup via task completion listeners
  4. Error handling improvements for state store maintenance
  5. Refactoring of unload operation to ensure it runs only on maintenance threads

The implementation ensures that:

  • No updates can be made after a store has been committed or aborted
  • Metrics can only be accessed after commit/abort
  • All operations validate the state before execution
  • Task thread cleanup properly triggers maintenance for unloaded providers

Why are the changes needed?

RocksDBStateStore has implicit usage requirements that weren't being enforced. This could lead to incorrect usage patterns, potential data corruption, and hard-to-debug issues.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests have been added and existing tests modified to validate state transitions and error cases. The PR includes a new test case specifically for SPARK-51596 (unloading only occurs on maintenance thread but occurs promptly).

Was this patch authored or co-authored using generative AI tooling?

No

@ericm-db ericm-db changed the title [WIP] State machine hardening [SPARK-51745] Enforce State Machine for RocksDBStateStore and RocksDBStateStoreProvider Apr 8, 2025
@ericm-db ericm-db changed the title [SPARK-51745] Enforce State Machine for RocksDBStateStore and RocksDBStateStoreProvider [SPARK-51745] Enforce State Machine for RocksDBStateStore Apr 9, 2025
ericm-db and others added 14 commits April 8, 2025 17:27
* [WIP] provider state machine

* undoing changes

* moving tests, removing unnecessary changes

* moving tests back

* more unnecessary changes

* adding log

* this is so dumb

* Okay this might be the way to go

* let's see this

* this is less dumb

* removing thread local

* adding store.commit()

* timer suite

* taskCompletionListener -> taskFailureListener

* adding assertion that version == readStore.version

* fixing more tests

* returning same store
@ericm-db ericm-db force-pushed the state-machine-hardening branch from a573cb4 to 253d70b Compare May 6, 2025 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants