Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: switch to SHUTTING_DOWN state unconditionally #4408

Merged
merged 2 commits into from
Jan 8, 2025
Merged

Conversation

romange
Copy link
Collaborator

@romange romange commented Jan 6, 2025

During the shutdown sequence always switch to SHUTTING_DOWN. Make sure that the rest of the code does not break if it can not switch to the desired global state + some clean ups around state transitions.

@romange romange requested a review from adiholden January 6, 2025 07:21
@romange romange force-pushed the FixShutdown branch 2 times, most recently from 7daf643 to 402cc7b Compare January 6, 2025 09:25
@romange
Copy link
Collaborator Author

romange commented Jan 6, 2025

@adiholden I also run the full regtest suit https://github.com/dragonflydb/dragonfly/actions/runs/12629900466

During the shutdown sequence always switch to SHUTTING_DOWN.
Make sure that the rest of the code does not break if it can not switch to the desired
global state + some clean ups around state transitions.

Finally, reduce the amount of data in test_replicaof_reject_on_load

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
}
if (switch_state) {
SwitchState(GlobalState::ACTIVE, GlobalState::LOADING);
loading_state_counter_++;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it possible?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how what possible?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To have 2 loading processes at the same time

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


replica.stop()
replica.start()
c_replica = replica.client()

@assert_eventually
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you use the assert_eventually here? shouldnt we see the loading state at the first time we run the info persistence?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ignore the implementation details and assume there can be a short period of time where the server has not started loading yet or ServerState::g_state_ has not been updated yet. I embrace the eventual consistency nature of state transitions

persistence = await c_replica.info("PERSISTENCE")
assert persistence["loading"] == 1

# If this fails adjust `keys` and the `assert dbsize >= 30000` above.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can update the comment there is not assert on dbsize

@@ -1971,23 +1971,27 @@ async def test_replicaof_reject_on_load(df_factory, df_seeder_factory):
df_factory.start_all([master, replica])

c_replica = replica.client()
await c_replica.execute_command(f"DEBUG POPULATE 8000000")
await c_replica.execute_command(f"DEBUG POPULATE 4000000")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason we use large amount of keys here is to generate lots of data which will be later take time to load.
If you will use
seeder = StaticSeeder(
key_target=800,
data_size=100000,
collection_size=10000,
types=["SET"])
The test will be extremely fast and this will be same data size.
2 reasons for this:

  1. when populating we yeild every 32 keys added I think
  2. we add multiple elements in one command when using collections while for string we each time add one key

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that case the variation of "debug populate ... type SET ... " could also work.

Copy link
Collaborator Author

@romange romange Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i just confirmed debug populate 800 key 1000 RAND type set elements 2000 creates 800 sets with 2000 elements of size 1000 each

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StaticSeeder uses debug poulate its similar to your fix, anyway the change is good

@romange romange requested a review from adiholden January 7, 2025 18:52
if (schedule_done_.WaitFor(100ms)) {
return;
}
} while (ss->gstate() == GlobalState::LOADING);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason you move several places in the code to get the global state from ServerState?
I believe we used the service_.GetGlobalState() in this places intentionaly.
We assume that the ServerState::state is eventually consistent with the global state, therefor in places where we allow/recect running some flow based on the state we use the global state as it could be that there are 2 commands running at the same time one is changing the state and the other one checks the state to determine whether we can run the flow and might also change the state based on this check.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And that's exactly why I removed it. It's misleading. There are no transactional guarantees when you test an atomic variable like this. The global state can change a millisecond later, so this check is not "better". I do not think that in all these places it's super important but I wanted to remove this false sense of correctness on purpose.

@kostasrim
Copy link
Contributor

Fixes this one #4423 right ?

@romange
Copy link
Collaborator Author

romange commented Jan 8, 2025

we will see

@romange romange merged commit 0a40083 into main Jan 8, 2025
9 checks passed
@romange romange deleted the FixShutdown branch January 8, 2025 09:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants