Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only move datastream to stopped state if it is still in STOPPING state #974

Merged
merged 2 commits into from
Jan 16, 2024

Conversation

hshukla
Copy link
Collaborator

@hshukla hshukla commented Dec 13, 2023

When the datastream stop and resume call happens, the datastream will transition to READY -> STOPPING -> STOPPED state, transitioning from STOPPING TO STOPPED is an async process. The LEADER_DO_ASSIGNMENT has a logic on fetching all the datastream on STOPPING state and work on the clean up whenever it gets called, does not matter on the event type, as it is unaware of the type/operation on which it got called.

Sometimes, this behavior is causing the more work and stopping other data streams twice(as there would be one more call later on for stopping datastream).

While this behavior was fine before, recently, with restart improvement work, as mentioned above, we now perform these clean up operation in an async fashion, which leads concurrent processing. The issue happens when the resume comes after stop and this behavior marking datastream to STOPPED state instead of marking it to READY, as the transition happens like READY -> STOPPING -> STOPPED -> READY -> againSTOPPED

Example

When multiple datastreams stop requests come together, each will be put on the event queue separately. The leader coordinator then de-ques those events. Here lets say,
Stop comes for datastream_A at 0th second -> event will be in event queue; eventId=ABC
Stop comes for datastream_B at 10th second -> event will be in event queue; eventId=XYZ

now if leader coordiator de-ques the eventId=ABC at 11th second, it will perform the stop work for both datastreams; datastream_A and datastream_B and marks both of them as STOPPED(lets say at 30th second)

during this time the leader coordinator de-ques the eventId=XYZ (at 25th second), and notices that it is still in STOPPING state and starts the clean up process, this is happening async manner.

meanwhile, restart request for datastream_B comes at lets say 31th(since it was marked STOPPED at 30th second) and the receiving host marks it as READY (at 35th second)

now, on the leader host, the thread which was processing eventId=XYZ finishes the work and marks the datatream_B to STOPPED state again. It looks like we do check the status of datastream before marking it to STOPPED, but that status comes from cache, and cache sometimes/ usually (as these requests are happening pretty close to each other) is stale.

Solution

In order to speed up the fix for this regression, the quick light weight fix is to introduce ZK lookup for datastream status and mark it to STOPPED only if it is in STOPPING state, this will also invalid the cache and update the stale value in cache. This solution will still be useful along with the long term/perm fix described following.

For long term, the real issue here is, we are doing more work or extra work than we should be, we should introduce the in-memory data structure for STOP events that being processed and make sure the same task is being assigned to only one time. In the case of failure, we should call LEADER_DO_ASSIGMENT again. This will make sure that the problem does not occur at the first place. This change is tricky and would be mindful while being made, so the plan is to observe the system with this PR change first.

Copy link
Collaborator

@shrinandthakkar shrinandthakkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! nice bug catch!

success = false;
}
} else {
_log.info("Not in stopping state, dsName={} status={}", zkDatastream.getName(), zkDatastream.getStatus());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Make the logs consistent to have the state logged in upper case, "STOPPED" in lines 1625,1629.

@hshukla hshukla merged commit 417d1f2 into master Jan 16, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants