Only move datastream to stopped state if it is still in STOPPING state #974

hshukla · 2023-12-13T00:27:19Z

When the datastream stop and resume call happens, the datastream will transition to READY -> STOPPING -> STOPPED state, transitioning from STOPPING TO STOPPED is an async process. The LEADER_DO_ASSIGNMENT has a logic on fetching all the datastream on STOPPING state and work on the clean up whenever it gets called, does not matter on the event type, as it is unaware of the type/operation on which it got called.

Sometimes, this behavior is causing the more work and stopping other data streams twice(as there would be one more call later on for stopping datastream).

While this behavior was fine before, recently, with restart improvement work, as mentioned above, we now perform these clean up operation in an async fashion, which leads concurrent processing. The issue happens when the resume comes after stop and this behavior marking datastream to STOPPED state instead of marking it to READY, as the transition happens like READY -> STOPPING -> STOPPED -> READY -> againSTOPPED

Example

When multiple datastreams stop requests come together, each will be put on the event queue separately. The leader coordinator then de-ques those events. Here lets say,
Stop comes for datastream_A at 0th second -> event will be in event queue; eventId=ABC
Stop comes for datastream_B at 10th second -> event will be in event queue; eventId=XYZ

now if leader coordiator de-ques the eventId=ABC at 11th second, it will perform the stop work for both datastreams; datastream_A and datastream_B and marks both of them as STOPPED(lets say at 30th second)

during this time the leader coordinator de-ques the eventId=XYZ (at 25th second), and notices that it is still in STOPPING state and starts the clean up process, this is happening async manner.

meanwhile, restart request for datastream_B comes at lets say 31th(since it was marked STOPPED at 30th second) and the receiving host marks it as READY (at 35th second)

now, on the leader host, the thread which was processing eventId=XYZ finishes the work and marks the datatream_B to STOPPED state again. It looks like we do check the status of datastream before marking it to STOPPED, but that status comes from cache, and cache sometimes/ usually (as these requests are happening pretty close to each other) is stale.

Solution

In order to speed up the fix for this regression, the quick light weight fix is to introduce ZK lookup for datastream status and mark it to STOPPED only if it is in STOPPING state, this will also invalid the cache and update the stale value in cache. This solution will still be useful along with the long term/perm fix described following.

For long term, the real issue here is, we are doing more work or extra work than we should be, we should introduce the in-memory data structure for STOP events that being processed and make sure the same task is being assigned to only one time. In the case of failure, we should call LEADER_DO_ASSIGMENT again. This will make sure that the problem does not occur at the first place. This change is tricky and would be mindful while being made, so the plan is to observe the system with this PR change first.

shrinandthakkar

lgtm! nice bug catch!

shrinandthakkar · 2023-12-14T20:47:10Z

datastream-server/src/main/java/com/linkedin/datastream/server/Coordinator.java

+              success = false;
+            }
+          } else {
+            _log.info("Not in stopping state, dsName={} status={}", zkDatastream.getName(), zkDatastream.getStatus());


nit: Make the logs consistent to have the state logged in upper case, "STOPPED" in lines 1625,1629.

Only move datastream to stopped state if it is still in STOPPING state

d777708

shrinandthakkar previously approved these changes Dec 14, 2023

View reviewed changes

Only move datastream to stopped state if it is still in STOPPING state

555b83d

hshukla dismissed shrinandthakkar’s stale review via 555b83d December 15, 2023 21:15

shrinandthakkar approved these changes Dec 18, 2023

View reviewed changes

atoomula approved these changes Jan 12, 2024

View reviewed changes

hshukla merged commit 417d1f2 into master Jan 16, 2024
1 check passed

hshukla mentioned this pull request Jan 19, 2024

Revert "Added new class to help log assignments" #977

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Only move datastream to stopped state if it is still in STOPPING state #974

Only move datastream to stopped state if it is still in STOPPING state #974

hshukla commented Dec 13, 2023 •

edited

Loading

shrinandthakkar left a comment

shrinandthakkar Dec 14, 2023

Only move datastream to stopped state if it is still in STOPPING state #974

Only move datastream to stopped state if it is still in STOPPING state #974

Conversation

hshukla commented Dec 13, 2023 • edited Loading

shrinandthakkar left a comment

Choose a reason for hiding this comment

shrinandthakkar Dec 14, 2023

Choose a reason for hiding this comment

hshukla commented Dec 13, 2023 •

edited

Loading