
Realtime table consumption resume API#8663

Merged
sajjad-moradi merged 17 commits intoapache:masterfrom
saurabhd336:segmentResumeApi
May 24, 2022

Conversation


@saurabhd336 saurabhd336 commented May 9, 2022

This PR adds controller API support for resuming consumption for realtime tables after deletion of CONSUMING segment from ZK.

GH issue: #6679

Testing Done

Recovery from CONSUMING segment deleted via API

  1. Create a realtime table and let it consume a few rows so that a few ONLINE segments are created.
  2. Delete the latest CONSUMING segment from the PROPERTYSTORE and IDEALSTATE using the segment delete API.
  3. Confirm that invoking the RealtimeSegmentValidationManager task, either via the periodictask/run API or via the auto-scheduled task, doesn't fix the table state.
  4. Confirm that hitting the newly added /tables/{tableName}/resumeConsumption API fixes it by creating new IN_PROGRESS and CONSUMING segments in the PROPERTYSTORE and IDEALSTATE respectively.
  5. Confirm that the start offset of the newly created segment is the same as the latest ONLINE segment's end offset.
  6. Confirm that consumption resumes by pushing messages into the Kafka stream.

Recovery from CONSUMING segment deleted manually
All steps remain the same as above, except that the consuming segment is deleted from ZK manually via the controller UI / a ZK client.
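The testing steps above can be driven from a small client. This is a hypothetical sketch: the controller address and table name are placeholders, and the HTTP verb (POST) is an assumption; only the /tables/{tableName}/resumeConsumption path comes from this PR.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class ResumeConsumptionExample {
  // Builds the request for the resume-consumption endpoint added in this PR.
  // controllerBaseUrl and tableNameWithType are caller-supplied placeholders.
  static HttpRequest buildResumeRequest(String controllerBaseUrl, String tableNameWithType) {
    return HttpRequest.newBuilder()
        .uri(URI.create(controllerBaseUrl + "/tables/" + tableNameWithType + "/resumeConsumption"))
        .POST(HttpRequest.BodyPublishers.noBody())
        .build();
  }

  public static void main(String[] args) {
    HttpRequest request = buildResumeRequest("http://localhost:9000", "myTable_REALTIME");
    System.out.println(request.method() + " " + request.uri());
  }
}
```

Sending the request with java.net.http.HttpClient (or curl) against a running controller would then exercise step 4 of the test plan.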

cc: @sajjad-moradi

@saurabhd336 (Contributor Author):

cc: @npawar @Jackie-Jiang


codecov-commenter commented May 9, 2022

Codecov Report

Merging #8663 (285b114) into master (2e32ec2) will increase coverage by 1.65%.
The diff coverage is 29.68%.

@@             Coverage Diff              @@
##             master    #8663      +/-   ##
============================================
+ Coverage     68.02%   69.68%   +1.65%     
- Complexity     4616     4617       +1     
============================================
  Files          1732     1733       +1     
  Lines         90734    90762      +28     
  Branches      13507    13512       +5     
============================================
+ Hits          61724    63246    +1522     
+ Misses        24672    23109    -1563     
- Partials       4338     4407      +69     
Flag          Coverage Δ
integration1  27.06% <29.68%> (?)
integration2  25.23% <9.37%> (-0.26%) ⬇️
unittests1    66.28% <0.00%> (+<0.01%) ⬆️
unittests2    14.22% <23.43%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../pinot/common/messages/RunPeriodicTaskMessage.java 0.00% <0.00%> (ø)
...er/ControllerUserDefinedMessageHandlerFactory.java 16.66% <0.00%> (-1.29%) ⬇️
...ache/pinot/controller/api/resources/Constants.java 26.31% <ø> (ø)
...es/PinotControllerPeriodicTaskRestletResource.java 0.00% <0.00%> (ø)
...ller/api/resources/PinotRealtimeTableResource.java 0.00% <0.00%> (ø)
...e/pinot/controller/helix/SegmentStatusChecker.java 87.05% <ø> (+1.76%) ⬆️
...ntroller/helix/core/PinotHelixResourceManager.java 66.85% <0.00%> (-0.50%) ⬇️
...controller/helix/core/minion/PinotTaskManager.java 67.29% <ø> (+14.15%) ⬆️
...er/validation/BrokerResourceValidationManager.java 81.25% <ø> (+56.25%) ⬆️
.../core/realtime/PinotLLCRealtimeSegmentManager.java 77.59% <56.00%> (+2.37%) ⬆️
... and 148 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 80db165...285b114. Read the comment docs.

Contributor:

Let's move the actual logic into the PinotLLCRealtimeSegmentManager and inject that into this API class.

Contributor Author:

Ack

Contributor:

We should not ask the user to provide this info. As a starting point, we can use the end offset of the last completed segment as the start offset of the new segment. Then we can add support for all the OffsetCriteria and also the custom offset.

Contributor Author:

I was following the discussion on #6679, and the general consensus seemed to be to first allow explicitly specifying partitionId -> offset to create new consuming segments, and then to support implicitly deriving the offsets from existing segments.
If that's not the case, I will change this.

Contributor Author:

Anyway I've made the changes. We can add support for explicitly specifying offsets on top of that.

Contributor:

This to-do should be addressed in the first version

Contributor Author:

Yes, that's the plan. Will address.

Contributor Author:

Addressed

Contributor:

First check if there is already a consuming segment for this partition

Contributor Author:

Ack

Contributor:

Suggested change
- @ApiOperation(value = "Resume a realtime table", notes = "Resume a segment")
+ @ApiOperation(value = "Resume the consumption of a realtime table", notes = "Resume the consumption of a realtime table")

Contributor Author:

ack

Contributor:

I feel this should be under the table tag

Contributor Author:

Ack

Contributor:

Suggested change
- @Path("/segmentState/{tableName}/resumeRealtimeTable")
+ @Path("/tables/{tableName}/resumeConsumption")

Contributor Author:

Ack

Contributor:

It can take an optional parameter of the partitions to be resumed

@saurabhd336 saurabhd336 force-pushed the segmentResumeApi branch 3 times, most recently from dffc867 to 497d205 Compare May 10, 2022 14:33
@Jackie-Jiang (Contributor) left a comment:

Add @sajjad-moradi to also take a look

@sajjad-moradi (Contributor) left a comment:

There is a design doc for a broader topic "pause/resume consumption on realtime table". The functionality of the endpoint provided in this PR will be part of the resume endpoint mentioned in the design doc.

If there's not an urgent need to have this functionality, I'm going to work on the pause/resume soon so we don't need to refactor the code in this PR when pause/resume is implemented.

But if this functionality is needed ASAP, for now we can have it in the resume endpoint, and later on, when we implement the broader pause/resume feature, we should make sure that the resume endpoint works for paused tables and also supports this PR's functionality. We should be careful that the API is forward compatible with pause/resume.

@saurabhd336 (Contributor Author):

@sajjad-moradi I had a discussion with @npawar on this, and it seems this functionality addresses issues arising from accidental deletion of consuming segments, so its availability is important beyond the Pause / Resume feature too.
From the Pause / Resume design doc's perspective, I see this API fitting the Resume API description fairly well, and the "in one Ideal State update operation, sets the “isTablePaused” flag to false" operation can be added to it, optionally, if the table is indeed paused. Do you see any challenges with that? Are we saying we'd need to write the resume API from scratch when implementing the Pause / Resume feature?

cc: @npawar for comments

@saurabhd336 saurabhd336 changed the title Segment resume API Realtime table consumption resume API May 11, 2022
@sajjad-moradi (Contributor):

Do you see any challenges with that? Are we saying we'd need to write the resume API from scratch when implementing the Pause / Resume feature?

Not really. I saw that in the PR, metadata is used instead of ideal state. So the changes needed are very similar to the resume endpoint in the design doc. Just wanted to make sure these two are aligned. Will review your changes shortly.

@sajjad-moradi (Contributor) left a comment:

It's not easy to add a unit test here. Could you please add a small section to the PR description on how you tested the new functionality?

Comment on lines 1438 to 1441
Contributor:

createNewSegmentZKMetadata can be used here.

Contributor:

tableConfig.getTableName will always give you tableNameWithType

Contributor:

are you intentionally including all new partition groups, instead of just the ones that were deleted and user wants to restore? I don't think it's a good idea to introduce yet another place where all new partition groups are read to start consumption (it already happens in ensureAllPartitionsConsuming)

Contributor Author:

Intentional, yes. But we can discuss this offline. Idea was to keep this as a generic resumeConsumption API, which also happens to double up as a fix for deleted segments.

Contributor:

I see. But it's not going to be a generic resume consumption Api unless all the other cases from ensureAllPartitions are also included. Now we have two ways to resume consumption, both doing only partial cases.

Contributor:

What do you think if we make the following changes to ensureAllPartitionsConsuming method :

ensureAllPartitionsConsuming(..., boolean enforcedByAdmin) {
  ...
  if (enforcedByAdmin) {
    // handle the case where a realtime segment is manually deleted 
  }
  ...
}

Then this method is called in the resume endpoint with enforcedByAdmin = true and in the periodic task with enforcedByAdmin = false.
Later if we have more cases that require admin intervention, we can add them to the if (enforcedByAdmin) {...} block.

Contributor:

this is a great idea. +1 on doing it this way

Contributor Author:

Ack. This question turned out to be redundant since creating new consuming segments for new partitions is already handled by ensureAllPartitionsConsuming.
All in all, the API does everything ensureAllPartitionsConsuming already does, and additionally also creates consuming segments for partitions where CONSUMING segments were deleted.

Contributor:

IMO, having a single flag of enforcedByAdmin can be quite confusing and hard to use because this module (segment manager) should not be aware of the intention of the admin. I'd suggest making the flag specific and self-explanatory, such as recreateDeletedConsumingSegment. In the future, if we want to add other checks, we can introduce other flags, or wrap the flags into an options class, but each flag should be specific. @sajjad-moradi @npawar What's your opinion on this suggestion?

Contributor:

My thought was also a generic name like "fixAllCases" or "fixManualErrors". But I see your point; a self-explanatory name would be a good start, and we can introduce more as and when needed.

Contributor:

That's a valid point. We can start with this specific flag now, and later on, when we add more cases, we can rename the flag to be more generic or add more options.

Contributor Author:

Ack. Renaming the flag. Wrapping these flags into a Config / Props class eventually sounds good to me.
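The flag-specific design discussed above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the method and repair names are made up to show how the periodic task always runs the standard repairs while only an explicit trigger adds the "recreate deleted CONSUMING segment" repair.

```java
import java.util.ArrayList;
import java.util.List;

public class ValidationRepairPlan {
  // Returns the list of repairs ensureAllPartitionsConsuming would perform.
  // The specific, self-explanatory flag gates only the extra admin-requested
  // repair, instead of a generic enforcedByAdmin flag.
  public static List<String> plannedRepairs(boolean recreateDeletedConsumingSegment) {
    List<String> repairs = new ArrayList<>();
    repairs.add("createConsumingSegmentsForNewPartitions"); // always handled
    repairs.add("repairCommitProtocolFailures");            // always handled
    if (recreateDeletedConsumingSegment) {
      // Only when explicitly requested: recreate ZK metadata and the
      // ideal-state entry for a deleted CONSUMING segment.
      repairs.add("recreateDeletedConsumingSegments");
    }
    return repairs;
  }
}
```

The periodic task would call this with false, and the resume endpoint with true.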

Contributor:

we have to ensure that this change plays nicely with the code in ensureAllPartitionsConsuming. That method also creates new zkMetadata and idealState entry, and these 2 should not interfere with each other. You'll have to add some synchronization on table name between the 2 methods

Contributor Author:

With the redesign, this is achieved, although I had to serialize calls to ensureAllPartitionsConsuming since there is a possibility of the API call running concurrently with the periodic task. I think we can improve on this by acquiring table-level locks within the function. I have added that as a TODO since it should not be much of an issue, I feel.

Contributor:

ensureAllPartitionsConsuming called in periodic task only runs on the lead controller of each table while this new resume API can be executed on any controller, so locking/synchronization is not much of a help here.

Contributor Author:

Makes sense. Table-name-level locks won't help much in that case. But synchronizing the method is still needed, to address the, albeit rare, possibility of the API being hit on the lead controller while the periodic task is running.
Since it's a background periodic task, the overhead of the synchronized function shouldn't be much, I feel. Any suggestions on this?

@npawar (Contributor), May 18, 2022:

Synchronization is still needed for the same controller, yes, but the other case is more concerning. For instance, on the 2 controllers (one via the periodic task thread, the other via the resumeConsumption thread), we could enter the setupNewPartitionGroup method at the same time and end up with 2 ZK metadata entries and 2 CONSUMING segments for a new partition. Many more such cases could come up.
Wondering now if adding a back-door entry to this method from any controller is a good idea. This won't happen if only the periodic task is allowed to execute that method (be it scheduled or manual). So 1) should this flag just be part of the manual periodic task trigger options, and then 2) should this API internally just invoke the periodic task?

Contributor:

Good point. So it can potentially lead to race condition if 2 runs happen at exactly the same time, and both create the new segment ZK metadata. One of the segment ZK metadata will be picked and continue, the other will be left orphan.
@saurabhd336 Let's change ensureAllPartitionsConsuming to synchronized, and add a TODO describing this potential race condition. We should probably clean up the orphan segment ZK metadata if it is not in the ideal state. This can be handled in a separate PR.

Contributor:

We don't need to do so if we go with manually kicking the RealtimeSegmentValidation job as Neha suggested. The code in PinotControllerPeriodicTaskRestletResource can be modified a bit to achieve this.

Contributor Author:

@sajjad-moradi Not sure I understand 'manually kicking RealtimeSegmentValidation job'. That still means ensureAllPartitionsConsuming can get called concurrently by the API thread and the scheduled thread?

Contributor:

@saurabhd336 just think of the periodic task. There's an API to invoke a periodic task manually (look for the Periodic Task tab in Swagger). Whether it runs by schedule or by manual trigger, a periodic task can only run on the lead controller of a table, so that eliminates cross-controller clashes. As for the same controller, if you look at BasePeriodicTask, you'll see it doesn't let more than one task run at once.

So the suggestion is, rely on this mechanism as we already know this works and prevents clashes. One way to rely on this mechanism is by introducing a query param to the runPeriodicTask API with our extra config and pass it to the periodic task execution. Another way could be to keep your new API, but again use same mechanism happening in runPeriodicTask API to trigger the task.

does this help clarify? We can chat on the slack channel if need be.
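The property-passing mechanism described above can be sketched like this. The property key name below is an assumption modeled on the flag discussed in this thread; the actual constant in the PR may differ.

```java
import java.util.Properties;

public class ResumeViaPeriodicTask {
  // Hypothetical key for the flag carried from the resume API to the
  // RealtimeSegmentValidationManager run via the periodic task framework.
  static final String RECREATE_DELETED_CONSUMING_SEGMENT_KEY = "recreateDeletedConsumingSegment";

  // Builds the extra properties attached to the on-demand task trigger.
  static Properties buildTaskProperties(boolean recreateDeletedConsumingSegment) {
    Properties props = new Properties();
    props.setProperty(RECREATE_DELETED_CONSUMING_SEGMENT_KEY,
        Boolean.toString(recreateDeletedConsumingSegment));
    return props;
  }
}
```

The task's preprocess step would then read the flag back with Boolean.parseBoolean on the property value.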

Contributor Author:

I did not have full context on periodic task execution via the API. But now I've updated the PR with:

  1. Allowing custom params to be passed to on-demand task execution via Helix messages.
  2. Changing the API to use periodic task execution instead of directly calling the method.

Please have a look @npawar @Jackie-Jiang @sajjad-moradi

@saurabhd336 saurabhd336 force-pushed the segmentResumeApi branch 2 times, most recently from 51b4dc1 to 5f0fd4d Compare May 17, 2022 09:30
@saurabhd336 (Contributor Author):

@npawar @Jackie-Jiang @sajjad-moradi I've updated the PR as per the comments. Please have a look.

@saurabhd336 saurabhd336 force-pushed the segmentResumeApi branch 2 times, most recently from 6004072 to 256e678 Compare May 17, 2022 14:29
Comment on lines 1162 to 1181
Contributor:

Can we extract some common logic to a separate helper method? It is almost identical to the all-OFFLINE scenario, except that it uses the end offset of the latest segment instead of the start offset.

Contributor Author:

Added a new function for this

Contributor:

It should work without synchronization because it is performing a check and write

Contributor:

Let's update the javadoc for this method to include the new scenario:

If the consuming segment is deleted:
Check whether there are segments in the PROPERTYSTORE with status DONE, but no new segment in status IN_PROGRESS, and the state for the latest segment in the IDEALSTATE is ONLINE

(Note that this is very similar to the failure between step-1 and step-2, the only difference is the state in the ideal state)
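The detection condition described above can be expressed as a small predicate. This is a hypothetical sketch with made-up parameter names; it only mirrors the javadoc scenario, not the actual code in PinotLLCRealtimeSegmentManager.

```java
public class LlcRepairCheck {
  // True when the scenario above holds: the latest segment ZK metadata is
  // DONE, there is no IN_PROGRESS successor in the PROPERTYSTORE, and the
  // latest segment's replicas in the IDEALSTATE are all ONLINE.
  static boolean consumingSegmentWasDeleted(String latestZkStatus,
      boolean hasInProgressSuccessor, boolean allReplicasOnline) {
    return "DONE".equals(latestZkStatus) && !hasInProgressSuccessor && allReplicasOnline;
  }

  public static void main(String[] args) {
    System.out.println(consumingSegmentWasDeleted("DONE", false, true)); // true
  }
}
```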

Contributor Author:

It should work without synchronization because it is performing a check and write

Wasn't aware of this. It looks like Helix itself guarantees that concurrent updates are handled safely. Removed synchronization.

Let's update the javadoc for this method to include the new scenario:

If the consuming segment is deleted:
Check whether there are segments in the PROPERTYSTORE with status DONE, but no new segment in status IN_PROGRESS, and the state for the latest segment in the IDEALSTATE is ONLINE

(Note that this is very similar to the failure between step-1 and step-2, the only difference is the state in the ideal state)

Ack. Added.

Contributor:

The new scenario is very similar to 1, so we can keep them together:
We first check if the latestSegmentZKMetadata is DONE, then check if instanceStateMap contains CONSUMING segment.

Contributor Author:

Ack

@Jackie-Jiang (Contributor) left a comment:

Mostly good

Contributor:

(minor) Fix the numbering (currently there are two items with index 2)

Contributor Author:

Ack

Contributor:

Suggested change
- // If we get here, that means in IdealState, the latest segment has no CONSUMING replicas, but has
+ // If we get here, that means in IdealState, the latest segment has all replicas ONLINE

Contributor:

Let's add some checks before fixing the segment:

  • latestSegmentZKMetadata.getStatus() == Status.DONE
  • isAllInstancesInState(instanceStateMap, SegmentStateModel.ONLINE)

Contributor Author:

Ack

@Jackie-Jiang (Contributor) left a comment:

LGTM. @npawar @sajjad-moradi Can you please also take another look

Contributor:

(nit)

Suggested change
- // If we get here, that means in IdealState, the latest segment has all replicas ONLINE, but no
- // CONSUMING segments.
+ // If we get here, that means in IdealState, the latest segment has all replicas ONLINE.

Contributor Author:

Ack

Comment on lines 1146 to 1155
Contributor:

(code style) Let's reformat this part

Contributor Author:

Ack

Contributor:

We may take a Map<String, String> here and use the map field from the ZNRecord

Contributor Author:

Ack

Contributor:

Suggest naming it "properties" to be consistent with the name in RunPeriodicTaskMessageHandler

Contributor Author:

Ack. Renamed to 'taskProperties'

Contributor:

Here we can write the key-value pairs of the map into the Properties

Contributor Author:

Ack

Contributor:

(minor) Make "RealtimeSegmentValidationManager" a constant

Contributor Author:

Ack

Contributor:

Suggest making it more explicit that this is a resume consumption request

Contributor Author:

Ack

Contributor:

Suggest passing the whole Properties into the processTables to be more flexible

Contributor Author:

Ack

Contributor:

Same here, passing the whole Properties into preprocess

Contributor Author:

Ack

Contributor:

We shouldn't need to catch this exception when the key-values are already put into the properties

Contributor Author:

Moving to Properties instead of JSON means preprocess no longer throws exceptions. Ack.

Contributor:

This is already an inner class of RealtimeSegmentValidationManager, so no need to have a separate wrapper class for the parameters. Suggest directly adding _recreateDeletedConsumingSegment into the Context class

Contributor Author:

Ack

Contributor:

This part is almost identical to the PinotControllerPeriodicTaskRestletResource, let's add a TODO here to extract the common part in the future

Contributor Author:

Ack

Contributor:

IMO this is a technical debt. The code as well as comments are duplicated here. We can simply refactor it to a class (or a util method), something like PeriodicTaskTriggerer.sendRunPeriodicTaskMessage() or PeriodicTaskTriggerUtil.sendRunPeriodicTaskMessage().

Contributor Author:

Ack. PinotHelixResourceManager seemed like the most natural place for the method to reside. Have moved it there and changed both the classes to use it. Please check.

Contributor:

Mention here that this happens only if the special flag is set

Contributor Author:

Ack

@saurabhd336 saurabhd336 force-pushed the segmentResumeApi branch 2 times, most recently from 96d9504 to 777ce7a Compare May 23, 2022 06:16
@sajjad-moradi (Contributor) left a comment:

LGTM, with a few minor comments.


Contributor:

Move the new comments here. Up to this line, the comments explain the different controller failure cases in the segment commit protocol. Let's not break those cases apart. Also, let's change it to If the consuming segment is deleted by the user, intentionally or by mistake:...

Contributor Author:

Ack

Contributor:

Maybe add ... is set to true, which means it's manually triggered by the admin, not by the automatic periodic task

Contributor Author:

Ack

Comment on lines 1143 to 1144
Contributor:

Move this to the next if statement.

Contributor Author:

Ack

@sajjad-moradi (Contributor):

Please add a Testing Done section to the PR description that describes your manual testing. It should cover two scenarios:

  • Delete a consuming segment using segment delete endpoint (which deletes both IS and segZkMetadata) and then call resume endpoint
  • Manually delete IS entries for a consuming segment in ZK and then call resume endpoint

Comment on lines 86 to 89
Contributor:

(minor) Can be simplified

Suggested change
- if (periodicTaskProperties.containsKey(RECREATE_DELETED_CONSUMING_SEGMENT_KEY)) {
-   context._recreateDeletedConsumingSegment =
-       Boolean.parseBoolean(periodicTaskProperties.getProperty(RECREATE_DELETED_CONSUMING_SEGMENT_KEY));
- }
+ context._recreateDeletedConsumingSegment =
+     Boolean.parseBoolean(periodicTaskProperties.getProperty(RECREATE_DELETED_CONSUMING_SEGMENT_KEY));
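The simplification works because Boolean.parseBoolean is null-safe, so the containsKey guard is redundant when false is the desired default:

```java
public class ParseBooleanDefault {
  public static void main(String[] args) {
    // Boolean.parseBoolean(String) returns true only for a non-null input
    // equal to "true" (case-insensitive); null yields false, not an exception.
    System.out.println(Boolean.parseBoolean(null));   // false
    System.out.println(Boolean.parseBoolean("true")); // true
  }
}
```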

Contributor Author:

Ack

Contributor:

Revert

Contributor Author:

Ack
