Skip to content

Replace expensive TaskDriver.getWorkflows() with lightweight getChildNames()#17779

Open
KKcorps wants to merge 3 commits intoapache:masterfrom
KKcorps:fix/avoid-getWorkflows-heavy-zk-read
Open

Replace expensive TaskDriver.getWorkflows() with lightweight getChildNames()#17779
KKcorps wants to merge 3 commits intoapache:masterfrom
KKcorps:fix/avoid-getWorkflows-heavy-zk-read

Conversation

@KKcorps
Copy link
Contributor

@KKcorps KKcorps commented Feb 27, 2026

Summary

TaskDriver.getWorkflows() internally calls getChildValuesMap() on Helix resource configs, which downloads the full content of every resource config ZNode from ZooKeeper. This is extremely expensive when there are many resources (tables, workflows, jobs) and has been found to cause OOMs on controller

All three callers in PinotHelixTaskResourceManager only needed the workflow names (keys), not the actual WorkflowConfig values:

  • getTaskTypes() — uses .keySet()
  • getTaskQueues() — uses .keySet()
  • getRunningTaskCountsPerMinion() — uses .keySet()

This PR replaces those calls with HelixDataAccessor.getChildNames(keyBuilder().resourceConfigs()), which fetches only ZNode names (a lightweight metadata-only ZK call), filtered to names with the TaskQueue_ prefix.

Changes

  • Added private getTaskQueueNames() helper in PinotHelixTaskResourceManager that uses getChildNames() + prefix filter
  • Replaced all 3 usages of _taskDriver.getWorkflows().keySet() with the new helper
  • Updated tests to create ZK resource config ZNodes (since getChildNames reads from ZK, not the mocked TaskDriver)

Test plan

  • All 16 existing tests in PinotHelixResourceManagerMinionStatusTest pass
  • Compilation and checkstyle pass

…Names()

TaskDriver.getWorkflows() internally calls getChildValuesMap() which
downloads the full content of every resource config ZNode from ZooKeeper.
All callers in PinotHelixTaskResourceManager only need the workflow names
(keys), not the actual WorkflowConfig values. This replaces those calls
with HelixDataAccessor.getChildNames(resourceConfigs()) which fetches
only ZNode names, filtered to the TaskQueue_ prefix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@KKcorps KKcorps requested a review from shounakmk219 February 27, 2026 07:51
@codecov-commenter
Copy link

codecov-commenter commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 63.20%. Comparing base (063b6e4) to head (14ca0b7).
⚠️ Report is 36 commits behind head on master.

Files with missing lines Patch % Lines
...lix/core/minion/PinotHelixTaskResourceManager.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17779      +/-   ##
============================================
- Coverage     63.24%   63.20%   -0.04%     
  Complexity     1454     1454              
============================================
  Files          3176     3182       +6     
  Lines        190984   191367     +383     
  Branches      29202    29268      +66     
============================================
+ Hits         120780   120963     +183     
- Misses        60784    60964     +180     
- Partials       9420     9440      +20     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (?)
java-11 63.18% <0.00%> (-0.04%) ⬇️
java-21 63.17% <0.00%> (+0.01%) ⬆️
temurin 63.20% <0.00%> (-0.04%) ⬇️
unittests 63.20% <0.00%> (-0.04%) ⬇️
unittests1 55.60% <ø> (+0.02%) ⬆️
unittests2 34.12% <0.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Kartik Khare and others added 2 commits February 27, 2026 15:00
Resource configs also contain individual job entries like
TaskQueue_Type_Task_Type_12345. Filter these out by excluding
names containing _Task_ separator, keeping only the task queue
workflow entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use doReturn().when(spy) instead of when(spy.method()).thenReturn()
to avoid calling the real getTaskTypes() during stub setup, which
now accesses HelixDataAccessor via the (unmocked) resource manager.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@KKcorps KKcorps requested a review from shounakmk219 February 27, 2026 11:13
* contain {@link #TASK_PREFIX} e.g. {@code TaskQueue_Type_Task_Type_12345}). This avoids the expensive
* {@link TaskDriver#getWorkflows()} call which reads all resource config values.
*/
private List<String> getTaskQueueNames() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have this method return a set ? this will avoid changing all the clients.

HelixDataAccessor accessor = _helixResourceManager.getHelixZkManager().getHelixDataAccessor();
List<String> resourceConfigNames = accessor.getChildNames(accessor.keyBuilder().resourceConfigs());
return resourceConfigNames.stream()
.filter(name -> name.startsWith(TASK_QUEUE_PREFIX) && !name.contains(TASK_NAME_SEPARATOR + TASK_PREFIX))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a hack. I am not sure if this is going to be easy to maintain & verify.


PinotHelixTaskResourceManager spyMgr = Mockito.spy(mgr);
when(spyMgr.getTaskTypes()).thenReturn(Collections.emptySet());
Mockito.doReturn(Collections.emptySet()).when(spyMgr).getTaskTypes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we have to change this? Is it just a syntactical thing, or is there something which was wrong with the older code?

// Create workflow context with jobs for the task queue
WorkflowContext workflowContext = mock(WorkflowContext.class);
when(mockTaskDriver.getWorkflowContext("TestWorkflow1")).thenReturn(workflowContext);
when(mockTaskDriver.getWorkflowContext(TEST_TASK_QUEUE)).thenReturn(workflowContext);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the name getting changed from TestWorkflow1 to TaskQueue_TestType ?

/**
* Creates a resource config ZNode in ZK so that getChildNames(resourceConfigs()) returns the task queue name.
*/
private void createTaskQueueResourceConfig() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not fully clear why this test itself needed to be modified. The change in the main class is an internal change. Why should it have the tests affected?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aah ok got it. its because TaskDriver is mocked earlier. but now we need Znodes

*/
private void createTaskQueueResourceConfig() {
HelixDataAccessor accessor = _helixResourceManager.getHelixZkManager().getHelixDataAccessor();
accessor.setProperty(accessor.keyBuilder().resourceConfig(TEST_TASK_QUEUE),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should set more properties here, similar to how the actual Zookeeper data will be. That will just help in increasing the comprehensiveness of the test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants