Skip to content

save task generator info to ZK#9011

Closed
zhtaoxiang wants to merge 4 commits intoapache:masterfrom
zhtaoxiang:save-task-generator-info-to-zk
Closed

save task generator info to ZK#9011
zhtaoxiang wants to merge 4 commits intoapache:masterfrom
zhtaoxiang:save-task-generator-info-to-zk

Conversation

@zhtaoxiang
Copy link
Contributor

@zhtaoxiang zhtaoxiang commented Jul 4, 2022

Changes

Save task generator info to ZK to help with task generator debug. Specifically
1/ the ZK path is /MINION_TASK_GENERATOR_INFO/${tableNameWithType}/${taskType}. We don't reuse the /MINION_TASK_METADATA/${tableNameWithType}/${taskType} because we don't want to mess up the existing ZNodes.
2/ save the most recent 5 success task generation run timestamp to ZK
3/ save the most recent 5 error task generation run timestamp and message to ZK

This logic change is not on the main data or control path, so it should be safe.

@codecov-commenter
Copy link

codecov-commenter commented Jul 4, 2022

Codecov Report

Merging #9011 (4dad0bf) into master (eda488f) will decrease coverage by 43.56%.
The diff coverage is 50.00%.

❗ Current head 4dad0bf differs from pull request most recent head 7042c18. Consider uploading reports for the commit 7042c18 to get more accurate results

@@              Coverage Diff              @@
##             master    #9011       +/-   ##
=============================================
- Coverage     70.01%   26.45%   -43.57%     
+ Complexity     4959        1     -4958     
=============================================
  Files          1826     1818        -8     
  Lines         95975    95745      -230     
  Branches      14350    14328       -22     
=============================================
- Hits          67197    25327    -41870     
- Misses        24134    68012    +43878     
+ Partials       4644     2406     -2238     
Flag Coverage Δ
integration1 26.45% <50.00%> (+0.18%) ⬆️
integration2 ?
unittests1 ?
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...he/pinot/common/utils/helix/FakePropertyStore.java 0.00% <0.00%> (-56.53%) ⬇️
...egment/local/utils/tablestate/TableStateUtils.java 0.00% <0.00%> (ø)
...che/pinot/common/minion/BaseTaskGeneratorInfo.java 20.00% <20.00%> (ø)
...controller/helix/core/minion/PinotTaskManager.java 33.53% <39.13%> (-33.77%) ⬇️
...he/pinot/common/minion/TaskGeneratorInfoUtils.java 56.25% <56.25%> (ø)
.../common/minion/TaskGeneratorMostRecentRunInfo.java 70.68% <70.68%> (ø)
...on/minion/TaskGeneratorMostRecentRunInfoUtils.java 80.00% <80.00%> (ø)
...ache/pinot/common/metadata/ZKMetadataProvider.java 65.16% <100.00%> (-2.07%) ⬇️
.../java/org/apache/pinot/spi/utils/BooleanUtils.java 0.00% <0.00%> (-100.00%) ⬇️
...java/org/apache/pinot/spi/trace/BaseRecording.java 0.00% <0.00%> (-100.00%) ⬇️
... and 1339 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eda488f...7042c18. Read the comment docs.

for (TableConfig tableConfig : enabledTableConfigs) {
try {
TaskGeneratorMostRecentRunInfoUtils.saveErrorRunMessageToZk(_pinotHelixResourceManager.getPropertyStore(),
tableConfig.getTableName(), taskGenerator.getTaskType(), System.currentTimeMillis(), e.getMessage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should get the whole stack trace, not just the message

@mcvsubbu
Copy link
Contributor

Helix provides a history of all tasks queued. I believe there is a way to configure the retention time of the history before deleting it. Does that not work for you?
cc: @jackjlli , @snleee

Also, can we do better than a Utils class with static methods? (Hard to unit test).

@npawar
Copy link
Contributor

npawar commented Jul 11, 2022

Helix provides a history of all tasks queued. I believe there is a way to configure the retention time of the history before deleting it. Does that not work for you? cc: @jackjlli , @snleee

Also, can we do better than a Utils class with static methods? (Hard to unit test).

This PR is for Generator side errors I believe, not for the task errors

@mcvsubbu
Copy link
Contributor

Helix provides a history of all tasks queued. I believe there is a way to configure the retention time of the history before deleting it. Does that not work for you? cc: @jackjlli , @snleee
Also, can we do better than a Utils class with static methods? (Hard to unit test).

This PR is for Generator side errors I believe, not for the task errors

Hmm.. so we are using zk like a log? Why not just add a metric and set alerts on it?

You can also pipe logs into log processors.

We don't want to pollute ZK with information that can be got from other means.

@npawar
Copy link
Contributor

npawar commented Jul 12, 2022

Helix provides a history of all tasks queued. I believe there is a way to configure the retention time of the history before deleting it. Does that not work for you? cc: @jackjlli , @snleee
Also, can we do better than a Utils class with static methods? (Hard to unit test).

This PR is for Generator side errors I believe, not for the task errors

Hmm.. so we are using zk like a log? Why not just add a metric and set alerts on it?

You can also pipe logs into log processors.

We don't want to pollute ZK with information that can be got from other means.

I think it's okay, given we're limiting it to just 5 recent tasks. Most minion related errors and confusion is seen by users when getting started. Given the popularity of trying to use the minion tasks for ingestion and rollups, we're seeing an influx of questions in the community slack, about minion related debugging. Most of the time, the users are asking if there's an API to quickly see the scheduling side and task execution side errors. Plus during the getting started phase, metrics reporting backend, alerting or log processors isn't usually setup.
This will be very helpful for such users to quickly debug their issues during the getting started process. This will be used in the Minion UI tab that's come up in the Pinot UI now: #8978

@npawar
Copy link
Contributor

npawar commented Jul 13, 2022

Closing this as the work is being picked up by @saurabhd336 in #9043

@npawar npawar closed this Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants