
Use constant value supplier to measure ingestion delay metrics #12957

Draft · wants to merge 1 commit into master

Conversation

@soumitra-st (Contributor) commented Apr 17, 2024

Problem

In multiple Pinot deployments, we have seen the REALTIME_INGESTION_DELAY_MS and END_TO_END_REALTIME_INGESTION_DELAY_MS metrics monotonically grow even though there were no active Kafka consumers.

How did I figure out that there were no active Kafka consumers?

The consumingSegmentsInfo API returned the following response:

```json
{
  "_segmentToConsumingInfoMap": {
    "telemetry__0__2399__20240415T1650Z": [],
    "telemetry__1__2435__20240415T1600Z": [],
    "telemetry__2__757__20240415T1632Z": [],
    "telemetry__3__2413__20240415T1642Z": []
  }
}
```

The above response shows that consumers are not connected. I also checked the thread dump, and there are no Kafka consumer threads.
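For intuition, here is a minimal, self-contained sketch of the failure mode (the class and field names are hypothetical, not Pinot's actual code): a gauge backed by a wall-clock supplier keeps climbing once the last-ingestion timestamp stops being updated.

```java
import java.util.function.LongSupplier;

public class MonotonicDelayDemo {
  // Last time a message was ingested; frozen once the consumer goes away.
  private static volatile long lastIngestedMs = System.currentTimeMillis();

  public static void main(String[] args) throws InterruptedException {
    // A gauge whose value is recomputed against the wall clock on every read.
    LongSupplier ingestionDelayMs = () -> System.currentTimeMillis() - lastIngestedMs;

    // Nothing is updating lastIngestedMs, so each scrape reads a larger value:
    for (int scrape = 0; scrape < 3; scrape++) {
      Thread.sleep(1000); // one "metric emission interval"
      System.out.println("scrape " + scrape + ": delay=" + ingestionDelayMs.getAsLong() + "ms");
    }
    // The value grows by ~1000ms per scrape: the monotonic ramp seen in the graphs below.
  }
}
```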

[Image: zoomed-out-ingestion-delay, REALTIME_INGESTION_DELAY_MS growing monotonically]

[Image: zoomed-in-ingestion-delay, zoomed-in view showing REALTIME_INGESTION_DELAY_MS growing every minute]

Solution

Changed the supplier to provide a constant value when publishing the REALTIME_INGESTION_DELAY_MS and END_TO_END_REALTIME_INGESTION_DELAY_MS metrics.

Ingestion delay metrics are published for every batch of messages fetched from Kafka HERE, so the metrics will continue to update correctly while Kafka consumers are active.
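As a rough sketch of the difference between the two supplier styles (names are hypothetical; the actual change lives in IngestionDelayTracker): the dynamic supplier recomputes against the wall clock on every read, while the constant supplier freezes the value measured at publish time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

public class SupplierStyles {
  private final Map<String, LongSupplier> gauges = new HashMap<>();
  private volatile long lastIngestedMs;

  // Before: the gauge recomputes the delay against the wall clock on every
  // read, so it keeps growing after the consumer stops updating lastIngestedMs.
  void registerDynamicGauge(String name) {
    gauges.put(name, () -> System.currentTimeMillis() - lastIngestedMs);
  }

  // After (this PR's approach): capture the delay once, at publish time.
  // The gauge then reports that constant until the next batch updates it.
  void publishConstantGauge(String name, long measuredDelayMs) {
    gauges.put(name, () -> measuredDelayMs);
  }
}
```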

bugfix

@codecov-commenter commented Apr 17, 2024

Codecov Report

Attention: Patch coverage is 50.00000%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 62.16%. Comparing base (59551e4) to head (81a0382).
Report is 318 commits behind head on master.

| Files | Patch % | Lines |
|-------|---------|-------|
| ...e/data/manager/realtime/IngestionDelayTracker.java | 50.00% | 2 Missing ⚠️ |
Additional details and impacted files
```
@@             Coverage Diff              @@
##             master   #12957      +/-   ##
============================================
+ Coverage     61.75%   62.16%   +0.41%
+ Complexity      207      198       -9
============================================
  Files          2436     2502      +66
  Lines        133233   136468    +3235
  Branches      20636    21120     +484
============================================
+ Hits          82274    84840    +2566
- Misses        44911    45344     +433
- Partials       6048     6284     +236
```
| Flag | Coverage Δ |
|------|------------|
| custom-integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration2 | 0.00% <0.00%> (ø) |
| java-11 | 35.09% <50.00%> (-26.62%) ⬇️ |
| java-21 | 28.01% <0.00%> (-33.62%) ⬇️ |
| skip-bytebuffers-false | 62.15% <50.00%> (+0.40%) ⬆️ |
| skip-bytebuffers-true | 28.00% <0.00%> (+0.27%) ⬆️ |
| temurin | 62.16% <50.00%> (+0.41%) ⬆️ |
| unittests | 62.16% <50.00%> (+0.41%) ⬆️ |
| unittests1 | 46.63% <50.00%> (-0.26%) ⬇️ |
| unittests2 | 28.01% <0.00%> (+0.27%) ⬆️ |

Flags with carried forward coverage won't be shown.


@Jackie-Jiang (Contributor) left a comment

I don't think this is the correct fix. We rely on the callback to track the ingestion delay.
I feel the real issue is this metric is not removed after the consumer is destroyed.
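For illustration, if the cleanup route suggested here were taken, the shape of the fix would be roughly as follows (hypothetical names; Pinot's real teardown hook lives elsewhere): drop the gauge when the consumer for a partition is destroyed, so no stale supplier keeps emitting.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

public class GaugeCleanupSketch {
  private final Map<Integer, LongSupplier> delayGauges = new ConcurrentHashMap<>();

  void trackPartition(int partitionId, LongSupplier delaySupplier) {
    delayGauges.put(partitionId, delaySupplier);
  }

  // Called from the consumer's destroy/shutdown path: removing the gauge
  // stops emission entirely instead of letting a stale supplier keep climbing.
  void stopTrackingPartition(int partitionId) {
    delayGauges.remove(partitionId);
  }
}
```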

@soumitra-st (Contributor, Author) commented

> I don't think this is the correct fix. We rely on the callback to track the ingestion delay. I feel the real issue is this metric is not removed after the consumer is destroyed.

There are two issues:

1. Why is the metric not removed when the consumer is destroyed?
2. Why does the metric keep increasing when no consumer exists?

This PR addresses the second issue. In fact, the happy path publishes the metric after every batch at `updateIngestionDelay(messageBatch.getLastMessageMetadata())` (see the sketch after this comment).

IMO, we need to fix both issues.
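For context, a hedged sketch of that happy path (the metadata type and its accessor are stand-ins; the exact Pinot names may differ): the delay is the age of the last message in the fetched batch, recomputed on every fetch, so it only updates while a consumer is actively fetching.

```java
public class BatchDelaySketch {
  // Stand-in for the stream message metadata carried with each fetched batch.
  record MessageMetadata(long recordIngestionTimeMs) {}

  // Per-batch update: delay is the age of the last message in the batch.
  // While a consumer is active this runs after every fetch, so the gauge
  // tracks real lag; once fetching stops, nothing calls it anymore.
  static long updateIngestionDelay(MessageMetadata lastMessage) {
    return System.currentTimeMillis() - lastMessage.recordIngestionTimeMs();
  }

  public static void main(String[] args) {
    MessageMetadata last = new MessageMetadata(System.currentTimeMillis() - 250);
    System.out.println("ingestion delay ~" + updateIngestionDelay(last) + "ms");
  }
}
```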

@Jackie-Jiang (Contributor) commented

It is under the `if (previousMeasure == null)` branch, which means this PR will completely turn off the gauge...
I don't see this as a problem because we want to get the delay dynamically. The real issue is why the metric is not removed.

@soumitra-st soumitra-st marked this pull request as draft April 23, 2024 01:41