Fix negative lag metrics issue + improve API design for partition lag #17060

panhongan · 2024-09-13T15:30:40Z

Fixes #XXXX.
org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor
org.apache.druid.indexing.kafka.supervisor.KafkaSupervisor
org.apache.druid.indexing.kinesis.supervisor.KinesisSupervisor
org.apache.druid.indexing.rabbitstream.supervisor.RabbitStreamSupervisor

Description

Problem 1: negative lag in 2 scenarios
S1: Kafka itself is healthy, but negative lag is reported occasionally because of a thread-safety issue. (We should fix this.)
S2: If Druid can't connect to Kafka, or the Kafka connection is broken, negative lag is reported. (This negative value is useful to applications and should not be skipped.)

Skipping emission whenever the lag is negative is a bad idea.
For S1, there appears to be no impact on data ingestion; only the emitted metrics are missing at a few points in time.
For S2, some companies have already built monitoring on negative partition lag; if negative lag is not reported, their monitors will stop working.
Back to the negative lag issue: I think we should fix it instead of skipping it.

For S1, why is there negative lag?
In class SeekableStreamSupervisor:

updateCurrentAndLatestOffsets() : PT30S
--updateCurrentOffsets() -> get task reading offset : OFFSET1
--updatePartitionLagFromStream() -> get partition writing end offset : OFFSET2

In another thread:
/druid/indexer/v1/supervisor//status -> SeekableStreamSupervisor::getStatus() -> SeekableStreamSupervisor::generateReport() -> calculates lag = OFFSET2 - OFFSET1

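When does the negative lag occur?

updateCurrentAndLatestOffsets() : started, updateCurrentOffsets() executed (OFFSET1 is fresh)
/druid/indexer/v1/supervisor//status : invoked
updatePartitionLagFromStream() : not yet executed (OFFSET2 is still the old value)

The status call therefore subtracts a fresh OFFSET1 from a stale OFFSET2, which can give a negative result.
So the idea is to make OFFSET1 & OFFSET2 always come from the same version, i.e. the same refresh cycle.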
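
One way to picture the "same version" idea is to publish both offset maps together as a single immutable snapshot, so a status request never mixes a fresh OFFSET1 with a stale OFFSET2. The following is only a minimal sketch under that assumption, not the code in this PR: OffsetSnapshotSketch, OffsetSnapshot, publish() and reportLag() are hypothetical names, and partition ids are simplified to Integer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; none of these names come from the Druid code base.
class OffsetSnapshotSketch {

    // Immutable pair of offset maps captured by the same refresh cycle.
    static final class OffsetSnapshot {
        final Map<Integer, Long> currentOffsets; // OFFSET1 per partition (task read position)
        final Map<Integer, Long> endOffsets;     // OFFSET2 per partition (stream end position)

        OffsetSnapshot(Map<Integer, Long> currentOffsets, Map<Integer, Long> endOffsets) {
            // Defensive copies so later mutation by the caller cannot change the snapshot.
            this.currentOffsets = Collections.unmodifiableMap(new HashMap<>(currentOffsets));
            this.endOffsets = Collections.unmodifiableMap(new HashMap<>(endOffsets));
        }

        // lag = OFFSET2 - OFFSET1, computed only from this snapshot.
        Map<Integer, Long> lag() {
            Map<Integer, Long> lag = new HashMap<>();
            for (Map.Entry<Integer, Long> end : endOffsets.entrySet()) {
                Long current = currentOffsets.get(end.getKey());
                if (current != null) {
                    lag.put(end.getKey(), end.getValue() - current);
                }
            }
            return lag;
        }
    }

    private final AtomicReference<OffsetSnapshot> latest = new AtomicReference<>(
        new OffsetSnapshot(Collections.<Integer, Long>emptyMap(), Collections.<Integer, Long>emptyMap()));

    // Called by the periodic refresh once BOTH offsets have been fetched.
    void publish(Map<Integer, Long> currentOffsets, Map<Integer, Long> endOffsets) {
        latest.set(new OffsetSnapshot(currentOffsets, endOffsets));
    }

    // Called by the status thread; reads one consistent snapshot.
    Map<Integer, Long> reportLag() {
        return latest.get().lag();
    }

    public static void main(String[] args) {
        OffsetSnapshotSketch sketch = new OffsetSnapshotSketch();
        sketch.publish(Collections.singletonMap(0, 100L), Collections.singletonMap(0, 120L));
        System.out.println(sketch.reportLag()); // prints {0=20}
    }
}
```

With this shape, a status call that arrives in the middle of a refresh still sees the previous, internally consistent snapshot. A negative value can then only come from the stream itself (scenario S2), not from the interleaving described above.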

=================================================================================
Problem 2: Poor design of getPartitionRecordLag() & getPartitionTimeLag()

We want to support 2 kinds of partition lag, but KafkaSupervisor only needs record lag and KinesisSupervisor only needs time lag.
For extensibility: if we want to add another type of partition lag, do we plan to add yet another method like getPartition*Lag()?
Every existing *Supervisor class would then have to implement the new method, which does not make sense.
From a design standpoint, providing 1 method whose return value carries the lag type is better than providing 2 separate methods.

protected abstract Pair<StreamPartitionLagType, Map<PartitionIdType, Long>> getPartitionLag();

vs

protected abstract Map<PartitionIdType, Long> getPartitionRecordLag();
protected abstract Map<PartitionIdType, Long> getPartitionTimeLag();
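
To illustrate the single-method shape, here is a small self-contained sketch. It is an assumption-laden stand-in, not the PR's code: LagTypeSketch and its constants RECORD and TIME are hypothetical (the constants of StreamPartitionLagType are not listed here), Map.Entry replaces Druid's Pair, and the supervisor classes and lag values are invented for the example.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Collections;
import java.util.Map;

// Hypothetical lag kinds; stands in for StreamPartitionLagType.
enum LagTypeSketch { RECORD, TIME }

// Hypothetical base class with a single extension point for partition lag.
abstract class SupervisorSketch<PartitionIdType> {
    // Map.Entry is used here as a stand-in for Pair<type, lag-per-partition>.
    protected abstract Map.Entry<LagTypeSketch, Map<PartitionIdType, Long>> getPartitionLag();

    // Callers branch on the returned type instead of calling one method per lag kind.
    String describeLag() {
        Map.Entry<LagTypeSketch, Map<PartitionIdType, Long>> lag = getPartitionLag();
        return lag.getKey() + "=" + lag.getValue();
    }
}

// A Kafka-like supervisor only reports record lag.
class RecordLagSupervisorSketch extends SupervisorSketch<Integer> {
    @Override
    protected Map.Entry<LagTypeSketch, Map<Integer, Long>> getPartitionLag() {
        return new SimpleImmutableEntry<>(LagTypeSketch.RECORD, Collections.singletonMap(0, 42L));
    }
}

// A Kinesis-like supervisor only reports time lag (milliseconds in this example).
class TimeLagSupervisorSketch extends SupervisorSketch<String> {
    @Override
    protected Map.Entry<LagTypeSketch, Map<String, Long>> getPartitionLag() {
        return new SimpleImmutableEntry<>(LagTypeSketch.TIME, Collections.singletonMap("shard-0", 1500L));
    }
}

class PartitionLagDemo {
    public static void main(String[] args) {
        System.out.println(new RecordLagSupervisorSketch().describeLag()); // prints RECORD={0=42}
        System.out.println(new TimeLagSupervisorSketch().describeLag());   // prints TIME={shard-0=1500}
    }
}
```

In this sketch, adding a third lag kind would mean adding an enum constant and one new subclass; existing subclasses would not need a new method.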

abhishekrb19 · 2024-09-18T00:23:33Z

@panhongan I haven't looked into the changes yet. Could you please update the PR summary to describe what negative lag metrics issue you were noticing, which Druid version it was observed in, etc? One such negative lag reporting was fixed in #14292.

panhongan · 2024-09-27T00:57:14Z

> @panhongan I haven't looked into the changes yet. Could you please update the PR summary to describe what negative lag metrics issue you were noticing, which Druid version it was observed in, etc? One such negative lag reporting was fixed in #14292.

@abhishekrb19 @AmatyaAvadhanula please help review. Thanks.

1. fix negative lag metrics issue; 2. improve API design for partition lag

fe4493c

github-actions bot added the Area - Streaming Ingestion, Area - Dependencies, and Area - Ingestion labels on Sep 13, 2024

panhongan added 7 commits September 13, 2024 23:32

pom issue

7e23f97

fix style issue

328683f

fix style issue

2f16860

fix style issue

2a80e88

add kinesis dependency

a338166

add kafka indexing dependency

ba9d3a9

add rabbit indexing dependency

162d198

AmatyaAvadhanula self-requested a review September 16, 2024 09:21

fix unit test issue

bc42b12
