Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix negative lag metircs issue + improve API design for parition lag #17060

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

panhongan
Copy link
Contributor

@panhongan panhongan commented Sep 13, 2024

Fixes #XXXX.
org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor org.apache.druid.indexing.kafka.supervisor.KafkaSupervisor org.apache.druid.indexing.kinesis.supervisor.KinesisSupervisor.java org.apache.druid.indexing.rabbitstream.supervisor.RabbitStreamSupervisor

Description

Problem 1: negtive lag issue for 2 scenarios
S1: No issue for kafka, but there is occasional negative lag due to thread-safe issue. (we should fix this issue)
S2: If can't connect to kafka or kafka connection was broken, we can see negative lag. (negative is helpful to application, should not be skipped)

skip emitting if there was negative lag, this is bad idea.
For S1, looks no impact on data ingestion, only emitter metrics is missing at few time points.
For S2, some companies already build monitoring based on the negative partition lag, if negative lag was not reported, their monitor will not work.
Back to the negative lag issue, I think we should fix it instead of skipping.

For S1, why there was negative lag?
In class SeekableStreamSupervisor:

updateCurrentAndLatestOffsets() : PT30S
--updateCurrentOffsets() -> get task reading offset : OFFSET1
--updatePartitionLagFromStream() -> get partition writing end offset : OFFSET2

in another thread:
/druid/indexer/v1/supervisor//status -> SeekableStreamSupervisor::getStatus() -> SeekableStreamSupervisor::generateReport() -> calculation lag = OFFSET2 - OFFSET1

When the negative lag issue happend?

  1. updateCurrentAndLatestOffsets() : executed
  2. /druid/indexer/v1/supervisor//status : invoked
  3. updatePartitionLagFromStream() : not executed

So the idea is we can make the OFFSET1 & OFFSET2 have the same version.

=================================================================================
Problem 2: Bad design for getPartitionRecordLag() & getPartitionTimeLag()

We want to support 2 kinds of partition lag, but like KafkaSupervisor, only need record partition lag. Like KinesisSupervisor, only need time partition lag.
For expansibility, if we want to add another type of partition lag, do we plan to add another method like: getPartition*Lag() ?
And all the existing *Supervisor class need to implements the new method, sounds not make sense.
From the design pattern side, provide 1 method with returned type is better than provide 2 separated methods.

protected abstract Pair<StreamPartitionLagType, Map<PartitionIdType, Long>> getPartitionLag();

vs

protected abstract Map<PartitionIdType, Long> getPartitionRecordLag();
protected abstract Map<PartitionIdType, Long> getPartitionTimeLag();

Fixed the bug ...

Renamed the class ...

Added a forbidden-apis entry ...

Release note


Key changed/added classes in this PR
  • MyFoo
  • OurBar
  • TheirBaz

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@abhishekrb19
Copy link
Contributor

@panhongan I haven't looked into the changes yet. Could you please update the PR summary to describe what negative lag metrics issue you were noticing, which Druid version it was observed in, etc? One such negative lag reporting was fixed in #14292.

@panhongan
Copy link
Contributor Author

@panhongan I haven't looked into the changes yet. Could you please update the PR summary to describe what negative lag metrics issue you were noticing, which Druid version it was observed in, etc? One such negative lag reporting was fixed in #14292.

@abhishekrb19 @AmatyaAvadhanula help review, Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants