Add max qps bucket count #5922

jackjlli · 2020-08-25T22:29:25Z

Description

This PR adds the ability to get the maximum qps count within a minute.

In some monitoring systems, metrics are emitted at a fixed frequency, e.g. once every minute. So if a short burst of queries hits the cluster, there is no way for it to show up in the metrics.
This PR introduces a counter that reports the maximum count among all the seconds within a minute, so that we always know the real state of the cluster.

mcvsubbu

Shouldn't we be setting the value to "Max qps within a second since the time the callback was invoked last"? In some systems, the polling may be more often than 1 minute, and in others less often. So, we should keep a hit counter for some max time (say, 10m).

Also, once the poll is done, we should reset the hit counter so that we get the max from the polling time until the next time it is polled.

Note that in any system, it is not guaranteed that the polling is always in a certain interval. The time interval could be close to a certain period, but can fluctuate by some small percentage either way. So, if the poll comes in 65 seconds, and our first second had a burst, we will lose it (as per your implementation).
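
To make the timing concern concrete, here is a minimal sketch of a stateless reader that always looks back a fixed window from the moment it is polled. It is an illustration only, not the PR's code: the class name, the method, and the per-second array are all hypothetical.

```java
// Hypothetical illustration only; none of these names come from the PR.
class FixedWindowSketch {
  /**
   * Returns the largest per-second hit count in the window
   * [pollSecond - windowSeconds, pollSecond).
   */
  static int maxInFixedWindow(int[] hitsPerSecond, int pollSecond, int windowSeconds) {
    int from = Math.max(0, pollSecond - windowSeconds);
    int to = Math.min(pollSecond, hitsPerSecond.length);
    int max = 0;
    for (int second = from; second < to; second++) {
      max = Math.max(max, hitsPerSecond[second]);
    }
    return max;
  }
}
```

With a burst recorded in second 0 and a 60-second window, a poll at second 60 still covers the burst, while a poll at second 65 only covers seconds 5 through 64 and reports the background rate instead.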

jackjlli · 2020-08-26T05:11:57Z

Shouldn't we be setting the value to "Max qps within a second since the time the callback was invoked last"?

We don't want to make the metrics system stateful. Every time the callback method gets called, it should return whatever the value should be. Thus, I don't think it's good to reset the counts when the callback gets called. Otherwise, we would have to change all the metrics in the Pinot cluster to be stateful.

In some systems, the polling may be more often than 1 minute, and in others less often. So, we should keep a hit counter for some max time (say, 10m).

I admit that the frequency of poll may vary. We can make it configurable. But the purpose of this new metric is to track the qps related statistics. It will only be emitted when the qps quota is set. If 10 mins is the granularity for a system to track qps, then I don't think they need to set qps quota for their tables.

The time interval could be close to a certain period, but can fluctuate by some small percentage either way. So, if the poll comes in 65 seconds, and our first second had a burst, we will lose it (as per your implementation).

This is a rare case. In fact, what we are trying to solve is detecting bursts of queries that last for a while. A burst may occasionally be missed, but it will not be missed every time. Plus, if it happens quite often, then I think the issue is with the polling, and that should be fixed rather than adjusting our stateless metric system. The callback function never knows in advance when it will be called; when it gets called, it should return the exact max qps within a minute, as per the requirement.

mcvsubbu · 2020-08-26T16:09:17Z

Shouldn't we be setting the value to "Max qps within a second since the time the callback was invoked last"?

We don't want to make the metrics system stateful. Every time the callback method gets called, it should return whatever the value should be. Thus, I don't think it's good to reset the counts when the callback gets called. Otherwise, we would have to change all the metrics in the Pinot cluster to be stateful.

There is a difference between this one and all the others that we have. In the others, the state is still there, it is being maintained by the metrics library. How else do you think we get percentiles? In this case, we want a much finer granularity, and hence the need for it to clear state and report the max that it recorded since the last call. Otherwise, we will be emitting metrics that do not reflect the real state as we want to measure it.

In some systems, the polling may be more often than 1 minute, and in others less often. So, we should keep a hit counter for some max time (say, 10m).

I admit that the frequency of poll may vary. We can make it configurable. But the purpose of this new metric is to track the qps related statistics. It will only be emitted when the qps quota is set. If 10 mins is the granularity for a system to track qps, then I don't think they need to set qps quota for their tables.

The time interval could be close to a certain period, but can fluctuate by some small percentage either way. So, if the poll comes in 65 seconds, and our first second had a burst, we will lose it (as per your implementation).

This is a rare case. In fact, what we are trying to solve is detecting bursts of queries that last for a while. A burst may occasionally be missed, but it will not be missed every time. Plus, if it happens quite often, then I think the issue is with the polling, and that should be fixed rather than adjusting our stateless metric system. The callback function never knows in advance when it will be called; when it gets called, it should return the exact max qps within a minute, as per the requirement.

I am not sure how rare a case this is.

jackjlli · 2020-08-26T17:56:56Z

Shouldn't we be setting the value to "Max qps within a second since the time the callback was invoked last"?

We don't want to make the metrics system stateful. Every time the callback method gets called, it should return whatever the value should be. Thus, I don't think it's good to reset the counts when the callback gets called. Otherwise, we would have to change all the metrics in the Pinot cluster to be stateful.

There is a difference between this one and all the others that we have. In the others, the state is still there, it is being maintained by the metrics library. How else do you think we get percentiles? In this case, we want a much finer granularity, and hence the need for it to clear state and report the max that it recorded since the last call. Otherwise, we will be emitting metrics that do not reflect the real state as we want to measure it.

I'm talking about the gauge values, not the meter values. The existing gauge values are maintained in a ConcurrentHashMap in the MetricsHelper class, which is Pinot's code. The emitted gauge value is just the value at the moment the callback function gets called. There is no state at all.

In some systems, the polling may be more often than 1 minute, and in others less often. So, we should keep a hit counter for some max time (say, 10m).

I admit that the frequency of poll may vary. We can make it configurable. But the purpose of this new metric is to track the qps related statistics. It will only be emitted when the qps quota is set. If 10 mins is the granularity for a system to track qps, then I don't think they need to set qps quota for their tables.

The time interval could be close to a certain period, but can fluctuate by some small percentage either way. So, if the poll comes in 65 seconds, and our first second had a burst, we will lose it (as per your implementation).

This is a rare case. In fact, what we are trying to solve is detecting bursts of queries that last for a while. A burst may occasionally be missed, but it will not be missed every time. Plus, if it happens quite often, then I think the issue is with the polling, and that should be fixed rather than adjusting our stateless metric system. The callback function never knows in advance when it will be called; when it gets called, it should return the exact max qps within a minute, as per the requirement.

I am not sure how rare a case this is.

The case I mentioned is one where the counter always misses the burst of queries in the first second; the burst would have to be so well timed that it always lands exactly where the hit counter cannot detect it.
In another case, if the callback gets invoked twice in a row with only a small gap between the calls (maybe caused by GC), the stateful approach won't give you the correct number, because the value has already been reset by the first call.

mcvsubbu · 2020-08-26T23:57:48Z

Discussed offline. The "clearing" of values is logical. Implementation can be that we maintain a from/to handle on the circular buffer and count only those buckets populated since the last call.
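
A minimal sketch of that idea follows, assuming one bucket per second, a ring sized at twice the default range, and plain `synchronized` methods for simplicity. All names here are hypothetical and the merged code may differ. The from handle is the previous call time and the to handle is the current bucket, which is excluded because it is still filling.

```java
import java.util.Arrays;

// Hypothetical sketch; names, sizes and locking are assumptions, not the PR's code.
class MaxHitRateSketch {
  private static final long BUCKET_WIDTH_MS = 1000L; // one bucket per second

  private final long[] _bucketUnit; // which second each slot currently holds, -1 if unused
  private final int[] _bucketHits;  // hit count for that second
  private final long _defaultRangeMs;
  private final long _maxRangeMs;
  private long _lastAccessMs;

  MaxHitRateSketch(int defaultRangeSeconds, long nowMs) {
    _defaultRangeMs = defaultRangeSeconds * 1000L;
    _maxRangeMs = 2 * _defaultRangeMs; // assumed factor of 2
    int bucketCount = (int) (_maxRangeMs / BUCKET_WIDTH_MS);
    _bucketUnit = new long[bucketCount];
    _bucketHits = new int[bucketCount];
    Arrays.fill(_bucketUnit, -1L);
    _lastAccessMs = nowMs;
  }

  synchronized void hit(long nowMs) {
    long unit = nowMs / BUCKET_WIDTH_MS;
    int idx = (int) (unit % _bucketUnit.length);
    if (_bucketUnit[idx] != unit) {
      // The slot still holds an older second: recycle it.
      _bucketUnit[idx] = unit;
      _bucketHits[idx] = 0;
    }
    _bucketHits[idx]++;
  }

  /** Max per-bucket count over the buckets populated since the previous call. */
  synchronized int getMaxSinceLastCall(long nowMs) {
    long then = _lastAccessMs;
    if (nowMs - then > _maxRangeMs) {
      // Not polled for too long: fall back to the default look-back range.
      then = nowMs - _defaultRangeMs;
    }
    long fromUnit = then / BUCKET_WIDTH_MS;
    long toUnit = nowMs / BUCKET_WIDTH_MS; // exclusive
    int max = 0;
    for (long unit = fromUnit; unit < toUnit; unit++) {
      int idx = (int) (unit % _bucketUnit.length);
      if (_bucketUnit[idx] == unit) { // skip slots holding stale seconds
        max = Math.max(max, _bucketHits[idx]);
      }
    }
    _lastAccessMs = nowMs; // the logical "clear": the next call starts from here
    return max;
  }
}
```

Nothing is physically reset on read. A poll that arrives at 65 seconds still sees a burst from the first second, and an immediate second poll reports only what has arrived since the first one.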

mcvsubbu

Looks good, some minor suggestions

pinot-broker/src/main/java/org/apache/pinot/broker/queryquota/QueryQuotaEntity.java

mcvsubbu · 2020-08-27T17:54:29Z

pinot-broker/src/main/java/org/apache/pinot/broker/queryquota/StatefulHitCounter.java

+  private long _defaultQueriedTimeRangeMs;
+  private long _lastAccessTimestamp;
+
+  public StatefulHitCounter(int timeRangeInSeconds, int bucketCount, int defaultQueriedTimeRangeInSeconds) {


Suggested change

-  public StatefulHitCounter(int timeRangeInSeconds, int bucketCount, int defaultQueriedTimeRangeInSeconds) {
+  public StatefulHitCounter(int queriedTimeRangeInSeconds) {

Derive the other two locally. The timeRange we maintain could be 2 * queriedTimeRangeInSeconds, or even 1.5 times that. The bucket count is also something that could be decided by the StatefulHitCounter.

pinot-broker/src/main/java/org/apache/pinot/broker/queryquota/StatefulHitCounter.java

pinot-common/src/main/java/org/apache/pinot/common/metrics/BrokerGauge.java

mcvsubbu · 2020-08-27T22:44:47Z

pinot-broker/src/main/java/org/apache/pinot/broker/queryquota/MaxHitRateTracker.java

+    this(timeRangeInSeconds, timeRangeInSeconds * MAX_TIME_RANGE_FACTOR);
+  }
+
+  public MaxHitRateTracker(int defaultTimeRangeInSeconds, int maxTimeRangeInSeconds) {


Why do we need this constructor?

Basically, it is maxTimeRangeInSeconds, not the default one, that needs to be passed into the parent constructor, and maxTimeRangeInSeconds has to be calculated from the default one. Without this constructor, we'd have to recalculate maxTimeRangeInSeconds multiple times before assigning it in this class. The extra constructor lets us avoid that duplicate calculation.

Can we make that constructor private then?
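
As a rough sketch of what that could look like, the one-argument constructor stays visible and delegates to a private two-argument one, so the derived maximum is computed exactly once. The class name, the factor value, the bucket sizing, and the getters are assumptions for illustration. The parent-class call is omitted because its signature is not shown in this thread.

```java
// Hypothetical sketch; only MAX_TIME_RANGE_FACTOR and the parameter names appear in the diff.
class TrackerConstructorSketch {
  // Assumed value; the review only mentions 2x or 1.5x as possibilities.
  private static final int MAX_TIME_RANGE_FACTOR = 2;

  private final long _defaultTimeRangeMs;
  private final long _maxTimeRangeMs;
  private final int _bucketCount;

  TrackerConstructorSketch(int timeRangeInSeconds) {
    // Compute the max range once and hand both values to the private constructor.
    this(timeRangeInSeconds, timeRangeInSeconds * MAX_TIME_RANGE_FACTOR);
  }

  // Private: callers cannot pass an inconsistent default/max pair.
  private TrackerConstructorSketch(int defaultTimeRangeInSeconds, int maxTimeRangeInSeconds) {
    _defaultTimeRangeMs = defaultTimeRangeInSeconds * 1000L;
    _maxTimeRangeMs = maxTimeRangeInSeconds * 1000L;
    _bucketCount = maxTimeRangeInSeconds; // assumption: one bucket per second
  }

  long getDefaultTimeRangeMs() {
    return _defaultTimeRangeMs;
  }

  long getMaxTimeRangeMs() {
    return _maxTimeRangeMs;
  }

  int getBucketCount() {
    return _bucketCount;
  }
}
```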

mcvsubbu · 2020-08-27T22:47:07Z

pinot-broker/src/main/java/org/apache/pinot/broker/queryquota/MaxHitRateTracker.java

+  }
+
+  @VisibleForTesting
+  int getMaxCountPerBucket(long now) {


Should this method be synchronized?
I would code the method like this:

Get the value of _lastAccessTimestamp at the beginning of the method (then = _lastAccessTimestamp)

Use then throughout the method

Set _lastAccessTimestamp to now at the end of the method

That will protect us against multiple calls, if any (just in case).

It will also prevent us from accessing the volatile variable repeatedly.

No, it doesn't need to be synchronized.
The callback function will be called by a single thread, so _lastAccessTimestamp will also be modified only by that thread.
The bucket belonging to the end index won't be queried, so there is no need to lock on it. Plus, all the buckets are already held in an AtomicIntegerArray, so there is no need to add extra protection.
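
The read-once pattern from the suggestion can be sketched as below. The field names follow the diff, but the class name, the constructor, the constant values, and the return value are assumptions made so the snippet runs on its own. It keeps each call internally consistent, although it does not by itself make two overlapping callers safe.

```java
// Hypothetical sketch; field names follow the diff, everything else is assumed.
class LastAccessSketch {
  private final long _defaultTimeRangeMs = 60_000L;  // assumed
  private final long _maxTimeRangeMs = 120_000L;     // assumed
  private final long _timeBucketWidthMs = 1_000L;    // assumed
  private volatile long _lastAccessTimestamp;

  LastAccessSketch(long startMs) {
    _lastAccessTimestamp = startMs;
  }

  /** Returns the first time unit to scan for this call. */
  long startTimeUnits(long now) {
    long then = _lastAccessTimestamp; // single volatile read
    if (now - then > _maxTimeRangeMs) {
      // Not queried for too long: fall back to the default range.
      then = now - _defaultTimeRangeMs;
    }
    long startTimeUnits = then / _timeBucketWidthMs;
    _lastAccessTimestamp = now; // single volatile write, at the end
    return startTimeUnits;
  }
}
```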

mcvsubbu · 2020-08-27T22:50:21Z

pinot-broker/src/main/java/org/apache/pinot/broker/queryquota/MaxHitRateTracker.java

+  @VisibleForTesting
+  int getMaxCountPerBucket(long now) {
+    // Update the last access timestamp if the hit counter didn't get queried for more than _maxTimeRangeMs.
+    if (now - _lastAccessTimestamp > _maxTimeRangeMs) {


Suggested change

-    if (now - _lastAccessTimestamp > _maxTimeRangeMs) {
+    long then = _lastAccessTimestamp;
+    if (now - then > _maxTimeRangeMs) {
+      then = now - _defaultTimeRangeMs;
+    }
+    long startTimeUnits = then / _timeBucketWidthMs;

mcvsubbu

Other than the change to pull the volatile variable once per call, we are good, thanks

Add max qps bucket count

b66b420

jackjlli requested a review from mcvsubbu August 25, 2020 22:29

mcvsubbu reviewed Aug 26, 2020

Introduce StatefulHitCounter

0569372

jackjlli force-pushed the add-max-qps-bucket-count branch from 3b5b6e9 to 0569372 Compare August 27, 2020 17:07

mcvsubbu reviewed Aug 27, 2020

jackjlli force-pushed the add-max-qps-bucket-count branch from a839d6b to 9f3ada5 Compare August 27, 2020 20:50

mcvsubbu reviewed Aug 27, 2020

mcvsubbu approved these changes Aug 27, 2020

Address PR comments

30103ce

jackjlli force-pushed the add-max-qps-bucket-count branch from 9f3ada5 to 30103ce Compare August 28, 2020 00:03

jackjlli merged commit 6b78dcc into master Aug 28, 2020

jackjlli deleted the add-max-qps-bucket-count branch August 28, 2020 07:02
