-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add max qps bucket count #5922
Add max qps bucket count #5922
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we be setting the value to "Max qps within a second since the time the callback was invoked last"? In some systems, the polling may be more often than 1 minute, and in others less often. So, we sholuld keep a hit counter for some max time (say, 10m).
Also, once the poll is done, we should reset the hit counter so that we get the max from the polling time until the next time it is polled.
Note that in any system, it is not guaranteed that the polling is always in a certain interval. The time interval could be close to a certain period, but can fluctuate by some small percentage either way. So, if the poll comes in 65 seconds, and our first second had a burst, we will lose it (as per your implementation).
We don't want to make the metrics system stateful. Every time the callback method gets called, it should return whatever the value should be.Thus, I don't think it's good to reset the counts when the callback gets called. Otherwise, we should have changed all the metrics to be stateful in Pinot cluster.
I admit that the frequency of poll may vary. We can make it configurable. But the purpose of this new metric is to track the qps related statistics. It will only be emitted when the qps quota is set. If 10 mins is the granularity for a system to track qps, then I don't think they need to set qps quota for their tables.
This is a rare case. In fact, what we are trying to solve is to find a way to detect the burst of queries which last for a while. They may be ignored, but will never be ignored all the time. Plus, if it happens quite often, then I think there is some issue on polling instead of adjusting our stateless metric system. The callback function never knows when it will be called in advance; when the callback function gets called, it should return the exact max qps within a minute based on the requirement. |
There is a difference between this one and all the others that we have. In the others, the state is still there, it is being maintained by the metrics library. How else do you think we get percentiles? In this case, we want a much finer granularity, and hence the need for it to clear state and report the max that it recorded since the last call. Otherwise, we will be emitting metrics that do not reflect the real state as we want to measure it.
I am not sure how rare a case this is. |
I'm talking about the gauge values, not the meter values. The existing gauge values are maintained in a ConcurrentHashMap in
The case I mentioned here is that the it always neglects the burst of queries in the first second; the burst is so smart that it always happens at the time when the hit counter couldn't detect. |
Discussed offline. The "clearing" of values is logical. Implementation can be that we maintain a from/to handle on the circular buffer and count only those buckets populated since the last call. |
3b5b6e9
to
0569372
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, some minor suggestions
pinot-broker/src/main/java/org/apache/pinot/broker/queryquota/QueryQuotaEntity.java
Outdated
Show resolved
Hide resolved
pinot-broker/src/main/java/org/apache/pinot/broker/queryquota/QueryQuotaEntity.java
Outdated
Show resolved
Hide resolved
private long _defaultQueriedTimeRangeMs; | ||
private long _lastAccessTimestamp; | ||
|
||
public StatefulHitCounter(int timeRangeInSeconds, int bucketCount, int defaultQueriedTimeRangeInSeconds) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public StatefulHitCounter(int timeRangeInSeconds, int bucketCount, int defaultQueriedTimeRangeInSeconds) { | |
public StatefulHitCounter(int queriedTimeRangeInSeconds) { |
Derive the other two locally. So, the timeRange we maintain could be 2*queriedTimeRangeInSeconds
or even 1.5
times . the bucket count is also something that could be decided by the statefulHitCounter.
pinot-broker/src/main/java/org/apache/pinot/broker/queryquota/StatefulHitCounter.java
Outdated
Show resolved
Hide resolved
pinot-broker/src/main/java/org/apache/pinot/broker/queryquota/StatefulHitCounter.java
Outdated
Show resolved
Hide resolved
pinot-broker/src/main/java/org/apache/pinot/broker/queryquota/StatefulHitCounter.java
Outdated
Show resolved
Hide resolved
pinot-broker/src/main/java/org/apache/pinot/broker/queryquota/StatefulHitCounter.java
Outdated
Show resolved
Hide resolved
pinot-broker/src/main/java/org/apache/pinot/broker/queryquota/StatefulHitCounter.java
Outdated
Show resolved
Hide resolved
pinot-broker/src/main/java/org/apache/pinot/broker/queryquota/StatefulHitCounter.java
Outdated
Show resolved
Hide resolved
pinot-common/src/main/java/org/apache/pinot/common/metrics/BrokerGauge.java
Outdated
Show resolved
Hide resolved
a839d6b
to
9f3ada5
Compare
this(timeRangeInSeconds, timeRangeInSeconds * MAX_TIME_RANGE_FACTOR); | ||
} | ||
|
||
public MaxHitRateTracker(int defaultTimeRangeInSeconds, int maxTimeRangeInSeconds) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically it's the maxTimeRangeInSeconds
that needs to be passed into the parent constructor instead of the default one. While maxTimeRangeInSeconds
has to be calculated from the default one. Thus, we'd have to re-calculate the maxTimeRangeInSeconds
multiple times before we assign it to this class. Creating an extra constructor gives us the ability to reduce the duplicate calculation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make that constructor private then?
} | ||
|
||
@VisibleForTesting | ||
int getMaxCountPerBucket(long now) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this method be synchronized?
I would code the method like this:
- Get the value of _lastAccessTimeStamp in the beginning of the method (
then = _lastAccessTimeStamp
) - Use
then
throughout the method - Set the
_lastAccessTimestamp
tonow
at the end of the method
That will protect us against multiple calls, if any (just in case).
it will also prevent us from accessing the volatile variable repeatedly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it doesn't need to be synchronized.
The callback function will be called by 1 single thread. Thus, _lastAccessTimeStamp
will also be modified by the same thread.
The bucket belonging to the end index won't be queried, so there is no need to add the block on it. Plus, all the buckets have already been in AtomicIntegerArray. There is no need to add extra protection
@VisibleForTesting | ||
int getMaxCountPerBucket(long now) { | ||
// Update the last access timestamp if the hit counter didn't get queried for more than _maxTimeRangeMs. | ||
if (now - _lastAccessTimestamp > _maxTimeRangeMs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (now - _lastAccessTimestamp > _maxTimeRangeMs) { | |
then = _lastAccessTimeStamp; | |
if (now - then > _maxTimeRangeMs) { | |
then = now - _defaultTimeRangeMs; | |
} | |
long startTimeUnits = then / _timeBucketWidthMs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the change to pull the volatile variable once per call, we are good, thanks
9f3ada5
to
30103ce
Compare
Description
This PR adds the ability of getting maximum qps count within a minute.
In some monitoring system, metrics are emitted in some certain frequency, e.g. every 1 minute. So if there is a burst of qps hitting to the cluster, there is no way to reflect on the metrics.
This PR introduces a counter to get the maximum counts among all the seconds within a minute, so that we always know the real circumstances of the cluster.