Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add max qps bucket count #5922

Merged
merged 3 commits into from
Aug 28, 2020
Merged

Add max qps bucket count #5922

merged 3 commits into from
Aug 28, 2020

Conversation

jackjlli
Copy link
Member

Description

This PR adds the ability of getting maximum qps count within a minute.

In some monitoring system, metrics are emitted in some certain frequency, e.g. every 1 minute. So if there is a burst of qps hitting to the cluster, there is no way to reflect on the metrics.
This PR introduces a counter to get the maximum counts among all the seconds within a minute, so that we always know the real circumstances of the cluster.

@jackjlli jackjlli requested a review from mcvsubbu August 25, 2020 22:29
Copy link
Contributor

@mcvsubbu mcvsubbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we be setting the value to "Max qps within a second since the time the callback was invoked last"? In some systems, the polling may be more often than 1 minute, and in others less often. So, we sholuld keep a hit counter for some max time (say, 10m).

Also, once the poll is done, we should reset the hit counter so that we get the max from the polling time until the next time it is polled.

Note that in any system, it is not guaranteed that the polling is always in a certain interval. The time interval could be close to a certain period, but can fluctuate by some small percentage either way. So, if the poll comes in 65 seconds, and our first second had a burst, we will lose it (as per your implementation).

@jackjlli
Copy link
Member Author

jackjlli commented Aug 26, 2020

Shouldn't we be setting the value to "Max qps within a second since the time the callback was invoked last"?

We don't want to make the metrics system stateful. Every time the callback method gets called, it should return whatever the value should be.Thus, I don't think it's good to reset the counts when the callback gets called. Otherwise, we should have changed all the metrics to be stateful in Pinot cluster.

In some systems, the polling may be more often than 1 minute, and in others less often. So, we should keep a hit counter for some max time (say, 10m).

I admit that the frequency of poll may vary. We can make it configurable. But the purpose of this new metric is to track the qps related statistics. It will only be emitted when the qps quota is set. If 10 mins is the granularity for a system to track qps, then I don't think they need to set qps quota for their tables.

The time interval could be close to a certain period, but can fluctuate by some small percentage either way. So, if the poll comes in 65 seconds, and our first second had a burst, we will lose it (as per your implementation).

This is a rare case. In fact, what we are trying to solve is to find a way to detect the burst of queries which last for a while. They may be ignored, but will never be ignored all the time. Plus, if it happens quite often, then I think there is some issue on polling instead of adjusting our stateless metric system. The callback function never knows when it will be called in advance; when the callback function gets called, it should return the exact max qps within a minute based on the requirement.

@mcvsubbu
Copy link
Contributor

Shouldn't we be setting the value to "Max qps within a second since the time the callback was invoked last"?

We don't want to make the metrics system stateful. Every time the callback method gets called, it should return whatever the value should be.Thus, I don't think it's good to reset the counts when the callback gets called. Otherwise, we should have changed all the metrics to be stateful in Pinot cluster.

There is a difference between this one and all the others that we have. In the others, the state is still there, it is being maintained by the metrics library. How else do you think we get percentiles? In this case, we want a much finer granularity, and hence the need for it to clear state and report the max that it recorded since the last call. Otherwise, we will be emitting metrics that do not reflect the real state as we want to measure it.

In some systems, the polling may be more often than 1 minute, and in others less often. So, we should keep a hit counter for some max time (say, 10m).

I admit that the frequency of poll may vary. We can make it configurable. But the purpose of this new metric is to track the qps related statistics. It will only be emitted when the qps quota is set. If 10 mins is the granularity for a system to track qps, then I don't think they need to set qps quota for their tables.

The time interval could be close to a certain period, but can fluctuate by some small percentage either way. So, if the poll comes in 65 seconds, and our first second had a burst, we will lose it (as per your implementation).

This is a rare case. In fact, what we are trying to solve is to find a way to detect the burst of queries which last for a while. They may be ignored, but will never be ignored all the time. Plus, if it happens quite often, then I think there is some issue on polling instead of adjusting our stateless metric system. The callback function never knows when it will be called in advance; when the callback function gets called, it should return the exact max qps within a minute based on the requirement.

I am not sure how rare a case this is.

@jackjlli
Copy link
Member Author

Shouldn't we be setting the value to "Max qps within a second since the time the callback was invoked last"?

We don't want to make the metrics system stateful. Every time the callback method gets called, it should return whatever the value should be.Thus, I don't think it's good to reset the counts when the callback gets called. Otherwise, we should have changed all the metrics to be stateful in Pinot cluster.

There is a difference between this one and all the others that we have. In the others, the state is still there, it is being maintained by the metrics library. How else do you think we get percentiles? In this case, we want a much finer granularity, and hence the need for it to clear state and report the max that it recorded since the last call. Otherwise, we will be emitting metrics that do not reflect the real state as we want to measure it.

I'm talking about the gauge values, not the meter values. The existing gauge values are maintained in a ConcurrentHashMap in MetrcisHelper class, which is Pinot's code. The emitted gauge value is just the one when the callback function gets called. There is no state at all.

In some systems, the polling may be more often than 1 minute, and in others less often. So, we should keep a hit counter for some max time (say, 10m).

I admit that the frequency of poll may vary. We can make it configurable. But the purpose of this new metric is to track the qps related statistics. It will only be emitted when the qps quota is set. If 10 mins is the granularity for a system to track qps, then I don't think they need to set qps quota for their tables.

The time interval could be close to a certain period, but can fluctuate by some small percentage either way. So, if the poll comes in 65 seconds, and our first second had a burst, we will lose it (as per your implementation).

This is a rare case. In fact, what we are trying to solve is to find a way to detect the burst of queries which last for a while. They may be ignored, but will never be ignored all the time. Plus, if it happens quite often, then I think there is some issue on polling instead of adjusting our stateless metric system. The callback function never knows when it will be called in advance; when the callback function gets called, it should return the exact max qps within a minute based on the requirement.

I am not sure how rare a case this is.

The case I mentioned here is that the it always neglects the burst of queries in the first second; the burst is so smart that it always happens at the time when the hit counter couldn't detect.
In another case, if the callback gets invoked 2 consecutive times which gap is small (maybe caused by GC). The stateful won't give you the correct number, as the value has already been reset by the first call.

@mcvsubbu
Copy link
Contributor

Discussed offline. The "clearing" of values is logical. Implementation can be that we maintain a from/to handle on the circular buffer and count only those buckets populated since the last call.

@jackjlli jackjlli force-pushed the add-max-qps-bucket-count branch from 3b5b6e9 to 0569372 Compare August 27, 2020 17:07
Copy link
Contributor

@mcvsubbu mcvsubbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, some minor suggestions

private long _defaultQueriedTimeRangeMs;
private long _lastAccessTimestamp;

public StatefulHitCounter(int timeRangeInSeconds, int bucketCount, int defaultQueriedTimeRangeInSeconds) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public StatefulHitCounter(int timeRangeInSeconds, int bucketCount, int defaultQueriedTimeRangeInSeconds) {
public StatefulHitCounter(int queriedTimeRangeInSeconds) {

Derive the other two locally. So, the timeRange we maintain could be 2*queriedTimeRangeInSeconds or even 1.5 times . the bucket count is also something that could be decided by the statefulHitCounter.

@jackjlli jackjlli force-pushed the add-max-qps-bucket-count branch from a839d6b to 9f3ada5 Compare August 27, 2020 20:50
this(timeRangeInSeconds, timeRangeInSeconds * MAX_TIME_RANGE_FACTOR);
}

public MaxHitRateTracker(int defaultTimeRangeInSeconds, int maxTimeRangeInSeconds) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this constructor?

Copy link
Member Author

@jackjlli jackjlli Aug 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically it's the maxTimeRangeInSeconds that needs to be passed into the parent constructor instead of the default one. While maxTimeRangeInSeconds has to be calculated from the default one. Thus, we'd have to re-calculate the maxTimeRangeInSeconds multiple times before we assign it to this class. Creating an extra constructor gives us the ability to reduce the duplicate calculation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make that constructor private then?

}

@VisibleForTesting
int getMaxCountPerBucket(long now) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this method be synchronized?
I would code the method like this:

  • Get the value of _lastAccessTimeStamp in the beginning of the method (then = _lastAccessTimeStamp)
  • Use then throughout the method
  • Set the _lastAccessTimestamp to now at the end of the method

That will protect us against multiple calls, if any (just in case).

it will also prevent us from accessing the volatile variable repeatedly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it doesn't need to be synchronized.
The callback function will be called by 1 single thread. Thus, _lastAccessTimeStamp will also be modified by the same thread.
The bucket belonging to the end index won't be queried, so there is no need to add the block on it. Plus, all the buckets have already been in AtomicIntegerArray. There is no need to add extra protection

@VisibleForTesting
int getMaxCountPerBucket(long now) {
// Update the last access timestamp if the hit counter didn't get queried for more than _maxTimeRangeMs.
if (now - _lastAccessTimestamp > _maxTimeRangeMs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (now - _lastAccessTimestamp > _maxTimeRangeMs) {
then = _lastAccessTimeStamp;
if (now - then > _maxTimeRangeMs) {
then = now - _defaultTimeRangeMs;
}
long startTimeUnits = then / _timeBucketWidthMs;

Copy link
Contributor

@mcvsubbu mcvsubbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the change to pull the volatile variable once per call, we are good, thanks

@jackjlli jackjlli force-pushed the add-max-qps-bucket-count branch from 9f3ada5 to 30103ce Compare August 28, 2020 00:03
@jackjlli jackjlli merged commit 6b78dcc into master Aug 28, 2020
@jackjlli jackjlli deleted the add-max-qps-bucket-count branch August 28, 2020 07:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants