kvserver: introduce cpu load based splits #96128
Conversation
Force-pushed from 05cbe42 to f720659.
Reviewed 7 of 7 files at r4, 21 of 21 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @kvoli)
-- commits
line 84 at r5:
How was the 250ms number evaluated? Should we run an experiment like Andrew did in #39687 to tune it to find the right balance between batching and concurrency?
pkg/kv/kvserver/batcheval/cmd_range_stats.go
line 54 at r4 (raw file):
	reply.MaxQueriesPerSecond = -1
}
reply.MaxQueriesPerSecondSet = true
Do we need to keep setting this to true for another release?
pkg/kv/kvserver/batcheval/cmd_range_stats.go
line 54 at r5 (raw file):
reply.MaxCPUPerSecond, reply.MaxQueriesPerSecond = -1, -1
if qpsOK && cpuOK {
	return result.Result{}, errors.AssertionFailedf("unexpected both QPS and CPU range statistics set")
Is this possible if the cluster switched from QPS-based splitting to CPU-based splitting between the calls to GetMaxSplitQPS and GetMaxSplitCPU above?
pkg/kv/kvserver/batcheval/eval_context.go
line 92 at r5 (raw file):
GetMaxSplitQPS(context.Context) (float64, bool)
// GetMaxSplitCPU returns the Replicas maximum request cpu/s rate over a
"Replica's"
Same thing above as well, I guess.
pkg/kv/kvserver/split/decider.go
line 59 at r5 (raw file):
// NewLoadBasedSplitter returns a new LoadBasedSplitter that may be used to
// find the midpoint based on recorded load.
NewLoadBasedSplitter(time.Time) LoadBasedSplitter
nit: if this can construct LoadBasedSplitters it feels more like a "factory" than a "config".
pkg/roachpb/api.proto
line 2119 at r5 (raw file):
// replica serving the RangeStats request has not been the leaseholder long
// enough to have recorded CPU rates for at least a full measurement period.
// In such cases, the reciipient should not consider the CPU value reliable
s/reciipient/recipient/
Force-pushed from f720659 to 337da71.
Force-pushed from 06abb86 to d1157da.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
How was the 250ms number evaluated? Should we run an experiment like Andrew did in #39687 to tune it to find the right balance between batching and concurrency?
Mostly eyeballed this one whilst running allocbench. I tried a few different values and looked at this metric: https://github.com/kvoli/cockroach/blob/f7206595e95e40b107755180a9438dd9b8bf0fd9/pkg/kv/kvserver/store_rebalancer.go#L48-L48 which indicates that there weren't any available actions but the store was overfull. I verified the number was "reasonable" by using it for tpcc and tpce - where the
Note the 250ms number is only for request cpu (not raft). Here's a comparison of splits running TPCE between master/this branch with 250ms:
The same for allocbench (5 runs of each type, order is r=0/access=skew, r=0/ops=skew, r=50/ops=skew, r=95/access=skew, r=95/ops=skew).
I added a todo for now with the intention to address this in the next 2 weeks.
pkg/kv/kvserver/batcheval/cmd_range_stats.go
line 54 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we need to keep setting this to true for another release?
Probably. If a <23.1 node sees this flag not being set (default=false?), I'm guessing that range merging will occur until the version is finalized.
Updated this to be set. Updated the TODO to stop setting the field in 23.2.
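For illustration, a hedged sketch of the compatibility point (the helper function and the stub type below are hypothetical; only the two response fields come from the snippet quoted above):

```go
package batcheval

// RangeStatsResponse stubs the two response fields referenced above.
type RangeStatsResponse struct {
	MaxQueriesPerSecond    float64
	MaxQueriesPerSecondSet bool
}

// fillQPS is a hypothetical helper showing why the legacy "set" field keeps
// being populated: a <23.1 node that sees it unset (the zero value, false)
// may treat the stat as missing and allow merges before the version is
// finalized.
func fillQPS(reply *RangeStatsResponse, qps float64, ok bool) {
	if ok {
		reply.MaxQueriesPerSecond = qps
	} else {
		reply.MaxQueriesPerSecond = -1 // stat not recorded for a full period
	}
	// TODO: stop setting this once 23.2 is the minimum supported version.
	reply.MaxQueriesPerSecondSet = true
}
```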
pkg/kv/kvserver/batcheval/cmd_range_stats.go
line 54 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is this possible if the cluster switched from QPS-based splitting to CPU-based splitting between the calls to GetMaxSplitQPS and GetMaxSplitCPU above?
It is possible. However, in that case, there is a "warmup" period for the decider stats, so they would both be !OK.
When the objective changes, every decider is reset. It will then be 5 minutes before the max value returned from a decider is OK. So I think this should be okay.
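A minimal sketch of the reset/warm-up behaviour described here (the type, method names, and exact warm-up constant are illustrative, not the actual decider code):

```go
package split

import "time"

// warmupPeriod mirrors the ~5 minute warm-up mentioned above.
const warmupPeriod = 5 * time.Minute

// maxStatDecider is a simplified, hypothetical decider: it tracks a running
// maximum and refuses to report it until a full warm-up period has elapsed
// since the last reset.
type maxStatDecider struct {
	resetAt time.Time // last time the recorded stats were cleared
	max     float64   // running maximum observed since resetAt
}

// Record folds a new measurement into the running maximum.
func (d *maxStatDecider) Record(v float64) {
	if v > d.max {
		d.max = v
	}
}

// Reset clears the recorded maximum, e.g. when the split objective flips
// between QPS and CPU.
func (d *maxStatDecider) Reset(now time.Time) {
	d.resetAt = now
	d.max = 0
}

// Max returns the recorded maximum and whether it covers a full warm-up
// period; callers should treat the value as unreliable when ok is false.
func (d *maxStatDecider) Max(now time.Time) (value float64, ok bool) {
	if now.Sub(d.resetAt) < warmupPeriod {
		return 0, false
	}
	return d.max, true
}
```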
pkg/kv/kvserver/batcheval/eval_context.go
line 92 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"Replica's"
Same thing above as well, I guess.
Updated
pkg/kv/kvserver/split/decider.go
line 59 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: if this can construct LoadBasedSplitters it feels more like a "factory" than a "config".
That's fair - it seemed convenient to bundle these methods into a config. I'm not sure factory suits here either, given the config methods are also present.
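As a sketch of the bundled shape under discussion (only NewLoadBasedSplitter comes from the quoted snippet; the other methods are invented for illustration):

```go
package split

import "time"

// LoadBasedSplitter is the constructed type from the snippet above; its
// methods are elided here.
type LoadBasedSplitter interface{}

// LoadSplitConfig is a hypothetical sketch of a "config that also
// constructs": one factory-style method alongside config-style accessors.
type LoadSplitConfig interface {
	// NewLoadBasedSplitter returns a new LoadBasedSplitter that may be used
	// to find the midpoint based on recorded load.
	NewLoadBasedSplitter(time.Time) LoadBasedSplitter
	// StatRetention returns how long recorded load is retained (a
	// config-style accessor, hence the hesitation to call this a factory).
	StatRetention() time.Duration
	// StatThreshold returns the load above which a range should be split.
	StatThreshold() float64
}
```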
pkg/roachpb/api.proto
line 2119 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/reciipient/recipient/
Updated
Force-pushed from d1157da to d8c7434.
Force-pushed from d8c7434 to 3c26d1a.
once the prereq PR is merged.
Reviewed 1 of 3 files at r7, 18 of 22 files at r8, 21 of 21 files at r10, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist and @kvoli)
Previously, kvoli (Austen) wrote…
Mostly eyeballed this one whilst running allocbench. I tried a few different values and looked at this metric: https://github.com/kvoli/cockroach/blob/f7206595e95e40b107755180a9438dd9b8bf0fd9/pkg/kv/kvserver/store_rebalancer.go#L48-L48 which indicates that there weren't any available actions but the store was overfull. I verified the number was "reasonable" by using it for tpcc and tpce - where the
Note the 250ms number is only for request cpu (not raft). Here's a comparison of splits running TPCE between master/this branch with 250ms:
The same for allocbench (5 runs of each type, order is r=0/access=skew, r=0/ops=skew, r=50/ops=skew, r=95/access=skew, r=95/ops=skew).
I added a todo for now with the intention to address this in the next 2 weeks.
Do you mind opening a github issue that follows from the TODO, just so that we can tag something as a release blocker?
pkg/kv/kvserver/batcheval/cmd_range_stats.go
line 54 at r5 (raw file):
Previously, kvoli (Austen) wrote…
It is possible. However in that case, there is a "warmup" period for the decider stats so they would both be !OK.
When the objective changes, every decider is reset. It will then be 5 minutes before the max value returned from a decider is ok. So I think this should be okay.
I guess it still seems strange that we try to assert this. Would there be a downside to letting both be set in the rare case of an (absurdly long) race? Is it important that we promise to consumers of this API that only one value will be set? Or is it only important that we promise that at least one will be set (unless both are warming up)?
pkg/kv/kvserver/split/decider.go
line 59 at r5 (raw file):
Previously, kvoli (Austen) wrote…
That's fair - it seemed convenient to bundle these methods into a config. I'm not sure if factory suits either here due to the config methods also being present.
Ack. Config seems fine then. Unless LoadSplitThing strikes your fancy.
Force-pushed from 3c26d1a to e6ffdc0.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do you mind opening a github issue that follows from the TODO, just so that we can tag something as a release blocker?
Opened issue #96869 and linked in the TODO.
pkg/kv/kvserver/batcheval/cmd_range_stats.go
line 54 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I guess it still seems strange that we try to assert this. Would there be a downside to letting both be set in the rare case of an (absurdly long) race? Is it important that we promise to consumers of this API that only one value will be set? Or is it only important that we promise that at least one will be set (unless both are warming up)?
I think the promise that at most one will be set is important to the merge queue, but I don't think it is that important to other consumers of the API. Perhaps it is fine to just have the assertion on the other side. There is a sort of liveness property that eventually a value must be set for merges to make progress.
Force-pushed from e6ffdc0 to 1b97efb.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @kvoli)
pkg/kv/kvserver/batcheval/cmd_range_stats.go
line 54 at r5 (raw file):
Perhaps it is fine to just have the assertion on the other side.
Yeah, that's kind of what I was getting at. But even then, why assert? I would expect the logic to be something like the following, but I'm probably missing some complexity:
switch rebalance_objective {
case qps:
	if !lhsQPS.ok || !rhsQPS.ok {
		return // don't merge
	}
	...
case cpu:
	if !lhsCPU.ok || !rhsCPU.ok {
		return // don't merge
	}
	...
}
Force-pushed from 1b97efb to e84aa99.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)
pkg/kv/kvserver/batcheval/cmd_range_stats.go
line 54 at r5 (raw file):
Yeah, that's kind of what I was getting at. But even then, why assert? I would expect the logic to be something like the following, but I'm probably missing some complexity:
I think the scenario I'm trying to avoid is the upgrade case, where some node doesn't populate CPU because it is running an earlier version but it then passes the >= 0 check by being the default value (0). This problem seems solved by first checking the split objective. It was also a bit cumbersome to have two stats for each side moving around in the process loop - for formatting and resetting after merge.
I've updated to remove the assertions and just check the objective manager to ensure things work.
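Roughly, the objective-gated check described here could look like the following sketch (illustrative names only, not the actual merge queue code):

```go
package mergequeue

// splitObjective mirrors the two load-based rebalancing objectives.
type splitObjective int

const (
	objQPS splitObjective = iota
	objCPU
)

// rangeLoad bundles the per-side stats; the ok flags are false while a
// decider is still warming up after a reset.
type rangeLoad struct {
	qps, cpu     float64
	qpsOK, cpuOK bool
}

// shouldSkipMerge consults only the stat matching the current objective, so a
// stale or defaulted value for the other stat can never gate a merge.
func shouldSkipMerge(obj splitObjective, lhs, rhs rangeLoad, threshold float64) bool {
	switch obj {
	case objQPS:
		if !lhs.qpsOK || !rhs.qpsOK {
			return true // don't merge until both sides have reliable QPS
		}
		return lhs.qps+rhs.qps >= threshold
	case objCPU:
		if !lhs.cpuOK || !rhs.cpuOK {
			return true // don't merge until both sides have reliable CPU
		}
		return lhs.cpu+rhs.cpu >= threshold
	default:
		return true
	}
}
```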
Force-pushed from e37cc28 to 7a9ddd5.
Reviewed 21 of 21 files at r21, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist)
Force-pushed from 973036e to e887bf0.
rebased with #96921 to fix CI.
This patch removes the deprecated `lastSplitQPS` value throughout the split/merge code. This field was deprecated in 22.1 in favor of `maxSplitQPS` and stopped being consulted in 22.2. A remaining field, `max_queries_per_second_set`, will remain until 23.2, when it can be removed as no node should consult it. Now only `maxSplitQPS` is consulted and set in `RangeStatsResponse`.

Release note: None
This commit adds the ability to perform load based splitting with replica cpu usage rather than queries per second. Load based splitting will now use either cpu or qps for deciding split points, depending on the cluster setting `kv.allocator.load_based_rebalancing.objective`. When set to `qps`, qps is used in deciding split points and when splitting should occur; similarly, when set to `cpu`, request cpu against the leaseholder replica is used to decide split points. The split threshold when using `cpu` is the cluster setting `kv.range_split.load_cpu_threshold`, which defaults to `250ms` of cpu time per second, i.e. a replica consistently using 1/4 of a processor.

The merge queue uses the load based splitter to make decisions on whether to merge two adjacent ranges due to low load. This commit also updates the merge queue to be consistent with the load based splitter signal. When switching between `qps` and `cpu`, the load based splitter for every replica is reset to avoid spurious results.

resolves: cockroachdb#95377

Release note (ops change): The load based splitter now supports using request cpu usage to split ranges. This is introduced with the previous cluster setting `kv.allocator.load_based_rebalancing.objective`, which, when set to `cpu`, will use request cpu usage. The threshold above which CPU usage of a range is considered for splitting is defined in the cluster setting `kv.range_split.load_cpu_threshold`, which has a default value of `250ms`.
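As a quick back-of-the-envelope check of the threshold semantics above (illustrative only, not splitter code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Default kv.range_split.load_cpu_threshold: 250ms of request CPU per
	// second of wall time, i.e. a quarter of one processor when sustained.
	threshold := 250 * time.Millisecond
	fraction := threshold.Seconds() / time.Second.Seconds()
	fmt.Printf("%v of CPU per second = %.0f%% of one processor\n", threshold, fraction*100)
}
```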
Force-pushed from e887bf0 to 408e227.
rebased with #96946 to fix extended CI
bors r=nvanbenschoten
Build succeeded.
This PR adds the ability to perform load based splitting with replica cpu usage rather than queries per second. Load based splitting will now use either cpu or qps for deciding split points, depending on the cluster setting `kv.allocator.load_based_rebalancing.objective`. When set to `qps`, qps is used in deciding split points and when splitting should occur; similarly, when set to `cpu`, request cpu against the leaseholder replica is used to decide split points.

The split threshold when using `cpu` is the cluster setting `kv.range_split.load_cpu_threshold`, which defaults to `250ms` of cpu time per second, i.e. a replica consistently using 1/4 of a processor.

The merge queue uses the load based splitter to make decisions on whether to merge two adjacent ranges due to low load. This commit also updates the merge queue to be consistent with the load based splitter signal. When switching between `qps` and `cpu`, the load based splitter for every replica is reset to avoid spurious results.

resolves: #95377
resolves: #83519
depends on: #96127

Release note (ops change): The load based splitter now supports using request cpu usage to split ranges. This is introduced with the previous cluster setting `kv.allocator.load_based_rebalancing.objective`, which, when set to `cpu`, will use request cpu usage. The threshold above which CPU usage of a range is considered for splitting is defined in the cluster setting `kv.range_split.load_cpu_threshold`, which has a default value of `250ms`.