-
Notifications
You must be signed in to change notification settings - Fork 4k
asim: introduce eval configurations and a chart viewer #150847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
759521a to
59fdf0d
Compare
tbg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained
-- commits line 20 at r4:
this will be called "configs" later.
-- commits line 69 at r10:
This is stale, since the generated files are excluded (with one exception, example_rebalancing, for illustration). I hadn't yet done that when I first wrote this commit, but later rearranged things.
-- commits line 149 at r23:
as in, the files were completely identical, even run under the same config (mma-only) despite the one file name claiming it was running SMA.
I must have broken that sometime in the past.
pkg/kv/kvserver/asim/tests/testdata/generated/example_rebalancing/example_rebalancing_2_qps.png line 0 at r2 (raw file):
These shouldn't have been removed in r2 but in r1. Oh well.
pkg/kv/kvserver/asim/tests/testdata/viewer.html line 4 at r13 (raw file):
<html lang="en"> <head> <meta charset="UTF-8">
Made with Claude Code... I've looked at it a little bit while iterating on it, but I'm definitely not the expert on this. I don't think reviewing this too closely makes sense, better to try it out.
pkg/kv/kvserver/asim/tests/plotting_test.go line 115 at r13 (raw file):
// TODO(tbg): there are too many metrics to print here; they clutter up // the datadriven output. We should rely on assertions instead. if false {
A later commit will allow passing a list of metrics to eval and they will be printed here. It's a compromise for now.
pkg/kv/kvserver/asim/tests/datadriven_simulation_test.go line 180 at r2 (raw file):
// ..US_3 // ....└── [11 12 13 14 15] func TestDataDriven(t *testing.T) {
reminder: need to update the help text above with the updates.
pkg/kv/kvserver/asim/tests/testdata/non_rand/mma_high_cpu_unable_to_shed_leases.txt line 36 at r22 (raw file):
{s1,s3:*,s5}:6 gen_load rate=1000 rw_ratio=1.0 min_key=0 max_key=10000 raft_cpu_per_write=500000
This seems wrong. The node is supposed to be overloaded due to RAFT cpu. Now I turned this into a read-only workload and raft_cpu_per_write isn't even used.
pkg/kv/kvserver/asim/tests/datadriven_simulation_test.go line 441 at r20 (raw file):
// manually. "default": func(*gen.StaticEvents) {}, // 'mma-only' runs with the multi-metric allocator and turns off the
It's worth thinking about these configs. Right now I'm lumping the SMA StoreRebalancer into the "sma" variable. It won't ever make sense to turn on both the legacy and MMA store rebalancer at the same time, so so far it seems to work out - there is no reason to want "both store rebalancers on".
pkg/kv/kvserver/asim/tests/datadriven_simulation_test.go line 562 at r7 (raw file):
return "" case "setting": // NB: delay could be supported for the below settings,
I might be missing something here. Couldn't all setting changes be implemented through events?
pkg/kv/kvserver/asim/tests/datadriven_simulation_test.go line 429 at r4 (raw file):
require.NotZero(t, rangeGen) modes := map[kvserver.LBRebalancingMode]string{
This will be inverted later. Didn't think it through the first time around.
pkg/kv/kvserver/asim/tests/testdata/non_rand/mma_skewed_cpu_skewed_write.txt line 42 at r2 (raw file):
OK print
Need to check if I removed print in the end. I may not have.
pkg/kv/kvserver/asim/tests/default_settings.go line 84 at r15 (raw file):
replicationFactor: defaultReplicationFactor, bytes: defaultBytes, // TODO(tbg): unused, remove.
TODO actually do this.
|
@wenyihu6 this is going to be a bigger one to chew on. I hope the commits are reasonably self-contained. They definitely aren't the direct line: some things I figured out properly only after having done them. I think the amount of backtracking is reasonable, though, so it may not be worth rewriting history. Happy to chat synchronously as well (tomorrow). This PR could be split up if you'd get something out of that; I may do this once I have parts reviewed, depending on how it goes. |
wenyihu6
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exciting! Looks good, only minor comments. Feel free to let me know if I missed anything - some parts might have been outdated by later commits.
Reviewed 2 of 101 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @tbg)
Previously, tbg (Tobias Grieger) wrote…
this will be called "configs" later.
Ack.
Previously, tbg (Tobias Grieger) wrote…
This is stale, since the generated files are excluded (with one exception, example_rebalancing, for illustration). I hadn't yet done that when I first wrote this commit, but later rearranged things.
Ack.
Previously, tbg (Tobias Grieger) wrote…
as in, the files were completely identical, even run under the same config (mma-only) despite the one file name claiming it was running SMA.
I must have broken that sometime in the past.
Ack.
pkg/kv/kvserver/asim/tests/testdata/viewer.html line 4 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Made with Claude Code... I've looked at it a little bit while iterating on it, but I'm definitely not the expert on this. I don't think reviewing this too closely makes sense, better to try it out.
Ack.
pkg/kv/kvserver/asim/tests/testdata/generated/example_rebalancing/example_rebalancing_2_qps.png line at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
These shouldn't have been removed in r2 but in r1. Oh well.
Ack.
pkg/kv/kvserver/asim/tests/testdata/non_rand/mma_high_cpu_unable_to_shed_leases.txt line 36 at r22 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This seems wrong. The node is supposed to be overloaded due to RAFT cpu. Now I turned this into a read-only workload and raft_cpu_per_write isn't even used.
Looks like it used to be
gen_load rate=1000 rw_ratio=0 min_block=0 max_block=0 min_key=0 max_key=10000 raft_cpu_per_write=100
Changed in tbg@46346ad
pkg/kv/kvserver/asim/tests/testdata/non_rand/mma_skewed_cpu_skewed_write.txt line 42 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Need to check if I removed
Looks like your recent commit did it.
pkg/kv/kvserver/asim/tests/datadriven_simulation_test.go line 180 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
reminder: need to update the help text above with the updates.
Ack.
pkg/kv/kvserver/asim/tests/datadriven_simulation_test.go line 429 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This will be inverted later. Didn't think it through the first time around.
Ack.
pkg/kv/kvserver/asim/tests/datadriven_simulation_test.go line 562 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I might be missing something here. Couldn't all setting changes be implemented through events?
I think you’re right - there’s nothing technically wrong with it. I was concerned that a delay of 0 might subtly change behavior since settings are now applied slightly after the simulation starts. It is the first thing we do in a tick
cockroach/pkg/kv/kvserver/asim/asim.go
Line 253 in bd66dc4
| s.tickEvents(ctx, tick) |
pkg/kv/kvserver/asim/tests/datadriven_simulation_test.go line 441 at r20 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It's worth thinking about these configs. Right now I'm lumping the SMA StoreRebalancer into the "sma" variable. It won't ever make sense to turn on both the legacy and MMA store rebalancer at the same time, so so far it seems to work out - there is no reason to want "both store rebalancers on".
Agreed that I don't see a reason having both store rebalancers enabled at the same time. We might switch between them (probably as a delayed event).
pkg/kv/kvserver/asim/tests/datadriven_simulation_test.go line 233 at r45 (raw file):
float64(requestCPUPerAccess)*rwRatio
Should we remove *rwRatio since requestCPUPerAccess applies to both reads and writes (not just reads)?
cockroach/pkg/kv/kvserver/asim/workload/workload.go
Lines 139 to 158 in df2d9b2
| for read := 0; read < reads; read++ { | |
| size := int64(rwg.rand.Intn(rwg.maxSize-rwg.minSize+1) + rwg.minSize) | |
| key := rwg.keyGenerator.readKey() | |
| event := next[key] | |
| event.Reads++ | |
| event.ReadSize += size | |
| event.RequestCPU += rwg.requestCPUPerAccess | |
| next[key] = event | |
| } | |
| for write := 0; write < writes; write++ { | |
| size := int64(rwg.rand.Intn(rwg.maxSize-rwg.minSize+1) + rwg.minSize) | |
| key := rwg.keyGenerator.writeKey() | |
| event := next[key] | |
| event.Writes++ | |
| event.WriteSize += size | |
| event.RequestCPU += rwg.requestCPUPerAccess | |
| event.RaftCPU += rwg.raftCPUPerWrite | |
| next[key] = event | |
| } |
pkg/kv/kvserver/asim/tests/plotting_test.go line 115 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
A later commit will allow passing a list of metrics to
evaland they will be printed here. It's a compromise for now.
Is this comment referring to the fact that we’re wrapping code inside if false blocks here?
pkg/kv/kvserver/asim/tests/testdata/non_rand/example_splitting.txt line 15 at r46 (raw file):
# threshold set above. Also set to use only reads, to avoid size based splits # (we could also disable this via setting the range max size). gen_load rate=10000 rw_ratio=1.0 raft_cpu_per_write=500000
Are we setting raft cpu with a read only workload?
pkg/kv/kvserver/asim/tests/testdata/non_rand/example_splitting.txt line 43 at r46 (raw file):
# number of splits as the uniform workload. However, that is not the case as we # require more splits with a zipfian distribution. gen_load rate=10000 rw_ratio=1.0 access_skew=true replace=true raft_cpu_per_write=500000
Ditto.
pkg/kv/kvserver/asim/tests/testdata/non_rand/mma_constraint_satisfaction1.txt line 6 at r45 (raw file):
# specify zone survivability (2, 2, 1) replicas across 3 regions. # a(n1.n2) b(n3,n4) c(n5) d(n6-n9) gen_cluster nodes=9 region=(a,b,c,d) nodes_per_region=(3,2,1,3) node_cpu_rate_capacity=50000
Did a warning fire for this 50k ns-cpu/sec setting? Seems too low. It’s not part of your PR tho (we can do it in a follow-up unless you want to do it now).
pkg/kv/kvserver/asim/tests/testdata/non_rand/load_distribution_movement_disabled_enable_later.txt line 2 at r38 (raw file):
# Disable all lease and replica movement. setting rebalance_mode=0 replicate_queue_enabled=false lease_queue_enabled=false split_queue_enabled=false
Perhaps this is the intended behavior - feel free to ignore if so. Just noting that sma-only was running here with all other queues disabled originally.
pkg/kv/kvserver/asim/tests/datadriven_simulation_test.go line 592 at r4 (raw file):
require.GreaterOrEqual(t, len(runs[len(runs)-1].hs), sample) h := runs[len(runs)-1].hs[sample-1]
Is this code reachable? iiuc, it also only plots the last simulation run for the last mode? This comment might become irrelevant later since I saw your demo with multiple plots in which case feel free to ignore.
pkg/kv/kvserver/asim/tests/testdata/non_rand/example_skewed_cpu_even_ranges_mma_and_queues.txt line 34 at r22 (raw file):
---- eval duration=35m samples=1 seed=42 cfgs=(mma-only) metrics=(cpu,leases,replicas,write_bytes_per_second)
This might be the intended behavior in which case feel free to ignore me - just noting that purely based on this diff, it used to enable other rebalancing queues (which should be cfg=both)?
tbg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dismissed @wenyihu6 from 3 discussions.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/kv/kvserver/asim/tests/default_settings.go line 84 at r15 (raw file):
Previously, tbg (Tobias Grieger) wrote…
TODO actually do this.
Done on next push.
pkg/kv/kvserver/asim/tests/testdata/non_rand/example_skewed_cpu_even_ranges_mma_and_queues.txt line 34 at r22 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
This might be the intended behavior in which case feel free to ignore me - just noting that purely based on this diff, it used to enable other rebalancing queues (which should be cfg=both)?
I'll double check, but at least at the end of this PR we have
example_skewed_cpu_even_ranges_mma.txt
with
eval duration=35m samples=1 seed=42 cfgs=(sma-only,mma-only,both) metrics=(cpu,leases,replicas,write\_bytes\_per\_second)
so we have a version with both at least by the time the dust settles.
pkg/kv/kvserver/asim/tests/testdata/non_rand/example_splitting.txt line 15 at r46 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
Are we setting raft cpu with a read only workload?
Not sure where this is coming from 😅 removed with next push.
pkg/kv/kvserver/asim/tests/testdata/non_rand/example_splitting.txt line 43 at r46 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
Ditto.
Done.
pkg/kv/kvserver/asim/tests/testdata/non_rand/load_distribution_movement_disabled_enable_later.txt line 2 at r38 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
Perhaps this is the intended behavior - feel free to ignore if so. Just noting that sma-only was running here with all other queues disabled originally.
Sorry to slip this in. Yeah, I thought when looking at this that it seemed silly to run a ten node simulation without any activity for the first minutes. Do you understand why the test may originally have looked like that?
pkg/kv/kvserver/asim/tests/testdata/non_rand/mma_constraint_satisfaction1.txt line 6 at r45 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
Did a warning fire for this 50k ns-cpu/sec setting? Seems too low. It’s not part of your PR tho (we can do it in a follow-up unless you want to do it now).
Leaving this comment open to fix this up.
pkg/kv/kvserver/asim/tests/testdata/non_rand/mma_high_cpu_unable_to_shed_leases.txt line 36 at r22 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
Looks like it used to be
gen_load rate=1000 rw_ratio=0 min_block=0 max_block=0 min_key=0 max_key=10000 raft_cpu_per_write=100Changed in tbg@46346ad
Fixed with next push.
pkg/kv/kvserver/asim/tests/datadriven_simulation_test.go line 180 at r2 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
Ack.
Just did a pass and it looked up to date.
pkg/kv/kvserver/asim/tests/datadriven_simulation_test.go line 592 at r4 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
Is this code reachable? iiuc, it also only plots the last simulation run for the last mode? This comment might become irrelevant later since I saw your demo with multiple plots in which case feel free to ignore.
No, wasn't reachable at that point, and it's gone by the end of the PR.
pkg/kv/kvserver/asim/tests/datadriven_simulation_test.go line 233 at r45 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
float64(requestCPUPerAccess)*rwRatio
Should we remove *rwRatio since requestCPUPerAccess applies to both reads and writes (not just reads)?
cockroach/pkg/kv/kvserver/asim/workload/workload.go
Lines 139 to 158 in df2d9b2
for read := 0; read < reads; read++ { size := int64(rwg.rand.Intn(rwg.maxSize-rwg.minSize+1) + rwg.minSize) key := rwg.keyGenerator.readKey() event := next[key] event.Reads++ event.ReadSize += size event.RequestCPU += rwg.requestCPUPerAccess next[key] = event } for write := 0; write < writes; write++ { size := int64(rwg.rand.Intn(rwg.maxSize-rwg.minSize+1) + rwg.minSize) key := rwg.keyGenerator.writeKey() event := next[key] event.Writes++ event.WriteSize += size event.RequestCPU += rwg.requestCPUPerAccess event.RaftCPU += rwg.raftCPUPerWrite next[key] = event }
Done w/ next push.
pkg/kv/kvserver/asim/tests/plotting_test.go line 115 at r13 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
Is this comment referring to the fact that we’re wrapping code inside
if falseblocks here?
Yeah, pointing out the temporary hack which is gone by the end of this PR.
tbg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dismissed @wenyihu6 from a discussion.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/kv/kvserver/asim/tests/testdata/non_rand/mma_constraint_satisfaction1.txt line 6 at r45 (raw file):
Done with last push.
wenyihu6
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/asim/tests/testdata/non_rand/example_skewed_cpu_even_ranges_mma_and_queues.txt line 34 at r22 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'll double check, but at least at the end of this PR we have
example_skewed_cpu_even_ranges_mma.txt
with
eval duration=35m samples=1 seed=42 cfgs=(sma-only,mma-only,both) metrics=(cpu,leases,replicas,write\_bytes\_per\_second)so we have a version with
bothat least by the time the dust settles.
Nice, no change needed then.
pkg/kv/kvserver/asim/tests/testdata/non_rand/load_distribution_movement_disabled_enable_later.txt line 2 at r38 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Sorry to slip this in. Yeah, I thought when looking at this that it seemed silly to run a ten node simulation without any activity for the first minutes. Do you understand why the test may originally have looked like that?
Agreed what you did here makes sense to me. I might have done it this way becaus mma doesn't interact well with other rebalancing queues, so I wanted to ensure it ran with the others disabled. And to get a more isolated comparison of just the store rebalancer between the existing and mma store rebalancer, I just disabled all other rebalancing queues throughout the test. 10 mins without any activity was probably silly tho.
pkg/kv/kvserver/asim/tests/plotting_test.go line 115 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Yeah, pointing out the temporary hack which is gone by the end of this PR.
Makes sense.
150972: orchestration: released CockroachDB version v25.2.3. Next version: v25.2.4 r=celiala a=cockroach-teamcity Release note: None Epic: None Release justification: non-production (release infra) change. 151169: sniffarg: unconditionally convert to double-dash args r=tbg a=tbg The old code was being dumb while trying to be too clever - it tried to do only the minimal amount of rewriting, but really should've just rewritten all the flags to `--format` instead. Found during #150847. Epic: CRDB-25222 Co-authored-by: Justin Beaver <teamcity@cockroachlabs.com> Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
|
TFTR! I just rebased (needed to push a commit that resolves a diff - probably similar to #151143) and will merge tomorrow after I do a final pass and find everything in good order. |
|
TFTR! bors r+ The CI failures are unrelated to this code. I pinged both issues. |
|
bors r- need another rebase first. |
|
Canceled. |
This turns `plot`, `topology`, `print` into no-ops temporarily so that we can better refactor the `eval` command to run multiple iterations of the test under different configurations.
This introduces the ability to `eval` a given configuration across multiple "modes" (i.e. configurations) such as SMA vs MMA. For now, to keep the change minimal, it runs each test file in a "default" mode which just does exactly what it did before. In follow-up commits, we'll gradually convert test files to select the appropriate mode(s).
We're going to want to set a setting via an event depending on the eval mode, but the static event generator isn't reusable enough. I haven't quite figured out why we need it in the first place (over a slice of events, as introduced here), so potentially it can be removed at some point. For now, I left it alone.
We have since introduced inter-test concurrency, that ought to be enough, for now.
A delay of zero is a fine delay.
Rebrand the previously introduced stub "modes" as "configurations", which is hopefully a more evocative name. Also, prepare them for being able to set cluster settings. We'll use this in upcoming commits.
Port all the tests and remove the plot command.
This leaves only a few tests that manually set rebalance_mode, to be handled in follow-ups.
Save for unintentional drift, they are the same spec, just under different configurations.
Before this fix, the storeID on the graphs started at zero.
Strips the .txt suffix.
It was using two evals that were hard to parse. The second test didn't really do very much. In both, the CPU usage was basically nonexistent. Instead, do a single test that has a physically separated 5 vcpu read and a 10mb/s write workload. Let them run for two minutes under the single-metric allocator, then also enable the multi-metric allocator. It's hard to see from the testdata output, but the charts show that write and CPU both converge, along with lease count, qps, and replica count.
It wasn't enabling any queue, so replicas were just sitting around. This isn't worth the CPU time.
This prints a warning in the tests whenever "unrealistic" CPUs or loads are provisioned. We can work through this list and pick them off one by one.
I also realized some of these were not actually running what the filename claimed. It was all a mess: now there's just one file, running in SMA, MMA, and combined config.
The files were identical in their inputs, except that the deleted file used the `both` config. We run that one in the remaining file now.
The asim tests run in parallel now, and this was erasing a log marker that allows us to separate the log lines by test case.
It was accidentally added to a read-only workload.
It was accidentally turned into a read-only workload.
Writes also consume the configured request access CPU. The code was written under the assumption that only reads would.
We can address them in a separate PR.
Some output changed while rebasing, and the changes are not qualitatively meaningful. This might be similar to cockroachdb#151143. Epic: CRDB-25222
|
bors r+ |
|
🕐 Waiting for PR status (GitHub check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set. |
|
This PR was included in a batch that successfully built, but then failed to merge into master (it was a non-fast-forward update). It will be automatically retried. |

This PR
I think some more work is needed to tell, based on the datadriven files along, whether the configured allocator is "doing a good enough job". But at least there are fewer test scenarios now, and none of them configure obviously bogus loads or capacities.
Epic: CRDB-52631