test: Add blue/green deployment test #24142
Conversation
@@ -205,7 +205,7 @@ def workflow_test_github_12251(c: Composition) -> None:
     c.down(destroy_volumes=True)
     c.up("materialized")

-    start_time = time.process_time()
+    start_time = time.time()
`process_time` doesn't include the time spent waiting for Mz to return results: https://docs.python.org/3/library/time.html#time.process_time
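For illustration, a minimal Python sketch of the difference, where the sleep stands in for blocking on a query against Mz:

```python
import time

start_wall = time.time()
start_cpu = time.process_time()
time.sleep(1)  # stand-in for waiting on Mz to return results
print(f"wall clock: {time.time() - start_wall:.2f}s")         # ~1.00s
print(f"CPU time:   {time.process_time() - start_cpu:.2f}s")  # ~0.00s; sleeping is not CPU work
```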
Force-pushed from ac8e1a0 to e8f6247.
I now got the test green by bumping the testdrive timeout to 5 minutes and the load generator interval to 1 s; with 100 ms I was often seeing dataflows pending. I'm wondering if the way we measure pending dataflows doesn't work for cases where data comes in faster than once a second.
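For what it's worth, a hedged sketch of how a lag-based readiness check could be phrased. Both `mz_internal.mz_materialization_lag` (an unstable internal relation) and the `sql_query` helper are assumptions about what's available here, not necessarily what this test actually uses:

```python
# Assumption: `c` is the mzcompose Composition and this build exposes
# mz_internal.mz_materialization_lag. A dataflow counts as pending while
# its local lag exceeds 1s, so a 100ms tick may never look "ready".
rows = c.sql_query(
    "SELECT object_id, local_lag FROM mz_internal.mz_materialization_lag"
    " WHERE local_lag > INTERVAL '1 second'"
)
assert not rows, f"dataflows still pending: {rows}"
```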
  FOR ALL TABLES
> CREATE MATERIALIZED VIEW prod_deploy.tpch_mv
  IN CLUSTER prod_deploy AS
  SELECT
You may wish to have a dedicated constant column that distinguishes the original view from the swapped-in one. Then check that the value of that column is as expected post-swap.
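Concretely, the suggestion could look something like this sketch; the `generation` column name and the FROM clause are illustrative, and `c.sql`/`c.sql_query` are assumed mzcompose helpers:

```python
# Bake a constant marker into each generation of the view so the swap is
# observable from query results ('green' marks the swapped-in generation).
c.sql(
    "CREATE MATERIALIZED VIEW prod_deploy.tpch_mv IN CLUSTER prod_deploy AS"
    " SELECT 'green' AS generation, count(*) AS cnt FROM prod_deploy.counter"
)
# Post-swap, the view reachable under the prod schema should be the new one:
rows = c.sql_query("SELECT DISTINCT generation FROM prod.tpch_mv")
assert [r[0] for r in rows] == ["green"], rows
```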
    STORAGE ADDRESSES ['clusterd1:2103'],
    COMPUTECTL ADDRESSES ['clusterd1:2101'],
    COMPUTE ADDRESSES ['clusterd1:2102'],
    WORKERS 1))
Would more workers and/or more storageds make this test more stressful/realistic? For example, could the two clusters have different sizes/shapes?
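For instance, the standby cluster could be given a different shape. A sketch, where the second clusterd name, its ports, and the worker count are all made up for illustration:

```python
# Assumption: a second clusterd container "clusterd2" exists in the
# composition; give the standby cluster more workers than the original.
c.sql(
    "CREATE CLUSTER prod_deploy REPLICAS (replica1 ("
    " STORAGECTL ADDRESSES ['clusterd2:2100'],"
    " STORAGE ADDRESSES ['clusterd2:2103'],"
    " COMPUTECTL ADDRESSES ['clusterd2:2101'],"
    " COMPUTE ADDRESSES ['clusterd2:2102'],"
    " WORKERS 2))"
)
```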
I'm not sure. My understanding is that one of the motivations is also changing the schema without any downtime. I'll try a different cluster though.
while running:
    total_runtime = 0
    queries = [
        "SELECT * FROM prod.counter_mv",
Do you need to direct those SELECTs to a particular cluster, or is the default one good enough?
In my understanding, the default should be good enough.
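If pinning ever becomes necessary, it's a one-line session setting. A sketch assuming a pg8000 connection; the driver choice and connection parameters are illustrative:

```python
import pg8000  # assumption: driver and connection details are illustrative

conn = pg8000.connect(host="localhost", port=6875, user="materialize")
cur = conn.cursor()
cur.execute("SET cluster = prod")  # pin this session to the prod cluster
cur.execute("SELECT * FROM prod.counter_mv")
print(cur.fetchall())
```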
test/cluster/mzcompose.py (outdated)
threads = [PropagatingThread(target=fn) for fn in (selects, subscribe)]
for thread in threads:
    thread.start()
time.sleep(3)  # some time to make sure the queries run fine
I think this interval may need to be longer. Or, alternatively, some check is needed to confirm that every type of query ran at least once post-swap.
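A self-contained sketch of that check: count, per query type, how many runs happened after the swap, and assert each ran at least once (`run_query` is a stand-in for the test's real SELECT/SUBSCRIBE helpers):

```python
import threading
import time

queries = ["SELECT * FROM prod.counter_mv", "SELECT count(*) FROM prod.counter"]
swap_done = threading.Event()
stop = threading.Event()
ran_after_swap = {q: 0 for q in queries}

def run_query(query: str) -> None:
    time.sleep(0.01)  # placeholder for executing the query against Mz

def selects() -> None:
    while not stop.is_set():
        for query in queries:
            run_query(query)
            if swap_done.is_set():
                ran_after_swap[query] += 1

thread = threading.Thread(target=selects)
thread.start()
swap_done.set()  # in the real test, set this right after the swap commits
time.sleep(1)    # grace period before asserting, instead of a blind sleep
stop.set()
thread.join()
assert all(n > 0 for n in ran_after_swap.values()), ran_after_swap
```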
# by the Apache License, Version 2.0.

> CREATE SCHEMA prod_deploy
> CREATE SOURCE prod_deploy.counter IN CLUSTER prod_deploy FROM LOAD GENERATOR counter (TICK INTERVAL '1s')
You can use a sub-second TICK INTERVAL here, like 0.01s, to make sure stuff actively happens in the system all the time while the swap is in progress.
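That is, something along these lines (a sketch, assuming the `c.sql` helper):

```python
c.sql(
    "CREATE SOURCE prod_deploy.counter IN CLUSTER prod_deploy"
    " FROM LOAD GENERATOR COUNTER (TICK INTERVAL '0.01s')"
)
```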
I tried this before and it always led to the pending dataflows never being considered ready, since their lag would stay at > 1 s. I'm not sure if this is a limitation of Mz or of the way we check the pending dataflows.
Won't have time to review, sorry!
Force-pushed from e8f6247 to 323a66b.
Following https://www.notion.so/materialize/Testing-Plan-Blue-Green-Deployments-01528a1eec3b42c3a25d5faaff7a9bf9#f53b51b110b044859bf954afc771c63a
I've seen latencies go up to 3 seconds, so I hope the 5 second limit is OK.
There isn't really a useful check for correctness at the moment.
I'm wondering if the > 60 s wait for pending dataflows is acceptable.
Checklist

- If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.