Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: Add blue/green deployment test #24142

Merged
merged 2 commits into from
Jan 2, 2024

Conversation

def-
Copy link
Contributor

@def- def- commented Dec 28, 2023

Following https://www.notion.so/materialize/Testing-Plan-Blue-Green-Deployments-01528a1eec3b42c3a25d5faaff7a9bf9#f53b51b110b044859bf954afc771c63a

I've seen latencies go to 3 seconds, so hope the 5 seconds limit is ok.

There isn't really a useful check for correctness at the moment.

I'm wondering if the > 60 s when waiting for pending dataflows is acceptable.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@@ -205,7 +205,7 @@ def workflow_test_github_12251(c: Composition) -> None:
c.down(destroy_volumes=True)
c.up("materialized")

start_time = time.process_time()
start_time = time.time()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

process_time doesn't include the time waiting for Mz to return results: https://docs.python.org/3/library/time.html#time.process_time

@def- def- force-pushed the pr-blue-green-deployment-test branch from ac8e1a0 to e8f6247 Compare December 29, 2023 11:25
@def-
Copy link
Contributor Author

def- commented Dec 29, 2023

I now got the test green by bumping the testdrive timeout to 5 minutes and the load generator interval to 1s, with 100ms was often seeing dataflows pending. I'm wondering if the way we measure dataflows pending doesn't work for cases where data comes faster than once a second.

FOR ALL TABLES
> CREATE MATERIALIZED VIEW prod_deploy.tpch_mv
IN CLUSTER prod_deploy AS
SELECT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may wish to have a dedicated constant column that distinguishes the original view from the swapped-in one. Then check that the value of that column is as expected post-swap.

STORAGE ADDRESSES ['clusterd1:2103'],
COMPUTECTL ADDRESSES ['clusterd1:2101'],
COMPUTE ADDRESSES ['clusterd1:2102'],
WORKERS 1))
Copy link
Contributor

@philip-stoev philip-stoev Dec 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would more workers and/or more storageds make this test more stessful/realistic?

For example, have the two clusters have different sizes/shapes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. My understanding is that one of the motivations is also changing the schema without any downtime. I'll try a different cluster though.

while running:
total_runtime = 0
queries = [
"SELECT * FROM prod.counter_mv",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need to direct those SELECTs to a particular cluster, or the default one is good enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my understanding default should be good enough.

threads = [PropagatingThread(target=fn) for fn in (selects, subscribe)]
for thread in threads:
thread.start()
time.sleep(3) # some time to make sure the queries run fine
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this interval may need to be longer. Or, alternatively, some check is needed to confirm that every type of query ran at least once post-swap.

# by the Apache License, Version 2.0.

> CREATE SCHEMA prod_deploy
> CREATE SOURCE prod_deploy.counter IN CLUSTER prod_deploy FROM LOAD GENERATOR counter (TICK INTERVAL '1s')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use a sub-second TICK INTERVAL here, like 0.01s, to make sure stuff actively happens in the system all the time while the swap is in progress.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this before and it always led to the pending dataflows never becoming considered ready, since their lag would stay at > 1 s. I'm not sure if this is a limitation of Mz or the way that we check the pending dataflows.

@benesch benesch removed their request for review December 31, 2023 17:22
@benesch
Copy link
Member

benesch commented Dec 31, 2023

Won't have time to review, sorry!

@def- def- force-pushed the pr-blue-green-deployment-test branch from e8f6247 to 323a66b Compare January 2, 2024 11:25
@def- def- merged commit a4bd149 into MaterializeInc:main Jan 2, 2024
12 checks passed
@def- def- deleted the pr-blue-green-deployment-test branch January 2, 2024 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants