Skip to content

Conversation

estroz
Copy link
Member

@estroz estroz commented Dec 10, 2021

Description of the change: parallelized e2e tests (for upstream CI only)

Motivation for the change: current sequential runs are ~1h10m. the longest parallel run is ~24 min, and could be lower if we increase the matrix size. you will hit Amdahl's law quickly from test setup. it's a little hacky but adjusts automatically when new specs are added/removed.

I'll add some tests next week

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive

@openshift-ci openshift-ci bot requested review from ankitathomas and exdx December 10, 2021 23:42
@estroz estroz force-pushed the chore/parallelize-ci-e2e branch from 3d7f26d to b6c047a Compare December 11, 2021 00:48
@estroz
Copy link
Member Author

estroz commented Dec 11, 2021

This was tested on my private fork, see https://github.com/estroz/operator-lifecycle-manager/actions/runs/1565786458

@estroz estroz force-pushed the chore/parallelize-ci-e2e branch from b6c047a to 5175e23 Compare December 11, 2021 01:02
@timflannagan
Copy link
Member

This is really, really (albeit a bit hacky) cool work man - this will infinitely improve the current state of long, frustrating e2e feedback. I just had a couple of questions, and I'd like to land anything that substantially improves the e2e feedback, but we'll need to update the branch protection for our tide configuration (or remove tide entirely which I think is my preference).

@estroz
Copy link
Member Author

estroz commented Dec 14, 2021

@timflannagan the main things I want to add before this merges is tests for the script, and some deduplication by prefix (if both Subscriptions and Subscripts blah blah exist, the latter will be run twice if they are in different chunks). I'll be working on that this week so this can get in asap.

@estroz estroz requested a review from timflannagan December 14, 2021 21:11
@estroz estroz force-pushed the chore/parallelize-ci-e2e branch 3 times, most recently from 64f79c0 to e77482c Compare December 15, 2021 00:18
@timflannagan
Copy link
Member

I think the overall changes are reasonable to me looking back at that this but my brain is pretty much mush at this point. I'd like to land these changes ASAP but we'd also need to land a tide configuration update as well. I should be back to full capacity by tomorrow.

Comment on lines +199 to +207
// wordTrie is a trie of word nodes, instead of individual characters.
type wordTrie map[string]*wordTrieNode

type wordTrieNode struct {
word string
children map[string]*wordTrieNode
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason you settled on a trie implementation vs. a more primitive implementation of sorting test specs and chunking arbitrarily?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Primarily to prevent one spec being run by two different chunks. Imagine running two specs defined by phrases "Foo" and "Foo bar" naively in two chunks: the first chunk (^Foo.+) would run both specs, while the second chunk (^Foo bar.+) would run only the latter spec. By using a word trie, the phrases are reduced down to just the word "Foo", which will run both specs in one chunk. At first the downside appears to be a less even distribution of chunk runtimes, since "Foo" and "Foo bar" will always be run together; however this downside also affects naive chunking because spec phrase prefixes are still shared.

Copy link
Collaborator

@perdasilva perdasilva Jan 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a really naive question, but, why not just use Ginkgo's parallelization: e.g. ginkgo -procs=N?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-procs gives you core parallelism, the actions matrix + chunks gives you core and runner parallelism. Runner parallelism is important here because we cannot choose machine size.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also important to note that the way our CI environment is setup, we create a kind cluster everytime we spawn a new e2e process, so this PR simply builds on top of that existing setup and chunks test specs and feeds those test specs as focus regex groups that an individual kind cluster will then run. There's likely a cleaner solution going forward, but we'd have to shift around some of our current setup to get it working correctly.

@estroz estroz force-pushed the chore/parallelize-ci-e2e branch from e77482c to 2e300ee Compare January 4, 2022 20:22
@estroz estroz requested review from timflannagan and njhale January 4, 2022 20:22
@estroz estroz force-pushed the chore/parallelize-ci-e2e branch from 2e300ee to 2d4506c Compare January 5, 2022 17:48
@estroz estroz requested a review from timflannagan January 5, 2022 17:49
@estroz estroz requested a review from perdasilva January 5, 2022 17:49
@timflannagan
Copy link
Member

timflannagan commented Jan 5, 2022

I think I'm comfortable enough with the implementation to avoid holding up this PR with any further comments given it helps improve QoL quite a bit. I would like to see us open an issue around aggregating the debug/junit/etc. artifacts into a single source, but that can be done asynchronously anyways.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 5, 2022
@timflannagan
Copy link
Member

Interesting, the extra info-level logging is showing up as errors in the run output:

Run make e2e-local E2E_TEST_CHUNK=1 E2E_TEST_NUM_CHUNKS=4 E2E_NODES=2 ARTIFACTS_DIR=./artifacts/
Error: 2022/01/05 20:31:52 test/e2e/e2e_test.go: found no top level describes, skipping
Error: 2022/01/05 20:31:52 test/e2e/like_metric_matcher_test.go: found no top level describes, skipping
Error: 2022/01/05 20:31:52 test/e2e/setup_bare_test.go: found no top level describes, skipping
...

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 9, 2022
@timflannagan timflannagan force-pushed the chore/parallelize-ci-e2e branch from 2d4506c to abde1b0 Compare January 18, 2022 02:26
@openshift-ci openshift-ci bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 18, 2022
@timflannagan timflannagan force-pushed the chore/parallelize-ci-e2e branch 7 times, most recently from 20e7adb to 5121b9e Compare January 18, 2022 03:49
estroz and others added 3 commits January 18, 2022 10:40
Signed-off-by: Eric Stroczynski <ericstroczynski@gmail.com>
Signed-off-by: Eric Stroczynski <ericstroczynski@gmail.com>
Signed-off-by: timflannagan <timflannagan@gmail.com>
@timflannagan
Copy link
Member

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 18, 2022
@openshift-ci
Copy link

openshift-ci bot commented Jan 18, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: estroz, timflannagan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 18, 2022
@openshift-merge-robot openshift-merge-robot merged commit 989690f into operator-framework:master Jan 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants