Skip to content

CI: End-to-end instance launch #1608

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Oct 29, 2022
Merged

CI: End-to-end instance launch #1608

merged 9 commits into from
Oct 29, 2022

Conversation

iliana
Copy link
Contributor

@iliana iliana commented Aug 16, 2022

As demoed Friday, here's a test that launches a Debian instance in lab CI and successfully SSHes into it. There's hopefully enough infrastructure here to enable others to write lab tests as well; some things could definitely be moved into mod helpers but I'd like to avoid trying to predict what needs to be moved into helper functions.

Please also review oxidecomputer/sodiumoxide@c2546d1, which are changes to libsodium-sys's build.rs to support illumos. Once this lands I will try to send that upstream so we can stop maintaining a fork (welp, they archived the upstream repo). (libsodium-sys is a required component of thrussh, the library we are using as an SSH client in these tests so that we don't have to instrument the OpenSSH client for tests.)

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

I haven't taken a close look at the test but I like the approach of using a non-default workspace crate for running these tests. It's been in the back of my mind how we'd run more complicated integration tests like this in a way that didn't trip up a standard cargo test, but were still well integrated into the workflows and tooling that people expect (i.e., still runnable by cargo test). This looks like a good way to do that.

@iliana iliana force-pushed the end-to-end-instance-launch branch from 1883391 to 8da2f10 Compare August 19, 2022 22:52
@iliana iliana requested a review from davepacheco August 19, 2022 22:53
Copy link
Contributor

@luqmana luqmana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet this is awesome! It'll be great to have actual end-to-end tests now in addition to the sled-agent-sim ones! Left a couple small comments but otherwise looks good to me

http://catacomb.eng.oxide.computer:12346/tcpproxy
pfexec chmod +x /usr/oxide/tcpproxy
pfexec rm -f /var/svc/manifest/site/tcpproxy.xml
pfexec curl -sSfL -o /var/svc/manifest/site/tcpproxy.xml \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just include tcpproxy.xml in the omicron repo so that we don't run the risk of a change made to the remote copy breaking CI with no in-repo changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the moment I'm avoiding this because I'd rather have the tcpproxy binary and its service manifest in sync in the same place. (But I'm making a note to eventually build tcpproxy in CI perhaps?)

// TODO: not really sure about a good heuristic for selecting an IP address
// range here. in both my (iliana's) environment and the lab, the last octet
// is 20; in my environment the DHCP range is 100-249, and in the buildomat
// lab environment the network is currently private.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Carving some space out of the same subnet as Nexus seems alright for now. Besides Nexus (and the global zone) was there anything else we'd conflict with?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mostly worried about home network setups for people running these tests locally. I don't think there's a whole lot I can do there though.

(I've idly been wondering if we should have an out-of-repo config we look for describing a user's network setup.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I've idly been wondering if we should have an out-of-repo config we look for describing a user's network setup.)

Yes! I think that or even just something we stick in the .gitignore. Since the configs are part the repo today any modifications you make for your local testing (e.g. mac gateway) pollute git status.

@iliana
Copy link
Contributor Author

iliana commented Sep 1, 2022

Before I merge, I'm looking into why, after a merge, the tests are now failing locally (bootstrap is timing out) and in CI.

@iliana iliana force-pushed the end-to-end-instance-launch branch from 8da2f10 to 35d8ec6 Compare October 21, 2022 00:29
@zephraph
Copy link
Contributor

I've been working on a series of end-to-end tests which can drive the console against a live instance of Nexus. Once this is in, I'd like to setup a GitHub action to drive those tests against these instances.

It's a bit of a long-tail in terms of CI and not something you'd want to wait on for development. I think it'd be helpful to run it against main at least.

@iliana
Copy link
Contributor Author

iliana commented Oct 21, 2022

That could be interesting — ultimately the long tail in this current test is the amount of time it takes to build and package the control plane (22 min in this run), and then about 5 minutes to deploy the control plane on a lab machine. We do have to stand up the control plane on a lab machine, of which we only have two currently, so there's not a lot of concurrency there; but if we do want to run a separate lab machine that is an optional CI step to run the console end-to-end tests, we certainly could in theory.

@@ -114,3 +116,13 @@ panic = "abort"
[patch.crates-io.pq-sys]
git = 'https://github.com/oxidecomputer/pq-sys'
branch = "oxide/omicron"

#
# libsodium-sys is a dependency of thrussh, used in `end-to-end-tests`. We
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We chatted about this off github, but for posterity's sake:

  • We wanted libsodium because it's a dependency of thrussh
  • We use thrussh in the end-to-end test
  • Thrussh also looks deprecated

We should at least file an issue to follow-up here, and minimize our use of unsupported deps.

@iliana iliana force-pushed the end-to-end-instance-launch branch from d92d0f3 to 30d39df Compare October 28, 2022 23:46
@iliana iliana enabled auto-merge (squash) October 29, 2022 00:25
@iliana iliana merged commit daf316d into main Oct 29, 2022
@iliana iliana deleted the end-to-end-instance-launch branch October 29, 2022 00:31
leftwo pushed a commit that referenced this pull request Jan 27, 2025
Crucible changes are:
Allow read only activation with less than three downstairs (#1608)
Tweaks to automatic flush (#1613)
Update Rust crate twox-hash to v2 (#1547)
Remove `LastFlushAck` (#1603)
Correctly print 'connecting' state (#1612)
Make live-repair part of invariants checks (#1610)
Simplify mend region selection (#1606)
Generic read test for crutest (#1609)
Always remove skipped jobs from dependencies (#1604)
Add libsqlite3-dev install step to Github Actions CI (#1607)
Move Nexus notification to standalone task (#1584)
DTrace cleanup. (#1602)
Reset completed work Downstairs on a `Barrier` operation (#1601)
Upstairs state machine refactoring (3/3) (#1577)

Propolis changes are:
Wire up initial support for AMD perf counters
build: upgrade tokio to 1.40.0 (#836)
build: explicitly install libsqlite3-dev in CI (#834)
add JSON output format to cpuid-gen (#832)
leftwo added a commit that referenced this pull request Jan 27, 2025
Crucible changes are:
Allow read only activation with less than three downstairs (#1608)
Tweaks to automatic flush (#1613)
Update Rust crate twox-hash to v2 (#1547)
Remove `LastFlushAck` (#1603)
Correctly print 'connecting' state (#1612)
Make live-repair part of invariants checks (#1610) Simplify mend region
selection (#1606)
Generic read test for crutest (#1609)
Always remove skipped jobs from dependencies (#1604) Add libsqlite3-dev
install step to Github Actions CI (#1607) Move Nexus notification to
standalone task (#1584) DTrace cleanup. (#1602)
Reset completed work Downstairs on a `Barrier` operation (#1601)
Upstairs state machine refactoring (3/3) (#1577)

Propolis changes are:
Wire up initial support for AMD perf counters
build: upgrade tokio to 1.40.0 (#836)
build: explicitly install libsqlite3-dev in CI (#834) add JSON output
format to cpuid-gen (#832)

---------

Co-authored-by: Alan Hanson <alan@oxide.computer>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants