add background tasks for propagating DNS #2797

davepacheco · 2023-04-10T18:58:01Z

(re-do of #2753, this version depends on #2796)

This change adds DNS propagation to Nexus, by adding:

a small subsystem in Nexus for running background tasks, which are activated on-demand and periodically
a background task for monitoring the DNS configuration in CockroachDB
a background task for monitoring the list of DNS servers in CockroachDB
a background task for propagating the DNS configuration to the list of DNS servers

See also: RFD 367, RFD 373

Still to-do on this:

add a bunch more tests

davepacheco · 2023-04-12T16:32:12Z

Sorry, I'm going to have to rebase and force push this one because even though GitHub was showing the correct diff before, now it's showing the wrong thing again.

…ialization

davepacheco · 2023-04-12T16:45:55Z

nexus/test-utils/src/lib.rs

-    );
+    // Begin starting Nexus.
+    let (nexus_internal, nexus_internal_addr) =
+        N::start_internal(&config, &logctx.log).await;


I had to split start_and_populate() into two steps here. Previously, this would start Nexus, do rack initialization, and wait for that to finish. But rack initialization for the tests now expects the configuration for the DNS servers. That's not known until we start the DNS servers. That happens when we start the simulated sled agent. That requires knowing Nexus's internal IP...which we can't know until we start it. There was a circular dependency here. So I split this into two steps: one starts Nexus so we can grab its internal API and start the simulated sled agent, and the next step does rack initialization and finishes starting Nexus.

This is also the reason for the changes to InternalServer above. I tried to keep it borrowing the Config, but ran into a rustc panic (to-be-filed), so I just had it copy the config so that it's now 'static.

davepacheco · 2023-04-14T00:15:04Z

I think this is finally ready for review! It looks big but I think probably half of the code are tests and another good chunk are docs.

smklein

I'm really impressed by both:

How complex of a problem this is, and
How elegant of a solution this is

The API is clear, the activation is straightforward, and I can see how the dependency pattern lets us avoid re-doing unnecessary work rather efficiently, by caching inputs through the watchers, and only activating "the tasks that need activating".

Thanks for taking this on! I'm excited to use it.

smklein · 2023-04-14T21:08:23Z

nexus/src/app/background/common.rs

+//!
+//! * providing a way for Nexus at-large to activate your task
+//! * activating your task periodically
+//! * ensuring that the task is activated only once at a time


To be clear, that's "only once at a time" per Nexus, right? So if multiple Nexus instances exist, they'll all do the task concurrently?

As discussed in chat: this is intentional, but IMO we should clarify this bit -- callers should not expect mutual exclusivity.

Good call. Clarified in ba037e0.

smklein · 2023-04-14T21:20:10Z

nexus/src/app/background/common.rs

+    fn activate<'a, 'b, 'c>(
+        &'a mut self,
+        opctx: &'b OpContext,
+    ) -> BoxFuture<'c, serde_json::Value>


What's the rationale behind letting background tasks emit output data? Just debugging?

Given that tasks can trigger other tasks, I wanted to confirm there's no "feed-forward" data, since it seems like that would violate the intended "no arguments" property for these tasks.

Okay, after reading through the rest of the code, I ironically have found the opposite to be true.

It seems like a really common pattern is:

Tasks exist in-memory, and may be executed periodically

Tasks often have dependencies

Tasks often communicate using the tokio watch primitive, as a way to grab "whatever was the last value we saw from the prior task" to avoid repeating unnecessary work

I wonder if we want to make this output value a bit more first-class, and have tasks emit values that can be watched by downstream consumers?

I don't actually think we should implement this now, but it's food for thought as we get more consumers of the API

Yeah, good points. It's pretty subtle trying to distinguish between "inputs that represent hints/shortcuts about what we're supposed to do" (which we're trying to avoid) and "inputs that represent the full canonical state, which just happens to have come from another task instead of the database [just because it's cleaner to factor things that way]".

smklein · 2023-04-14T21:22:08Z

nexus/src/app/background/common.rs

+        wait_until_count(rx1.clone(), 4).await;
+        assert_eq!(*rx1.borrow(), 4);


The value is still counting here -- couldn't it bump up to 5 between these lines?

(I guess it's unlikely we context switch away for 100ms here...)

I think that's right. I loosened this check in ba037e0. I don't think the other checks are affected because the one at L633 already accounts for this and the other checks use a task with a long period that we expect won't fire during the test.

nexus/src/app/background/common.rs

smklein · 2023-04-14T22:38:09Z

nexus/src/app/background/dns_propagation.rs

+        assert!(result[1].is_ok());
+        s1.verify_and_clear();
+        s2.verify_and_clear();
+    }


(No action item) I'm finding these tests exceptionally readable - the combination of the API you made + httptest is chef's kiss.

davepacheco

Thank you!

davepacheco · 2023-04-14T23:15:54Z

nexus/src/app/background/common.rs

+//!
+//! * providing a way for Nexus at-large to activate your task
+//! * activating your task periodically
+//! * ensuring that the task is activated only once at a time


Good call. Clarified in ba037e0.

davepacheco · 2023-04-14T23:16:19Z

nexus/src/app/background/common.rs

+        wait_until_count(rx1.clone(), 4).await;
+        assert_eq!(*rx1.borrow(), 4);


I think that's right. I loosened this check in ba037e0. I don't think the other checks are affected because the one at L633 already accounts for this and the other checks use a task with a long period that we expect won't fire during the test.

nexus/src/app/background/common.rs

davepacheco mentioned this pull request Apr 10, 2023

add background tasks for propagating DNS #2753

Closed

Base automatically changed from services-with-ports to main April 11, 2023 15:55

davepacheco added 3 commits April 12, 2023 09:35

add background tasks for propagating DNS

765f990

WIP: compiler panic

b395959

(part 2 of 2) add test that Nexus propagates DNS on startup/rack init…

ee3c964

…ialization

davepacheco force-pushed the dap/dns-background branch from 35fcce4 to ee3c964 Compare April 12, 2023 16:39

davepacheco commented Apr 12, 2023

View reviewed changes

davepacheco added 4 commits April 12, 2023 09:48

Merge remote-tracking branch 'origin/main' into dap/dns-background

7791a83

driver tests

7fc0f3a

lots more tests

aeadf57

Merge branch 'main' into dap/dns-background

92b5e75

davepacheco requested review from bnaecker and smklein April 14, 2023 00:14

davepacheco marked this pull request as ready for review April 14, 2023 00:15

smklein self-assigned this Apr 14, 2023

smklein approved these changes Apr 14, 2023

View reviewed changes

review feedback

ba037e0

davepacheco commented Apr 14, 2023

View reviewed changes

Merge branch 'main' into dap/dns-background

961861d

davepacheco enabled auto-merge (squash) April 14, 2023 23:34

davepacheco merged commit a378cda into main Apr 15, 2023

davepacheco deleted the dap/dns-background branch April 15, 2023 00:41

leftwo mentioned this pull request Aug 21, 2023

Need API and support in Omicron to replace a crucible downstairs. #3886

Closed

		wait_until_count(rx1.clone(), 4).await;
		assert_eq!(*rx1.borrow(), 4);

add background tasks for propagating DNS #2797

add background tasks for propagating DNS #2797

Uh oh!

Conversation

davepacheco commented Apr 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

davepacheco commented Apr 12, 2023

Uh oh!

davepacheco Apr 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davepacheco commented Apr 14, 2023

Uh oh!

smklein left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davepacheco left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

davepacheco commented Apr 10, 2023 •

edited

Loading

davepacheco Apr 12, 2023 •

edited

Loading