Skip to content

Conversation

@davepacheco
Copy link
Collaborator

@davepacheco davepacheco commented Apr 10, 2023

(re-do of #2753, this version depends on #2796)

This change adds DNS propagation to Nexus, by adding:

  • a small subsystem in Nexus for running background tasks, which are activated on-demand and periodically
  • a background task for monitoring the DNS configuration in CockroachDB
  • a background task for monitoring the list of DNS servers in CockroachDB
  • a background task for propagating the DNS configuration to the list of DNS servers

See also: RFD 367, RFD 373

Still to-do on this:

  • add a bunch more tests

Base automatically changed from services-with-ports to main April 11, 2023 15:55
@davepacheco
Copy link
Collaborator Author

Sorry, I'm going to have to rebase and force push this one because even though GitHub was showing the correct diff before, now it's showing the wrong thing again.

);
// Begin starting Nexus.
let (nexus_internal, nexus_internal_addr) =
N::start_internal(&config, &logctx.log).await;
Copy link
Collaborator Author

@davepacheco davepacheco Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to split start_and_populate() into two steps here. Previously, this would start Nexus, do rack initialization, and wait for that to finish. But rack initialization for the tests now expects the configuration for the DNS servers. That's not known until we start the DNS servers. That happens when we start the simulated sled agent. That requires knowing Nexus's internal IP...which we can't know until we start it. There was a circular dependency here. So I split this into two steps: one starts Nexus so we can grab its internal API and start the simulated sled agent, and the next step does rack initialization and finishes starting Nexus.

This is also the reason for the changes to InternalServer above. I tried to keep it borrowing the Config, but ran into a rustc panic (to-be-filed), so I just had it copy the config so that it's now 'static.

@davepacheco davepacheco requested review from bnaecker and smklein April 14, 2023 00:14
@davepacheco
Copy link
Collaborator Author

I think this is finally ready for review! It looks big but I think probably half of the code are tests and another good chunk are docs.

@davepacheco davepacheco marked this pull request as ready for review April 14, 2023 00:15
@smklein smklein self-assigned this Apr 14, 2023
Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really impressed by both:

  1. How complex of a problem this is, and
  2. How elegant of a solution this is

The API is clear, the activation is straightforward, and I can see how the dependency pattern lets us avoid re-doing unnecessary work rather efficiently, by caching inputs through the watchers, and only activating "the tasks that need activating".

Thanks for taking this on! I'm excited to use it.

//!
//! * providing a way for Nexus at-large to activate your task
//! * activating your task periodically
//! * ensuring that the task is activated only once at a time
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, that's "only once at a time" per Nexus, right? So if multiple Nexus instances exist, they'll all do the task concurrently?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in chat: this is intentional, but IMO we should clarify this bit -- callers should not expect mutual exclusivity.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Clarified in ba037e0.

fn activate<'a, 'b, 'c>(
&'a mut self,
opctx: &'b OpContext,
) -> BoxFuture<'c, serde_json::Value>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale behind letting background tasks emit output data? Just debugging?

Given that tasks can trigger other tasks, I wanted to confirm there's no "feed-forward" data, since it seems like that would violate the intended "no arguments" property for these tasks.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, after reading through the rest of the code, I ironically have found the opposite to be true.

It seems like a really common pattern is:

  • Tasks exist in-memory, and may be executed periodically
  • Tasks often have dependencies
  • Tasks often communicate using the tokio watch primitive, as a way to grab "whatever was the last value we saw from the prior task" to avoid repeating unnecessary work

I wonder if we want to make this output value a bit more first-class, and have tasks emit values that can be watched by downstream consumers?

I don't actually think we should implement this now, but it's food for thought as we get more consumers of the API

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good points. It's pretty subtle trying to distinguish between "inputs that represent hints/shortcuts about what we're supposed to do" (which we're trying to avoid) and "inputs that represent the full canonical state, which just happens to have come from another task instead of the database [just because it's cleaner to factor things that way]".

Comment on lines 620 to 621
wait_until_count(rx1.clone(), 4).await;
assert_eq!(*rx1.borrow(), 4);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value is still counting here -- couldn't it bump up to 5 between these lines?

(I guess it's unlikely we context switch away for 100ms here...)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's right. I loosened this check in ba037e0. I don't think the other checks are affected because the one at L633 already accounts for this and the other checks use a task with a long period that we expect won't fire during the test.

assert!(result[1].is_ok());
s1.verify_and_clear();
s2.verify_and_clear();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(No action item) I'm finding these tests exceptionally readable - the combination of the API you made + httptest is chef's kiss.

Copy link
Collaborator Author

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

//!
//! * providing a way for Nexus at-large to activate your task
//! * activating your task periodically
//! * ensuring that the task is activated only once at a time
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Clarified in ba037e0.

Comment on lines 620 to 621
wait_until_count(rx1.clone(), 4).await;
assert_eq!(*rx1.borrow(), 4);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's right. I loosened this check in ba037e0. I don't think the other checks are affected because the one at L633 already accounts for this and the other checks use a task with a long period that we expect won't fire during the test.

@davepacheco davepacheco enabled auto-merge (squash) April 14, 2023 23:34
@davepacheco davepacheco merged commit a378cda into main Apr 15, 2023
@davepacheco davepacheco deleted the dap/dns-background branch April 15, 2023 00:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants