-
Notifications
You must be signed in to change notification settings - Fork 63
add background tasks for propagating DNS #2797
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Sorry, I'm going to have to rebase and force push this one because even though GitHub was showing the correct diff before, now it's showing the wrong thing again. |
35fcce4 to
ee3c964
Compare
| ); | ||
| // Begin starting Nexus. | ||
| let (nexus_internal, nexus_internal_addr) = | ||
| N::start_internal(&config, &logctx.log).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to split start_and_populate() into two steps here. Previously, this would start Nexus, do rack initialization, and wait for that to finish. But rack initialization for the tests now expects the configuration for the DNS servers. That's not known until we start the DNS servers. That happens when we start the simulated sled agent. That requires knowing Nexus's internal IP...which we can't know until we start it. There was a circular dependency here. So I split this into two steps: one starts Nexus so we can grab its internal API and start the simulated sled agent, and the next step does rack initialization and finishes starting Nexus.
This is also the reason for the changes to InternalServer above. I tried to keep it borrowing the Config, but ran into a rustc panic (to-be-filed), so I just had it copy the config so that it's now 'static.
|
I think this is finally ready for review! It looks big but I think probably half of the code are tests and another good chunk are docs. |
smklein
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really impressed by both:
- How complex of a problem this is, and
- How elegant of a solution this is
The API is clear, the activation is straightforward, and I can see how the dependency pattern lets us avoid re-doing unnecessary work rather efficiently, by caching inputs through the watchers, and only activating "the tasks that need activating".
Thanks for taking this on! I'm excited to use it.
nexus/src/app/background/common.rs
Outdated
| //! | ||
| //! * providing a way for Nexus at-large to activate your task | ||
| //! * activating your task periodically | ||
| //! * ensuring that the task is activated only once at a time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, that's "only once at a time" per Nexus, right? So if multiple Nexus instances exist, they'll all do the task concurrently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed in chat: this is intentional, but IMO we should clarify this bit -- callers should not expect mutual exclusivity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. Clarified in ba037e0.
| fn activate<'a, 'b, 'c>( | ||
| &'a mut self, | ||
| opctx: &'b OpContext, | ||
| ) -> BoxFuture<'c, serde_json::Value> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the rationale behind letting background tasks emit output data? Just debugging?
Given that tasks can trigger other tasks, I wanted to confirm there's no "feed-forward" data, since it seems like that would violate the intended "no arguments" property for these tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, after reading through the rest of the code, I ironically have found the opposite to be true.
It seems like a really common pattern is:
- Tasks exist in-memory, and may be executed periodically
- Tasks often have dependencies
- Tasks often communicate using the tokio watch primitive, as a way to grab "whatever was the last value we saw from the prior task" to avoid repeating unnecessary work
I wonder if we want to make this output value a bit more first-class, and have tasks emit values that can be watched by downstream consumers?
I don't actually think we should implement this now, but it's food for thought as we get more consumers of the API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good points. It's pretty subtle trying to distinguish between "inputs that represent hints/shortcuts about what we're supposed to do" (which we're trying to avoid) and "inputs that represent the full canonical state, which just happens to have come from another task instead of the database [just because it's cleaner to factor things that way]".
nexus/src/app/background/common.rs
Outdated
| wait_until_count(rx1.clone(), 4).await; | ||
| assert_eq!(*rx1.borrow(), 4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value is still counting here -- couldn't it bump up to 5 between these lines?
(I guess it's unlikely we context switch away for 100ms here...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's right. I loosened this check in ba037e0. I don't think the other checks are affected because the one at L633 already accounts for this and the other checks use a task with a long period that we expect won't fire during the test.
| assert!(result[1].is_ok()); | ||
| s1.verify_and_clear(); | ||
| s2.verify_and_clear(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(No action item) I'm finding these tests exceptionally readable - the combination of the API you made + httptest is chef's kiss.
davepacheco
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
nexus/src/app/background/common.rs
Outdated
| //! | ||
| //! * providing a way for Nexus at-large to activate your task | ||
| //! * activating your task periodically | ||
| //! * ensuring that the task is activated only once at a time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. Clarified in ba037e0.
nexus/src/app/background/common.rs
Outdated
| wait_until_count(rx1.clone(), 4).await; | ||
| assert_eq!(*rx1.borrow(), 4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's right. I loosened this check in ba037e0. I don't think the other checks are affected because the one at L633 already accounts for this and the other checks use a task with a long period that we expect won't fire during the test.
(re-do of #2753, this version depends on #2796)
This change adds DNS propagation to Nexus, by adding:
See also: RFD 367, RFD 373
Still to-do on this: