Skip to content

Async Workers #446

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Jun 8, 2025
Merged

Async Workers #446

merged 20 commits into from
Jun 8, 2025

Conversation

mxgrey
Copy link
Collaborator

@mxgrey mxgrey commented Dec 12, 2024

This PR is the final culmination of all the async execution features that I currently have plans to implement. This builds on top of #421 so there is no need to review this PR until that one is merged, but I'm opening this PR to help make it visible to anyone who wants to try out the full breadth of async features that I've been working on.

The key role of this PR is to introduce the worker concept that was initially proposed in these slides. The API of this implementation has some differences from the initial proposal, but the concept is exactly the same.

Since this is just a draft preview PR, in lieu of a detailed explanation, I'll direct folks to this slide deck which explains the motivation behind the Worker concept.

@romainreignier
Copy link
Contributor

Hi, these changes looks amazing!
I am struggling to use the current API in a real world node in which I need to wait for several topics to trigger other actions like async client calls or timer timeout. It is very different than how I used to do it in C++.
From the slides, it seems that the Workers would help me organize better the Rust code.
What is the current roadmap to get Workers merged?

@mxgrey
Copy link
Collaborator Author

mxgrey commented Mar 10, 2025

I am struggling to use the current API

Just to make sure I'm understanding correctly, when you say "the current API" you mean the API that's currently on the main branch, right?

If there are any potential changes to the API of this PR that can further help with your use case, I'm happy to iterate on it.

As for roadmap / timeline, there will be a working group meeting in about two hours, and we'll probably be talking about next steps there.

I think in general if anyone (including interested users, rather than maintainers) is able to leave reviews on #429 and #430 that may expedite the progress towards getting them merged.

@romainreignier
Copy link
Contributor

Thanks for your answer.

you mean the API that's currently on the main branch, right?

Yes, I was referencing the main branch.

I might try to use this PR on my node to see if the API helps. For now, I was using the branch with the timers because I often use timers in ROS nodes.

mxgrey and others added 15 commits May 12, 2025 12:02
Signed-off-by: Michael X. Grey <greyxmike@gmail.com>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
@mxgrey mxgrey marked this pull request as ready for review May 14, 2025 15:17
@mxgrey mxgrey mentioned this pull request May 14, 2025
@mxgrey
Copy link
Collaborator Author

mxgrey commented May 14, 2025

@esteve This PR should be properly ready for review now.

To clean up the version history, I've squashed the original set of commits and rebased them onto main.

I've also added a considerable amount of documentation throughout the modules, including usage examples and doctests. The root module now has a quick start guide with simple examples of basic usage, parameters, and workers (mostly lifted from the examples directory).

Timers have been moved into #480 . @romainreignier If you were relying on this branch for timers, you can retarget to the branch of #480.

mxgrey added 3 commits May 14, 2025 15:36
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
mxgrey added 2 commits May 14, 2025 16:26
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
Signed-off-by: Michael X. Grey <mxgrey@intrinsic.ai>
@mxgrey mxgrey mentioned this pull request May 15, 2025
Copy link
Collaborator

@maspe36 maspe36 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I've given this a look and without becoming an async expert myself I feel as good as I can about the changes. I think its high time we move forward here and evolve rclrs.

///
/// Consider using [`Self::notify_on_service_ready`] if you want to wait
/// until a service for this client is ready.
pub fn service_is_ready(&self) -> Result<bool, RclrsError> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this function was moved (and likely, outside the scope of this PR), but having a bool as a success type feels weird. Having two ways for a function to fail is a smell to me. What if we only returned a bool but we map all RclrsErrors we see into bools?

Glancing at the docs and the impl it seems like rclrs's design should prevent this from realistically happening unless something in the rmw layer returns something unexpected

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, and more generally I've been disappointed by how many places the rclrs API tosses Results at users. I think it will give users the impression that our API is more brittle than it really is.

What I'd like to propose is that for any function with the following criteria:

  • There is no practical risk of an error happening that would be caused by user input
  • There is a reasonable fallback behavior we can follow in the unlikely event of an error
    • "Reasonable" may include panicking for situations that indicate an unrecoverable rmw or rcl error, e.g. failure to create a subscription/publisher

we provide two API functions, one that starts with try_ and one without. The one with try_ will return a Result as our current API does, and the one without will use the fallback behavior in the event of an error.

In this case it would look like

pub fn service_is_ready(&self) -> bool {
    match self.try_service_is_ready() {
        Ok(ready) => return ready,
        Err(err) => {
            log_error!(
                "rclrs.client.service_is_ready",
                "An error occurred: {err}
            );
            return false;
        }
    };
}
pub fn try_service_is_ready(&self) -> Result<bool, RclrsError> { ... }

We can imagine doing likewise for create_subscription / try_create_subscription and the other primitive creation methods.

For this PR I'll just introduce this concept for service_is_ready, but I'll start putting together a proposal on this for the next working group meeting, and we can consider introducing this change in a follow-up PR.

// as the Node that started it is alive.
pub(super) async fn node_graph_task(
mut receiver: UnboundedReceiver<NodeGraphAction>,
#[allow(unused)] guard_condition: GuardCondition,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my understanding, why do this instead of _: GuardCondition?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly a matter of opinion, but I think it's better to explicitly mark a field as being something is meant to go unused to communicate very clearly that this is not a mistake or caused by a lazy refactor.

In some cases the reason it goes unused may be obvious, like if a method in a trait definition has some extra arguments you don't need, then it's obvious to a reviewer that you can't just remove the arguments as that would cause a compilation error. In those cases I don't mind using the leading underscore trick.

But this case that you're commenting on is more subtle. The code would compile if you removed the guard_condition argument, but it would stop working correctly because the lifecycle of the guard_condition argument has some side-effects. That's why I want to make sure readers receive an extremely clear signal that the argument is there on purpose despite it going unused.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do any of these lifecycle/guard_condition arguments leak to consumers of rclrs? Or are they all internal to our API?

/// If you need high-throughput multi-threaded execution, then consider using
/// a different executor.
//
// TODO(@mxgrey): Implement a multi-threaded executor using tokio in a downstream
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you feel about different executors that required different deps being behind some cargo features?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment that I left here is outdated. These days I am more inclined to do what you're suggesting: Put a tokio-based executor behind a feature flag. I think either would work, but the feature flags make some aspects of maintenance easier.

The most important thing is making sure that thirdparty executors can be made outside of the rclrs crate, which I believe they currently can.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I'm also interested in writing a real-time executor at some point, so glad to see the foresight here :)

#[allow(unused)]
node: Node,
lifecycle: WaitableLifecycle,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm getting lost in the sauce on this large PR, but why is this still unused?

Maybe a missing comment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these lifecycle fields do nothing but carry a side-effect to remove this primitive from its waitset when the user drops all references to the primitive.

The docs for WaitableLifecycle mention this, but I can add a comment directly onto each of these fields to reinforce it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have comments in all of the other call-sites describing it
https://github.com/ros2-rust/ros2_rust/pull/446/files/8f898601365c7ac8166393a9a25d35130144e240#diff-1d9414c53c73d7dac3b591eea05198c58c6099cdc60255148c77332c1dc8e6f8R29

Could be worth the maintenance cost considering how subtle it really is.

@@ -95,10 +95,10 @@ where
/// Creates a new `Publisher`.
///
/// Node and namespace changes are always applied _before_ topic remapping.
pub(crate) fn new<'a>(
node_handle: Arc<NodeHandle>,
pub(crate) fn create<'a>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Why the name change from new to create? Also, since this can fail, should we add a try_* prefix?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very opinionated thing on my part, but I think many Rust users will be accustomed to both:

  • fn new being the public-facing method for creating an instance of an object
  • fn new usually returns a Self, not usually a Arc<Self> much less a Result<Arc<Self>, _>

I don't have any strong objection to using the name new here, but the above two points were my rationale for changing it.

I wouldn't mind adding a try_ prefix. Coincidentally that would go well with this comment that I wrote before reading your suggestion here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to leave the function name as create, actually. I like your rationale — it's similar to what we do at work with our "rusty C++" :D

A quick search shows that Tokio does this in a few places too:
https://docs.rs/tokio/latest/tokio/?search=create

Though they're not entirely consistent:
https://docs.rs/tokio/latest/tokio/io/unix/struct.AsyncFd.html#method.try_new

@maspe36
Copy link
Collaborator

maspe36 commented May 31, 2025

If there are no additional comments from the other maintainers this time next week I will merge this PR

@maspe36 maspe36 requested review from esteve and jhdcs May 31, 2025 15:24
@maspe36 maspe36 merged commit d62583d into ros2-rust:main Jun 8, 2025
4 checks passed
@maspe36
Copy link
Collaborator

maspe36 commented Jun 8, 2025

Hmmm I forgot about the new windows CI. We are currently failing that now. I'm on vacation for the week as well. Sorry about that folks.

@mxgrey
Copy link
Collaborator Author

mxgrey commented Jun 9, 2025

I'm not sure how the current failure in the Windows CI would be caused by this PR 🤔

Screenshot from 2025-06-09 14-36-23

@knmcguire
Copy link
Contributor

Weirdly, the repo checkout is being done in a different dir than before... It used to be D:/ and now it's C:/ ...

I'll research it and do an PR to fix this. It's weird though.

Also we can consider to remove the execution of this action on every PR and push and only let it check it every week. I think that should be sufficient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants