Skip to content

Update Client: Adds Workflow-Reset API #687

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

ustulation
Copy link

@ustulation ustulation commented Nov 8, 2024

What was changed

The client did didn't have a reset workflow api exposed, although the implementation modules have the necessary bits. This PR add the reset API.

Why?

This is just like any other missing api (eg. If terminate() was missing). Without this I need to create boilerplate to reset a workflow in my python codebase. However since that is external to this library it doesn't get the exact same treatment. For eg. the client-impl in this library is wrapped in a series of interceptors if interceptors were provided. But an external code doesn't benefit from any of this, unless we write even more boilerplate.

Checklist

  1. Closes

  2. How was this tested:
    This is the code similar to what I implemented in my application and that works.

@ustulation ustulation requested a review from a team as a code owner November 8, 2024 17:56
@CLAassistant
Copy link

CLAassistant commented Nov 8, 2024

CLA assistant check
All committers have signed the CLA.

@cretz
Copy link
Member

cretz commented Nov 12, 2024

We intentionally do not expose reset as a high-level client call in any SDK as it is considered an advanced/operator call and not one expected to be invoked regularly (well we kinda expose in Go but make you use the proto objects there too). There are several other calls we don't expose high-level client calls for either (e.g. delete). At the least we'd need to design a high-level form across all SDKs and not specific to Python. But like delete and others, we do not want this seen as a regular client call, so we encourage use of the advanced/direct gRPC API if needing to do this advanced thing.

Out of curiosity, can you give more details on your programmatic reset use case?

@ustulation
Copy link
Author

@cretz cheers for looking into this!

Let me start with this:

Out of curiosity, can you give more details on your programmatic reset use case?

While I cannot give you the details due to NDA, I'm just going to cook one example up (so pls imagine this was a serious example instead of a contrived one it actually is). The main idea is that we need a workflow which can be reset to prior point by the end user. The end user doesn't know this is temporal behind the scenes and so on. We provide an overlay DSL and graphs with lineages so the user actually sees that activity-0 is the parent of activity-1, which is the parent of activity-2 and 3 which are thus siblings (executed in parallel) and they converge into activity-4 and so on. In normal case they will run "uninterrupted". However more often than not, activity-1 (for eg) had fetched data which was wrong but which the user had told us was right. Only when a much later activity-8 was being executed was this caught. The user was told that we are unable to progress even after retries and are stuck on activity-8. So they check and conclude that their external data was wrong and they have now corrected it. They now want to rewind all the way back to activity-1 and start from exactly there again. Work done by all the activities prior to activity-1 is still valid thus all the states (inputs and outputs) leading up to activity-1 are valid. Just that activity-1 needs to re-fetch updated data which will thus affect its output and thus all the downstream activities, invalidating them.

Like imagine a web-form where each page is backed by a temporal activity and subsequent page can only be filled once the current page is submitted because it depends on some data from the prev page. However the user is free to go back to any page and edit and redo things from there (ofc losing all the work from the point of rewind).

Reset achieved this perfectly. It preserves all the state up to the point of reset (no activities are actually executed), discards the rest of it (well it terminates the original workflow which is also what we are happy with) and resumes from the point of reset as if it were the first time the workflow is running them.


We intentionally do not expose reset as a high-level client call in any SDK as it is considered an advanced/operator call [...]

I don't see what is "advanced" about it, can you please explain the perils of exposing it? The way I see it is, I can very easily do this via the UI, just like I can very easily terminate or cancel via UI too. If I can trivially do it via the UI I want to also be able to trivially do it via the API. IMO, the decision of whether one presses the button (or equivalently invokes the functionality through the sdk) should lie with the button-clicker/sdk-invoker.

If there are critical drawbacks to using it, it should a part of documentation, just like I'm sure it would have been a part of the UI design via popup-warnings with notes about what you should be aware of before proceeding etc. (there isn't any).

so we encourage use of the advanced/direct gRPC API if needing to do this advanced thing

That would be annoying UX for devs. I hope you reconsider. But ofc my points are based on (1) my lack of understanding of what is so "advanced" about this (2) critical drawback in using this feature (which otherwise has been working perfectly for me for nearly a week now and clients are happy). So if you answer those then I might educate myself.

There are several other calls we don't expose high-level client calls for either (e.g. delete).

I didn't even know delete was a thing. Is it there in the UI? If no then it's hidden anyway. IMO sdk-clients is pretty solid and a joy to use. It should handhold and protect users via documentation, not by deliberate omission of otherwise accessible features.

@cretz
Copy link
Member

cretz commented Nov 13, 2024

I don't see what is "advanced" about it, can you please explain the perils of exposing it?

Reset is meant for workflow-stuck situations and to be manually/explicitly performed by an operator on individual workflows. If you have a user need like you have, you should consider having this in your workflow logic. Maybe that's a loop that is re-iterated based on some user signal. It is clearer/safer in code to model the ability for users to restart some activities than to rely on the blunt reset instrument.

Like imagine a web-form where each page is backed by a temporal activity and subsequent page can only be filled once the current page is submitted because it depends on some data from the prev page. However the user is free to go back to any page and edit and redo things from there (ofc losing all the work from the point of rewind).

This would be much better modeled in workflow logic than using reset to go back. You can use a simple state machine and/or keep track of the steps completed and discard ones if you want to jump back or whatever. Can share/discard state however you might. Think of how you might write this in Python without Temporal from a Python logic approach.

@ustulation
Copy link
Author

You can use a simple state machine and/or keep track of the steps completed and discard ones if you want to jump back or whatever. Can share/discard state however you might. Think of how you might write this in Python without Temporal from a Python logic approach.

But isn't that what temporal is doing for me out of the box when I use the Reset feature? If I wasn't using temporal then I would be doing tons more of state management anyway and yes would have no choice but to state-manage this too. But I am using temporal and it does this bit of state-management for me OOB, so why can't I leverage it?

I can understand if you didn't intend this to be a use case while designing the feature but I still don't see the:

It is clearer/safer in code to model the ability for users to restart some activities than to rely on the blunt reset instrument.

Which is where my confusion is. You are asserting a few things (and I acknowledge it), I'm just asking you to evidence it. Eg. (1) Why cleaner? (2) Why safer? (3) What's so blunt about reset that relying on it is being frowned upon and I'm being asked to write potentially a lot more boilerplate in application logic?

@cretz
Copy link
Member

cretz commented Nov 14, 2024

  1. Why cleaner?

Because logic defined in a workflow is clearer to reason about than logic outside of a workflow. If you have code you need to re-execute for whatever reason, it is clearer for authors/readers of the code to see that logic is expected. It is easier to understand that code may be re-run if the code logic shows that it may be re-run. It also will be easier to reason about in history. You'll also be able to have more control over what re-runs, what state is retained, and other advanced things in code than you will with reset.

(2) Why safer?

There are limitations to workflow reset such as not working with child workflows, limitations with how workflow execution timeout works, etc. You also have to understand stable reset points which can be hard to grasp.

(3) What's so blunt about reset that relying on it is being frowned upon and I'm being asked to write potentially a lot more boilerplate in application logic?

Both reasons stated above. Reset has limitations, is advanced because of those limitations (and understanding its use), and it is clearer to reason about workflows whose code represents the code paths that will be taken.

@ustulation
Copy link
Author

Cheers! Yeah the thing is we don't hardcode a workflow. It's a DSL on top of which folks (which is also us) can cobble up a workflow. As with many DSLs, it will have limitations in expressibility compared to writing every workflow out in a general purpose programming language. Can we change it in such a way as to accommodate internal resets? Definitely. However it's a bit nuanced in that we need to suppress graph arrows (otherwise every node will have a connection to every other node etc) and remember what the input to the node being reset to was so some extra state management needs to be done. Again all this is ofc possible, however I still find vending out reset points and leveraging what temporal already does to be simpler.

Yeah I agree that finding reset points might be a bit surprising at first since there is no API that directly takes in an activity and resets to that (or something similar). What I do right now is take the activity name to reset to from the users (they just click on the right "box" in the displayed DSL graph in UI), get the event history, find the immediately preceding workflow-task-completed and use that as the reset point. This works, specially if the point of reset in the graph was linear with no parallel activities (sibling activities). We don't have child workflows so we are not affected by that (yet).

I still feel that the shortcomings should be caveated but the API exposed for those who want to make use of it, even if for an interim solution. It's there in the UI after all, so just feels off to not be able to easily do it programmatically. In AWS for instance it's usually the opposite, everything that can be done in the console can be done via sdk, but vice versa is definitely not true for so many services.

@cretz
Copy link
Member

cretz commented Nov 25, 2024

It is made more easily available for operators in UI and CLI because we do want to make it easy for operators. We do offer it via direct workflow service form as reset_workflow_execution which you can expose for your use as easily as needed, but we are intentionally choosing not to encourage it as part of our high level programmatic API (same with a few other calls, e.g. delete), only the low level.

In AWS for instance it's usually the opposite, everything that can be done in the console can be done via sdk

This is also true in Temporal, including reset

@ustulation
Copy link
Author

We do offer it via direct workflow service form as reset_workflow_execution which you can expose for your use as easily as needed

@cretz Is this the same as this:

return await self._client.workflow_service.reset_workflow_execution(
            req, retry=True, metadata=input.rpc_metadata, timeout=input.rpc_timeout
        )

in this PR? Can you give me some pointers on how to correctly use it? This is my understanding:

  1. I get the handle to the workflow using the workflow-id , run-id etc with temporal_client.get_workflow_handle().
  2. After that I just use the handle since in many cases it seems to do a bunch of stuff. But in case of reset if my PR is on the right track, it merely delegates the call to self._client._impl.reset_workflow.
  3. Looking at the workflow-handle's __init__() code, doesn't seem like there's any extra initialisation magic, apart from just this bit self.__temporal_eagerly_started = False which doesn't come from the parameters to __init__. I don't know what that does, so I'm just assuming that doesn't really affect the reset call in particular.
  4. So from the code above, I could as well just call client.workflow_service.reset_workflow_execution() and it should do the trick.

However I'm a little unsure about the Outbound interceptors. The actual calls are something like client._impl.<function-name> and though they all eventually converge to client.workflow_service.<equivalent-function> they go through that interceptor indirection.

This is the part that I'm a bit unsure about. As a library-caller, am I going to miss something subtle if I rely on client.workflow_service.<function> vs things going via WorkflowHandle and thus via client._impl internally which has a series of outbound interceptors applied to it? Or is there a way I can access this implementation detail from outside (assuming the convention of leading underscore in _impl means that I shouldn't be relying on it as an outsider)?

@cretz
Copy link
Member

cretz commented Nov 25, 2024

Can you give me some pointers on how to correctly use it?

Yes, similar to what is in this PR, just called from the outside, e.g.

await my_client.workflow_service.reset_workflow_execution(
    temporalio.api.workflowservice.v1.ResetWorkflowExecutionRequest(
        namespace=my_client.namespace,
        workflow_execution=temporalio.api.common.v1.WorkflowExecution(
            workflow_id="my-wf-id"
            run_id="my-optional-run-id"
        ),
        reason="my-optional-reason",
        workflow_task_finish_event_id="my-event-id",
        request_id=str(uuid.uuid4()),
        reset_reapply_exclude_types=my_exclude_types,
    )
)

A workflow handle is not involved since that is effectively just a high-level workflow/run ID holder.

As a library-caller, am I going to miss something subtle if I rely on client.workflow_service. vs things going via WorkflowHandle and thus via client._impl internally which has a series of outbound interceptors applied to it?

Not really going to miss out on anything since this is basically what we call from the high-level client. Interceptors are for high-level calls only, which this is not.

Or is there a way I can access this implementation detail from outside (assuming the convention of leading underscore in _impl means that I shouldn't be relying on it as an outsider)?

Yes, we very intentionally make the workflow_service (and operator_service) available on the client (and make the service_client available so my_client.service_client.workflow_service works as well) so you have access to the full gRPC API.

@ustulation
Copy link
Author

Not really going to miss out on anything since this is basically what we call from the high-level client. Interceptors are for high-level calls only, which this is not.

Ah right, that was my only concern and why I raised this PR and also wrote that in the PR description above:

However since that is external to this library it doesn't get the exact same treatment. For eg. the client-impl in this library is wrapped in a series of interceptors if interceptors were provided. But an external code doesn't benefit from any of this.

But if that's not a problem then yeah I don't really care whether there's a high or a low level facility to call and get the job done.

Thank you very much for engaging and clarifying this! I suppose I should close this PR then (since you won't be exposing this via WorkflowHandle)? (Or you can close it too if you wish, to save a round trip).

Thanks once again!

@cretz
Copy link
Member

cretz commented Nov 25, 2024

No prob, happy to help. I can close.

@cretz cretz closed this Nov 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants