Skip to content

chore(bridge): Refactor TS-Core bridge #1698

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 5, 2025

Conversation

mjameswh
Copy link
Contributor

@mjameswh mjameswh commented Apr 29, 2025

What changed

This is a major refactor of the Core Bridge layer. It strives to reduce complexity, and increase reliability and maintainability of the TS SDK bridge layer, as well as realign it with how more recent Core-based SDKs have been designed. More importantly, this refactor is a stepping stone to increase velocity of further development on the TS SDK, introducing the structure and abstractions required to efficiently implement several upcoming SDK features.

Here are some of the most significant changes introduced by this refactor:

  • Bridge APIs (i.e. those define in @temporalio/core-bridge) now adhere to much stricter rules.

    • In short, types defined by the bridge are now designed as Data Transfer Objects.

    • Bridge types now align on Core SDK's APIs, rather than corresponding user-facing APIs). That moves concerns such as data conversion, validation, default value handling, type restructuring, etc, to the TS side, rather than relying on unclear and inconsistent expectations between TS and Rust.

    • All properties on bridge APIs are now non-optional (e.g. can't define a property as someProp?: someType); null (rather than undefined) is used to indicate a value that is intentionally left unset (i.e. the equivalent of None in Rust).

    • These new rules ensure that most incoherencies in type definitions between TS and Rust will be caught early, avoiding some recurrent error patterns we have had previously in TS SDK. Several such bugs were in fact uncovered and fixed while working on this PR.

    • These rules also make the Rust code much simpler, moving complexity back to the TS side, where we have better context and better ability to implement more complex business logic and report back errors. That makes the bridge code more approachable by TS-first developpers, easier to maintain and keep in sync with evolution of the Core SDK, and much more reviewable by our engineers.

  • Most Rust side APIs (including entrypoint functions and most DTO type definitions) are now defined using macros, which avoid explicit handling of type conversion. That previously resulted in extremely verbose code, drowning actual business logic in data access and conversion plumbing.

    • Bridge-level API definitions (i.e. function signatures and DTOs) align almost one-to-one between the TS side and Rust side, allowing side-by-side review. Macros automatically provide required type conversions in and out, including basic type safety checks and error reporting where appropriate.
  • Redresses and documents ownership rules for Rust-to-JS objects. The "Opaque Handle" abstraction introduced by this PR provides a clear yet simple to use ownership model for Runtime, Client, Worker, Ephemeral Servers and other native resources that are handed down to the JS world. This model notably guarantees proper clean up of resources, even when they are mistakenly dropped and garbage collected from the JS side (e.g. omitting to call close or shutdown on a resource), or dropped on the Rust side due to unexpected errors.

  • Bridge APIs now schedule async tasks directly to the Tokio runtime. The bridge itself no longer need event dispatching threads. This considerably simplify the effort requried to add new features to the SDK, and makes the code more approachable for new developers.

    • API calls into Rust now return actual JS Promise objects, rather than expecting TS side use of promissify and callback functions. Those Promise are backed by Rust Future.

      This is a huge benefit from a reliability perspective, as we can now guarantee that a Promise will be properly rejected if it is dropped by the Rust without having been settled. This could previously result in various unexpected behaviors including process hangs and suddent process termination. Fixes [Bug] Node discards promises created by calling into Core Bridge with callback #1302.

    • Thanks to the BridgeResult, BridgeError and TryIntoJs types, async implementation of bridge APIs can directly return appropriate values or errors, without the need for implementer to explicitly handle conversion back into JS values (which can only be done on the Node thread, making this kind of pattern quite messy and verbose to implement).

    • BridgeResult and BridgeError also ensure proper handling of Neon's Throw errors, indicating that the Node thread is currently in "throw state". This particular state makes it difficult to properly handle errors in Rust/Node code, as any attempt to use the JS Context once it has fall into "throw state" will result in a panic. BridgeResult also restore the functionality of the Rust's ? suffix operator, by transparently encapsulating Neon's Throw into BridgeError, and back into Throw once the BridgeResult surfaces back in place from where it can be handled.

    • Introduced other well-defined abstractions for some recurring needs, such as sync and async callback functions into JS, and AbortController. Those were previously reimplemented at each call-site.

  • Bridge APIs are now clearly marked as internal only.

    • Made it very clear that any API defined in the @temporalio/core-bridge package is internal APIs, and offers absolutely no stability guarantee. APIs meant for public usage must be defined and exposed by user facing packages exclusively (i.e. @temporalio/worker and @temporalio/testing). Reexporting bridge APIs from public facing APIs is also not allowed.

    • Public APIs that were previously exported by @temporalio/core-bridge have all been moved to user facing packages. Documentation has been added where missing on public types. Bridge types no longer have typedoc themselves as that would be redundant and unhelpful.

  • Some other note worthy points:

    • Logs forwarding from Rust to JS is now implemented using a push model rather than pull model. All other Core-based SDKs were already using push-based log forwarding.

    • The runtime now ensure proper cleanup of the Otel metrics exporter; this could previously cause tons of error messages on runtime shutdown. Fixes [Bug] Reenable OTLP export tests #1495.

    • Adds support of custom metric bucket sizes. Fixes [Feature Request] Custom metric bucket sizes #1634.

@mjameswh mjameswh requested a review from a team as a code owner April 29, 2025 18:49
@@ -0,0 +1,418 @@
# Bridge Layer Coding Norms, Best Practices and Pertinent Knowledge
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIP - I'm leaving this doc here even though it is incomplete (I'll complete at a later time, post-merge). I think it still adds some value in its present form. Feel free to read and comment if curious, but don't feel obliged.

Copy link
Member

@Sushisource Sushisource left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looking good. This is absolutely massive so I couldn't give it super detailed scrutiny but tried to focus on the important looking bits.


////////////////////////////////////////////////////////////////////////////////////////////////////

mod config {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these mods should just be put into their own files

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, been hesitant, but I think it helps having that in the same file the feature they relate too. If anything, I would move them out of their ::config modules, into their respective parents. I'll leave it that way for now, that's something that can easily be tuned later.

Comment on lines 174 to 175
enter_sync!(self);
Ok(BridgeFuture::new(Box::pin(future)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would work better with .instrument()? I'm surprised this works since the subscriber guard would be dropped immediately after this function exits

Copy link
Contributor Author

@mjameswh mjameswh May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The future will execute on one of Tokio’s thread, which are bound with specific tracing contexts immediately after being created.

AFAICT the node thread would not need to access the tracing subscriber once the implementation of the “js_function” returns.

Keeping the guard in scope ensure that the tracing subscriber doesn’t bleed into calls made by other runtimes on the same thread, which has been causing issues in a previous attempt.

Comment on lines +23 to +27
* - Always use `null` to indicate an intentionally unspecified optional value
* in TypeScript interfaces. This will be converted to `None` on the Rust side.
* - Explicitly set _every_ properties on objects sent to the native code,
* including optional properties (e.g. `{ prop: input.prop ?? null }`).
* - Never use the "optional property" syntax in TypeScript (i.e. `prop?: T`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

@SymbolBasedInstanceOfError('ServiceError')
export class ServiceError extends Error implements grpc.ServiceError {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would ideally merge TransportError and ServiceError, they're different representations of the tonic error.

You can remove this new error type and add the new properties to TransportError.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was too lazy to do this as part of my PR but with your from and into helpers, it should be fairly straightforward.

*/
async withAbortSignal<ReturnType>(abortSignal: AbortSignal, fn: () => Promise<ReturnType>): Promise<ReturnType> {
const cc = this.callContextStorage.getStore();
return await this.callContextStorage.run({ ...cc, abortSignal }, fn);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that this implementation is broken. if there's already an abortSignal on the context, it should be chained with the newly provided one. The implementation is broken in the client package too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, we completely missed that indeed. And probably most devs using abort signals did too.

I'll note for the moment, but won't change the semantic in this PR.

@mjameswh mjameswh force-pushed the bridge-refactor branch from 6310f1e to 0cf49fa Compare May 5, 2025 07:31
@mjameswh mjameswh merged commit 94aad7d into temporalio:main May 5, 2025
23 checks passed
@mjameswh mjameswh deleted the bridge-refactor branch May 5, 2025 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants