Skip to content

Conversation

@cretz
Copy link
Member

@cretz cretz commented Sep 2, 2025

What was changed

Deadlock detector was tripping on 1000 child workflows at once (see #330). This was because our illegal call tracer can be expensive. We made the following improvements:

  • Narrowly scoped illegal call tracing to just the run_until_all_yielded code which allows us to save many calls to the TracePoint
  • Altered behavior of "disable call tracing" to do a simple flip of the enabled thread variable instead of TracePoint#disable which was showing issues of not being reentrant
  • Make sure disabling illegal call detector with empty map works (it mostly did, but we missed one part)
  • Refactored workflow_instance.rb to no longer delegate two methods deep to apply activation (helps stack traces and code legibility)
  • Fixed issue where 3.2 GC test was failing due to unpredictable GC

Checklist

  1. Hopefully addresses [Bug] Potential deadlock detected after upgrade to v0.6.0 #330, but not marking closed in case there is more to do

@cretz cretz requested a review from a team as a code owner September 2, 2025 20:17
@cretz cretz marked this pull request as draft September 2, 2025 21:05
@cretz
Copy link
Member Author

cretz commented Sep 2, 2025

Converting to draft while we investigate why test_confirm_garbage_collect is now failing intermittently with this change (i.e. that workflow instances remain around after a manual GC). EDIT: Fixed.

@cretz cretz force-pushed the limit-tracer-scope branch 3 times, most recently from b77f223 to 6ebf37e Compare September 3, 2025 15:27
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These utilities are here just to help display/debug GC issues if there is a failure in the GC test

@cretz cretz force-pushed the limit-tracer-scope branch from 6ebf37e to 846e5d5 Compare September 3, 2025 15:29
@cretz cretz marked this pull request as ready for review September 3, 2025 16:48
parent = {} # child_id -> parent_id
root_of = {} # obj_id -> root_category

roots.each do |category, objs|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't believe you had to write a graph search lol

Copy link
Member Author

@cretz cretz Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI did it, not me 😁. I'd never devote this kind of time to a test utility only invoked on failure, but I figured since I used it and confirmed it's a good utility, I might as well keep it around and use it on failure. I'd also never write such difficult-to-comprehend code, I'd have more human friendliness in lots of aspects.

end
end

def test_many_children
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this trigger the detector before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In slower machines like some CI runners, yes. In faster machines, COUNT needs to be 1000.

@cretz
Copy link
Member Author

cretz commented Sep 4, 2025

Ug, getting GC fail on 3.4 sometimes, and getting server-caused hanging on repeated WORKFLOW_TASK_FAILED_CAUSE_FORCE_CLOSE_COMMAND. I may have more work to do here...

@cretz cretz marked this pull request as draft September 5, 2025 13:03
@cretz cretz force-pushed the limit-tracer-scope branch 4 times, most recently from 177d7d5 to 20c59ba Compare September 5, 2025 19:51
@cretz cretz force-pushed the limit-tracer-scope branch from 20c59ba to 3660e90 Compare September 8, 2025 20:18
@cretz
Copy link
Member Author

cretz commented Sep 8, 2025

I will be handling GC issues separately in #334. I skipped the test and this PR is now good, will merge once CI done.

@cretz cretz marked this pull request as ready for review September 8, 2025 20:43
@cretz cretz merged commit d3391ec into temporalio:main Sep 8, 2025
8 checks passed
@cretz cretz deleted the limit-tracer-scope branch September 8, 2025 21:35
THardy98 added a commit that referenced this pull request Sep 10, 2025
* envconfig impl

* linting

* fix steep type checks

* formatting

* Refactor envconfig classes to use Data.define for immutability

- Refactor ClientConfigTLS, ClientConfigProfile, and ClientConfig to use Data.define

* Rename hash methods to use idiomatic Ruby naming

- Rename from_hash → from_h for all envconfig classes
- Rename to_hash → to_h for all envconfig classes

* Replace TLS config hash with Connection::TLSOptions object

- Rename `to_connect_tls_config` → `to_tls_options`
- Return `Connection::TLSOptions` object instead of plain Hash

* Refactor to use inline pattern instead of temporary variables

* Refactor to_client_connect_options to return tuple for one-liner usage

Changed `to_client_connect_options` to return `[positional_args, keyword_args]`
tuple instead of a hash, enabling clean one-liner client connections:

```ruby
args, kwargs = profile.to_client_connect_options
client = Temporalio::Client.connect(*args, **kwargs)
```

* Make hash methods private

* Remove File.exist? check by using type system for path vs data distinction

- TOML paths (*_path fields) become Pathname objects
- TOML data (*_data fields) remain String objects

* Use symbol keys instead of string keys in Rust FFI bridge

Changed all hash keys from strings to symbols in the Rust-to-Ruby bridge

* Bump golang.org/x/net in /temporalio/test/golangworker (#303)

* Bump tracing-subscriber from 0.3.19 to 0.3.20 in /temporalio (#331)

* Reduce the scope of the illegal call tracer and other tracing fixes (#332)

* Remove unnecessary fully qualified import names

* Remove envconfig import in temporalio.rb

* _source_to_path_and_data prefixed with underscore and private doc visibility

* TLS disabled field now optional

* Rubocop fixes

* Steep fixes

* Renamed envconfig.rb -> env_config.rb
Renamed to_tls_options -> to_client_tls_options

* Pass data string as RString, Rust bridge converts to Vec

* Rename load_client_connect_config -> load_client_connect_options

* Format envconfig.rs

* Nits

* Update core submodule

* Revert irrelevant cancellation changes

* Update Cargo.lock

* Propagate set_headers failure in client bridge

* Add experimental warning

---------

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chad Retz <chad.retz@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants