Skip to content

Conversation

@fjetter
Copy link
Member

@fjetter fjetter commented Apr 30, 2025

I ended up being able to reproduce the TLS errors after all. However, I cannot track the issue down. For some reason, the connections close and I cannot find the culprit.

One thing I'm attempting here is to refactor the timeout hierarchy to be more sensible.

On main

  • The entire things is wrapped in an asyncio.wait_for(..., 2 * _TEST_TIMEOUT), i.e. 60s
  • The cluster factory is retrying 60 times with a sleep of 1s, i.e. this is in practice longer than the outer timeout
  • start_cluster itself is running for another +30s, hard coded until it times out. With some overhead, this very likely means that the second iteration is hitting the outermost timeout
  • Yet deeper inside, we're timing out individual workers with a death_timeout of min(15, deadline.remaining), i.e. this is shrinking and can even go down to 0s. This is not a big problem but it if obfuscating errors

This PR proposes

  • keep the outer timeout as is
  • Retry until the outer timeout is hit
  • Don't wait between retries. In test scenarios we're not dealing with overloaded remote servers so this second sleep won't help
  • Remove the hard coded 30s timeout in favor of a fraction of the outer timeout (in this case 25%). We might want to make this much more aggressive since 15s to start a couple of subprocesses is still insanely slow.
  • Remove the death timeout entirely

@fjetter fjetter force-pushed the refactor_timeouts_in_start_cluster branch from a5ae716 to a24da40 Compare April 30, 2025 14:12
@github-actions
Copy link
Contributor

github-actions bot commented Apr 30, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    27 files  ±0      27 suites  ±0   10h 20m 40s ⏱️ + 4m 25s
 4 107 tests ±0   3 990 ✅ ± 0    105 💤 ± 0  2 ❌ ±0  10 🔥 ±0 
47 447 runs  ±0  45 240 ✅ +16  2 185 💤  - 16  2 ❌ ±0  20 🔥 ±0 

For more details on these failures and errors, see this check.

Results for commit c0b6160. ± Comparison against base commit bf6d37f.

♻️ This comment has been updated with latest results.

@fjetter fjetter force-pushed the refactor_timeouts_in_start_cluster branch from a24da40 to c0b6160 Compare May 5, 2025 06:56
@fjetter fjetter marked this pull request as ready for review May 5, 2025 06:56
@fjetter
Copy link
Member Author

fjetter commented May 5, 2025

A bunch of other (I believe unrelated) things are lighting up but I'm inclined to merge and see what the test reports show once this is on main.

@fjetter fjetter merged commit 358402d into dask:main May 5, 2025
26 of 33 checks passed
@fjetter fjetter deleted the refactor_timeouts_in_start_cluster branch May 5, 2025 11:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant