Skip to content

Remove unused __started Event in Server #6615

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 27, 2022

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Jun 23, 2022

I fixed this issue before for the worker, see #5910, but apparently the scheduler is now affected as well.
From what I can gather this is related to many recent port collisions on CI

see also coiled/benchmarks#170

I still need to figure out how to test this the best way. This is not a permanent fix but I think it should resolve the problem at hand.
I consider #6616 as a permanent fix

I believe no servers can be actually restarted this way after #5910. coiled/benchmarks#170 was running on 2022.1.0 which didn't include this fix. I was not able to reproduce this otherwise. I opted to remove the unused event instead.

@fjetter fjetter changed the title Servers cannot restart again Servers are not restarted on incoming request Jun 23, 2022
@fjetter fjetter requested a review from graingert June 23, 2022 10:03
@fjetter
Copy link
Member Author

fjetter commented Jun 23, 2022

I can't reproduce this issue using any tests, yet. maybe #5910 was already a fix.

However, our old friend test_dashboard_port_zero is failing. It picks proper ports, the scheduler comes up but judging from the timestamps, the scheduler closes pretty soon again

2022-06-23 10:48:33,378 - distributed.scheduler - INFO - -----------------------------------------------
2022-06-23 10:48:35,201 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-06-23 10:48:35,324 - distributed.scheduler - INFO - State start
2022-06-23 10:48:35,329 - distributed.scheduler - INFO - -----------------------------------------------
2022-06-23 10:48:35,329 - distributed.scheduler - INFO - Clear task state
2022-06-23 10:48:35,330 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:49760
2022-06-23 10:48:35,330 - distributed.scheduler - INFO -   dashboard at:           127.0.0.1:49773
2022-06-23 10:48:35,825 - distributed.scheduler - INFO - Scheduler closing...
2022-06-23 10:48:35,825 - distributed.scheduler - INFO - Scheduler closing all comms
2022-06-23 10:48:35,826 - distributed.scheduler - INFO - Stopped scheduler at 'tcp://127.0.0.1:49760'
2022-06-23 10:48:35,827 - distributed.scheduler - INFO - End scheduler

We can see that almost exactly 500ms from the start 35,324, the scheduler closes at 35,825.

Is anybody aware of an interval we set to 500ms that could cause a scheduler to close?

@fjetter
Copy link
Member Author

fjetter commented Jun 23, 2022

Another confusing part is that the CI run https://github.com/dask/distributed/runs/7021261881?check_suite_focus=true

shows the following code for the test

deftest_dashboard_port_zero(loop):
        pytest.importorskip("bokeh")
        port = open_port()
with popen(
            [
"dask-scheduler",
"--host",
f"127.0.0.1:{port}",
"--dashboard-address",
":0",
            ],
        ):
>           with Client(f"tcp://127.0.0.1:{port}", loop=loop) as c:

which is the new version of this test after #6591

but my branch should still include the old version of the test, see https://github.com/fjetter/distributed/blob/eb679cc3912fa33cd6f24575973367ce99d71fcd/distributed/cli/tests/test_dask_scheduler.py#L217-L226

@github-actions
Copy link
Contributor

github-actions bot commented Jun 23, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0         15 suites  ±0   10h 21m 30s ⏱️ - 37m 50s
  2 890 tests ±0    2 805 ✔️ +2    84 💤 +1  1  - 2 
21 408 runs  ±0  20 437 ✔️  - 1  970 💤 +6  1  - 4 

For more details on these failures, see this check.

Results for commit d22cbbb. ± Comparison against base commit 4779f84.

♻️ This comment has been updated with latest results.

@fjetter
Copy link
Member Author

fjetter commented Jun 23, 2022

well, I rebased and cherry-picked 9f9e1c2 onto it
Let's see what CI thinks

@fjetter fjetter requested review from crusaderky and gjoseph92 June 23, 2022 15:37
@crusaderky
Copy link
Collaborator

crusaderky commented Jun 27, 2022

Another confusing part is that the CI run [...] shows [...] the new version of this test after #6591 but my branch should still include the old version of the test

When a github action that contains

      - name: Checkout source
        uses: actions/checkout@v2

is triggered by a pull_request event, what's checked out is a copy of main with your branch merged in.
The tests in https://github.com/fjetter/distributed/actions, which are instead triggerd by the push event, will contain the old code.

@fjetter fjetter force-pushed the fix_scheduler_restarting branch from 99b5220 to a6aa916 Compare June 27, 2022 12:37
@fjetter fjetter changed the title Servers are not restarted on incoming request Remove unused __started attribute in Server Jun 27, 2022
@fjetter fjetter changed the title Remove unused __started attribute in Server Remove unused __started Event in Server Jun 27, 2022
@crusaderky crusaderky merged commit 54cd71d into dask:main Jun 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants