
Add interval to restart worker #120

Merged
merged 12 commits into treasure-data:master from add-interval-restarting-worker
Jun 10, 2022

Conversation

daipom
Contributor

@daipom daipom commented May 26, 2022

Issue: fluent/fluentd#3749

This fix adds a new option, restart_worker_interval, for the multithread/multiprocess servers.

This option prevents workers from being restarted immediately after they are killed,
which is useful when you want workers to stay stopped for a while.

This fix is compatible with the existing start_worker_delay option,
which prevents multiple workers from starting or restarting at the same time.
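For illustration, the two options could be combined like this (a hypothetical configuration sketch; the values are made up, and only the option names come from this PR):

```ruby
# Hypothetical ServerEngine configuration sketch combining the new option
# with the existing start_worker_delay (values are illustrative only).
config = {
  worker_type: 'process',
  workers: 2,
  restart_worker_interval: 5, # seconds a dead worker stays down before restarting
  start_worker_delay: 1,      # stagger between individual worker (re)starts
}
```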

TODO

  • Add test

@daipom daipom changed the title Draft: Add interval to restart worker Add interval to restart worker May 30, 2022
@daipom
Contributor Author

daipom commented May 31, 2022

Some tests fail on Windows with Ruby 3 (3.0.3, 3.1).
Some of the newly added tests are among them.

These probably have the same cause.
It seems that test_state sometimes does not return the correct value.

Example (on Windows with Ruby 3.0.3):

  6) ServerEngine::Supervisor when using pipe as command_sender immediate stop
     Failure/Error: test_state(:server_stop).should == 1
       expected: 1
            got: 2 (using ==)
     # ./spec/supervisor_spec.rb:113:in `block (4 levels) in <top (required)>'

  7) ServerEngine::Supervisor when using pipe as command_sender graceful restart
     Failure/Error: test_state(:server_stop).should == 1
       expected: 1
            got: 3 (using ==)
     # ./spec/supervisor_spec.rb:135:in `block (4 levels) in <top (required)>'

  8) ServerEngine::Supervisor when using pipe as command_sender immediate restart
     Failure/Error: test_state(:server_stop).should == 1
       expected: 1
            got: 4 (using ==)
     # ./spec/supervisor_spec.rb:159:in `block (4 levels) in <top (required)>'

  9) ServerEngine::Supervisor when using pipe as command_sender reload
     Failure/Error: test_state(:server_stop).should == 1
       expected: 1
            got: 5 (using ==)
     # ./spec/supervisor_spec.rb:182:in `block (4 levels) in <top (required)>'

It appears that state.yml is not initialized correctly at https://github.com/treasure-data/serverengine/blob/master/spec/server_worker_context.rb#L7-L8

But I have no idea what causes that.

daipom added a commit to daipom/fluentd that referenced this pull request May 31, 2022
This fix needs the fix of ServerEngine:

* treasure-data/serverengine#120

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
@fujimotos
Contributor

@daipom OK. I spent this evening testing this patch. I could confirm
that the new restart_worker_interval option works.

Here are some snippets from my testing.

# Multi-Process server code retrieved from README.md
# with workers=2 + restart_worker_interval=5.
#
# Process 18222 got killed at 17:16:28, and started running
# again at 17:16:33 (as 18233).
$ bundle exec test-server.rb
I, [2022-05-31T17:16:27.304833 #18229]  INFO -- : Awesome work!
I, [2022-05-31T17:16:28.304004 #18226]  INFO -- : Awesome work!
E, [2022-05-31T17:16:28.306409 #18222] ERROR -- : Worker 1 finished unexpectedly with status 0
I, [2022-05-31T17:16:29.304818 #18226]  INFO -- : Awesome work!
I, [2022-05-31T17:16:30.305964 #18226]  INFO -- : Awesome work!
I, [2022-05-31T17:16:31.307159 #18226]  INFO -- : Awesome work!
I, [2022-05-31T17:16:32.308340 #18226]  INFO -- : Awesome work!
I, [2022-05-31T17:16:33.309509 #18226]  INFO -- : Awesome work!
I, [2022-05-31T17:16:33.804000 #18233]  INFO -- : Awesome work!

That said, the worrying part of the current patch is that it seems
rather too big. Excluding the test and documentation changes,
it already contains more than 100 changed lines:

$ git fetch https://github.com/daipom/serverengine add-interval-restarting-worker
From https://github.com/daipom/serverengine
 * branch                      add-interval-restarting-worker -> FETCH_HEAD
$ git diff FETCH_HEAD --stat lib/ 
 lib/serverengine/multi_process_server.rb |  9 ++----
 lib/serverengine/multi_thread_server.rb  |  7 +----
 lib/serverengine/multi_worker_server.rb  | 93 +++++----------------------------------------------------
 3 files changed, 12 insertions(+), 97 deletions(-)

That's not necessarily a bad thing, but I do feel that the change
we actually need is only a tiny fraction of it.

Here's a simplified version of the patch. It works as follows:

  1. Add a new instance variable @restart_at to WorkerMonitor.
  2. If a worker is dead and @restart_at is nil, schedule the restart
     at Time.now + @restart_worker_interval.
  3. Once the current time exceeds @restart_at, restart the worker.

That way, it works just like the current version, except it only
requires about 20 lines. What do you think?

 lib/serverengine/multi_process_server.rb |  2 ++
 lib/serverengine/multi_thread_server.rb  |  3 +++
 lib/serverengine/multi_worker_server.rb  | 15 ++++++++++++++-
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/lib/serverengine/multi_process_server.rb b/lib/serverengine/multi_process_server.rb
index 19cd72a73842..41761b4efbb1 100644
--- a/lib/serverengine/multi_process_server.rb
+++ b/lib/serverengine/multi_process_server.rb
@@ -105,9 +105,11 @@ module ServerEngine
         @unrecoverable_exit_codes = unrecoverable_exit_codes
         @unrecoverable_exit = false
         @exitstatus = nil
+        @restart_at = nil
       end
 
       attr_reader :exitstatus
+      attr_accessor :restart_at
 
       def send_stop(stop_graceful)
         @stop = true
diff --git a/lib/serverengine/multi_thread_server.rb b/lib/serverengine/multi_thread_server.rb
index 0b3e2d121619..6615937d8f9d 100644
--- a/lib/serverengine/multi_thread_server.rb
+++ b/lib/serverengine/multi_thread_server.rb
@@ -39,8 +39,11 @@ module ServerEngine
       def initialize(worker, thread)
         @worker = worker
         @thread = thread
+        @restart_at = nil
       end
 
+      attr_accessor :restart_at
+
       def send_stop(stop_graceful)
         Thread.new do
           begin
diff --git a/lib/serverengine/multi_worker_server.rb b/lib/serverengine/multi_worker_server.rb
index 60b650fe1196..d8abacaf267c 100644
--- a/lib/serverengine/multi_worker_server.rb
+++ b/lib/serverengine/multi_worker_server.rb
@@ -85,6 +85,7 @@ module ServerEngine
 
       @start_worker_delay = @config[:start_worker_delay] || 0
       @start_worker_delay_rand = @config[:start_worker_delay_rand] || 0.2
+      @restart_worker_interval = @config[:restart_worker_interval] || 0
 
       scale_workers(@config[:workers] || 1)
 
@@ -116,7 +117,11 @@ module ServerEngine
         elsif wid < @num_workers
           # scale up or reboot
           unless @stop
-            @monitors[wid] = delayed_start_worker(wid)
+            if m and @restart_worker_interval > 0
+              restart_worker(m, wid)
+            else
+              @monitors[wid] = delayed_start_worker(wid)
+            end
             num_alive += 1
           end
 
@@ -129,6 +134,14 @@ module ServerEngine
       return num_alive
     end
 
+    def restart_worker(m, wid)
+      if m.restart_at.nil?
+        m.restart_at = Time.now() + @restart_worker_interval
+      elsif m.restart_at <= Time.now()
+        @monitors[wid] = start_worker(wid)
+      end
+    end
+
     def delayed_start_worker(wid)
       if @start_worker_delay > 0
         delay = @start_worker_delay +
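The scheduling logic in restart_worker above can be exercised in isolation. Here is a minimal sketch; FakeMonitor and restart_decision are made-up stand-ins for WorkerMonitor and restart_worker, not ServerEngine API:

```ruby
# Minimal stand-in for the WorkerMonitor restart scheduling in the diff above.
# FakeMonitor and restart_decision are illustrative names, not ServerEngine API.
class FakeMonitor
  attr_accessor :restart_at
end

# Mirrors restart_worker: on first sight of a dead worker, schedule its
# restart; once the scheduled time has passed, actually restart it.
def restart_decision(monitor, now, interval)
  if monitor.restart_at.nil?
    monitor.restart_at = now + interval
    :scheduled
  elsif monitor.restart_at <= now
    :restart
  else
    :wait
  end
end

m = FakeMonitor.new
t0 = Time.at(0)
restart_decision(m, t0, 5)     # => :scheduled (restart_at = t0 + 5)
restart_decision(m, t0 + 3, 5) # => :wait (interval not elapsed yet)
restart_decision(m, t0 + 5, 5) # => :restart
```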

@daipom
Contributor Author

daipom commented Jun 1, 2022

@fujimotos Thank you for the feedback!

I'd like to keep maintainability in mind, but there certainly seem to be some unnecessarily large modifications!
In particular, it seems we do not need to create a base class like WorkerMonitorBase.

I'll try removing it first.

@daipom daipom force-pushed the add-interval-restarting-worker branch from 83b450f to bbbd8b7 Compare June 1, 2022 01:38
@daipom
Contributor Author

daipom commented Jun 1, 2022

How about this? It tries to keep the diff small while keeping the logic clear.

% git diff FETCH_HEAD --stat lib/ 
 lib/serverengine/multi_process_server.rb |  2 ++
 lib/serverengine/multi_thread_server.rb  |  3 +++
 lib/serverengine/multi_worker_server.rb  | 41 +++++++++++++++++++++++++++++++++--------
 3 files changed, 38 insertions(+), 8 deletions(-)

@daipom
Contributor Author

daipom commented Jun 1, 2022

In 5df9c68, I changed the test code a little to avoid the test failures,
although I don't know whether this will actually work around them.

@ashie
Collaborator

ashie commented Jun 1, 2022

Hmm, a test still fails on Ruby 3.0 and 3.1:

Failures:

  1) ServerEngine::MultiSpawnServer keepalive_workers default 
     Failure/Error: monitors.count { |m| m.alive? }.should == workers
       expected: 3
            got: 2 (using ==)
     # ./spec/multi_spawn_server_spec.rb:65:in `block (4 levels) in <top (required)>'

@daipom
Contributor Author

daipom commented Jun 1, 2022

For some reason, tests unrelated to this fix no longer fail..

@fujimotos
Contributor

fujimotos commented Jun 1, 2022

OK, this is much better! I think this is committable once the
Ruby unit tests run green.

  • start_worker_delay sets the delay between each worker-start
  • restart_worker_interval sets wait time before restarting a stopped worker

One possible further improvement is the configuration naming.

Since we already have this option called start_worker_delay,
I gather it makes more sense to use the same terminology here
(such as "restart_worker_delay").

But this is just a casual observation, and I don't have a dog in this fight.

Failure/Error: monitors.count { |m| m.alive? }.should == workers
expected: 3
got: 2 (using ==)

Talking about the test failure, it appears that something is wrong
with the sleep() call before this assertion.

sleep(start_worker_delay * (workers - 1)) 
monitors.count { |m| m.alive? }.should == workers

My psychic guess is that it's probably because we do not take
@last_start_worker_time into account in the computation here, but we
need to trace the test execution to determine the actual cause.
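A rough sketch of that guess, assuming the per-worker delay includes a random component of up to start_worker_delay * start_worker_delay_rand (those option names come from the diff above; the concrete values here are made up):

```ruby
# Rough arithmetic behind the guess above (values are illustrative).
# If delayed_start_worker sleeps start_worker_delay plus a random component
# of up to start_worker_delay * start_worker_delay_rand per start, then the
# spec's fixed sleep of start_worker_delay * (workers - 1) can wake before
# the last worker has actually started.
start_worker_delay      = 3.0
start_worker_delay_rand = 0.2
workers                 = 3

naive_wait = start_worker_delay * (workers - 1) # what the spec sleeps
worst_case = start_worker_delay * (1 + start_worker_delay_rand) * (workers - 1)
# naive_wait => 6.0 while worst_case => ~7.2, i.e. the assertion may run
# up to ~1.2 seconds before the last worker is alive.
```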

@daipom
Contributor Author

daipom commented Jun 1, 2022

This could be the cause.
This log line reproduces in my local environment when testing with only start_worker_delay:

E, [2022-06-01T02:21:25.335585 #5692] ERROR -- : Worker 0 finished unexpectedly with status 0

@daipom
Contributor Author

daipom commented Jun 1, 2022

It seems that the worker always stops after about 26 seconds in these tests.
This may be what makes the tests fail.

Example: ServerEngine::MultiSpawnServer starts worker processes with command_sender=pipe

  • This is OK because the sleep time is less than 26 seconds.
@@ -17,6 +17,9 @@ describe ServerEngine::MultiSpawnServer do
           Timeout.timeout(5) do
             sleep(0.5) until test_state(:worker_run) == 2
           end
+
+          sleep(24)
+
           test_state(:worker_run).should == 2
         ensure
% bundle exec rspec spec\multi_spawn_server_spec.rb -e "starts worker processes"
Run options: include {:full_description=>/starts\ worker\ processes/}
I, [2022-06-01T17:00:56.481793 #18280]  INFO -- : Received graceful stop
I, [2022-06-01T17:00:56.719072 #18280]  INFO -- : Worker 0 finished with status 0
I, [2022-06-01T17:00:56.719695 #18280]  INFO -- : Worker 1 finished with status 0
.

Finished in 25.76 seconds
1 example, 0 failures
  • This is NG because the sleep time is longer than 26 seconds.
@@ -17,6 +17,9 @@ describe ServerEngine::MultiSpawnServer do
           Timeout.timeout(5) do
             sleep(0.5) until test_state(:worker_run) == 2
           end
+
+          sleep(28)
+
           test_state(:worker_run).should == 2
         ensure
% bundle exec rspec spec\multi_spawn_server_spec.rb -e "starts worker processes"
Run options: include {:full_description=>/starts\ worker\ processes/}
E, [2022-06-01T17:01:55.061765 #17836] ERROR -- : Worker 0 finished unexpectedly with status 0
E, [2022-06-01T17:01:55.081210 #17836] ERROR -- : Worker 1 finished unexpectedly with status 0
I, [2022-06-01T17:01:58.352816 #17836]  INFO -- : Received graceful stop
I, [2022-06-01T17:01:58.631026 #17836]  INFO -- : Worker 0 finished with status 0
I, [2022-06-01T17:01:58.631597 #17836]  INFO -- : Worker 1 finished with status 0
F

Failures:

  1) ServerEngine::MultiSpawnServer starts worker processes with command_sender=pipe
     Failure/Error: test_state(:worker_run).should == 2
       expected: 2
            got: 4 (using ==)
     # ./spec/multi_spawn_server_spec.rb:24:in `block (4 levels) in <top (required)>'

Finished in 29.79 seconds
1 example, 1 failure

@daipom
Contributor Author

daipom commented Jun 1, 2022

The currently failing test, ServerEngine::MultiSpawnServer keepalive_workers with only start_worker_delay,
is affected by this because it sleeps so long after the restart that it exceeds this 26-second period.

@daipom
Contributor Author

daipom commented Jun 1, 2022

This seems to be intentional, judging by this code:

https://github.com/treasure-data/serverengine/blob/master/spec/server_worker_context.rb#L166-L174

  def run
    incr_test_state :worker_run
    5.times do
      # repeats 5 times because signal handlers
      # interrupts wait
      @stop_flag.wait(5.0)
    end
    @stop_flag.reset!
  end
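This matches the observed ~26 seconds: the run loop above waits on @stop_flag five times with a 5.0-second timeout each. A trivial check of that arithmetic:

```ruby
# The worker's run loop waits on @stop_flag 5 times with a 5.0s timeout
# each, so an unstopped worker lives for roughly 25 seconds; the observed
# ~26 seconds includes process startup overhead.
wait_repeats = 5
wait_timeout = 5.0
max_worker_lifetime = wait_repeats * wait_timeout # => 25.0 seconds
```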

@daipom daipom force-pushed the add-interval-restarting-worker branch from 3a013cf to feeef16 Compare June 1, 2022 09:36
@daipom
Contributor Author

daipom commented Jun 1, 2022

I think this fix will cure the test failures, so could you please run the tests again?

  • start_worker_delay sets the delay between each worker-start
  • restart_worker_interval sets wait time before restarting a stopped worker

One possible further improvement is the configuration naming.

Since we already have this option called start_worker_delay, I gather it makes more sense to use the same terminology here (such as "restart_worker_delay").

That bothered me too.
Actually, I'm using different terminology intentionally,
since start_worker_delay applies to restarting too.

I thought it would be less confusing to distinguish between delay and interval
than between start and restart.

However, if this is actually more confusing, it would be better to change it.

What do you think?

@fujimotos fujimotos self-requested a review June 2, 2022 00:56
Contributor

@fujimotos fujimotos left a comment


OK. I'm fine with this patch.

@ashie
Collaborator

ashie commented Jun 3, 2022

The patch itself seems good, but it makes the whole test run take over twice as long as before.
Can we use timecop to reduce the test time?

@daipom
Contributor Author

daipom commented Jun 3, 2022

I think the new tests sleep for more than a minute in total.
Using timecop for the new tests would greatly reduce the time. I will try it.
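The idea behind timecop is to stub Time.now so that time-based logic can be tested without real sleeps. A minimal, gem-free illustration of that idea (FrozenClock is a made-up stand-in; the actual timecop gem patches Time globally and is more robust):

```ruby
# Gem-free illustration of timecop-style time travel: inject a fake clock
# so interval logic is testable without sleeping. FrozenClock is a made-up
# name; the real timecop gem stubs Time.now/Time.new globally instead.
class FrozenClock
  attr_reader :now

  def initialize(at)
    @now = at
  end

  def travel(seconds)
    @now += seconds
  end
end

clock = FrozenClock.new(Time.at(1_000))
restart_at = clock.now + 60 # worker scheduled to restart in 60s
clock.travel(61)            # "advance" the clock instead of sleeping
clock.now >= restart_at     # => true: the restart is now due
```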

@daipom
Contributor Author

daipom commented Jun 8, 2022

I have added a fix to reduce the test time.

This reduces the test time by about one minute.
In the end, the newly added tests increase the test time by about 40 seconds.

@daipom
Contributor Author

daipom commented Jun 8, 2022

It fails on Windows again.. :(
It's tricky, since it succeeds on my local Windows machine... I'll check.
Sorry for the trouble.

@ashie
Collaborator

ashie commented Jun 8, 2022

Hmm, still unstable on Windows: https://github.com/treasure-data/serverengine/actions/runs/2459932320

@ashie
Collaborator

ashie commented Jun 8, 2022

It fails on Windows again.. :(
It's tricky, since it succeeds on my local Windows machine... I'll check.
Sorry for the trouble.

Don't worry, tuning test timing on Windows is hard; I often run into trouble with it too.

@daipom
Contributor Author

daipom commented Jun 8, 2022

It fails on Windows again.. :(
It's tricky, since it succeeds on my local Windows machine... I'll check.
Sorry for the trouble.

Don't worry, tuning test timing on Windows is hard; I often run into trouble with it too.

Thank you.
Starting or killing multiple processes seems to be extremely slow on Windows!
I made a fix to allow more time.

@ashie ashie force-pushed the add-interval-restarting-worker branch from 4cc76de to a25ab41 Compare June 9, 2022 01:37
@ashie
Collaborator

ashie commented Jun 9, 2022

I've rebased this on top of master branch.

@daipom
Contributor Author

daipom commented Jun 9, 2022

Thank you.

Unit testing with Ruby 2.7 on ubuntu-latest

  1) ServerEngine::MultiSpawnServer keepalive_workers with only restart_worker_interval 
     Failure/Error: -> {
       Not all workers restarted correctly.
     # ./spec/multi_spawn_server_spec.rb:111:in `block (4 levels) in <top (required)>'

This shows that it can take more than 5 seconds to start 3 workers on Ubuntu as well as Windows.

Unit testing with Ruby 2.7 on windows-latest

  1) ServerEngine::MultiSpawnServer keepalive_workers default 
     Failure/Error: server.stop(true)
     Errno::EPIPE:
       Broken pipe
     # ./lib/serverengine/process_manager.rb:336:in `write'
     # ./lib/serverengine/process_manager.rb:336:in `send_command'
     # ./lib/serverengine/process_manager.rb:378:in `start_graceful_stop!'
     # ./lib/serverengine/multi_process_server.rb:117:in `send_stop'
     # ./lib/serverengine/multi_worker_server.rb:35:in `block in stop'
     # ./lib/serverengine/multi_worker_server.rb:34:in `each'
     # ./lib/serverengine/multi_worker_server.rb:34:in `stop'
     # ./spec/multi_spawn_server_spec.rb:71:in `block (4 levels) in <top (required)>'

This error is completely unexpected.

Unit testing with Ruby 3.1 on windows-latest

Some tests that have nothing to do with this fix are even failing...

@ashie
Collaborator

ashie commented Jun 9, 2022

Hmm, I can't reproduce them in my local environment; all tests pass...

@ashie
Collaborator

ashie commented Jun 9, 2022

Unit testing with Ruby 2.7 on windows-latest

  1) ServerEngine::MultiSpawnServer keepalive_workers default 
     Failure/Error: server.stop(true)
     Errno::EPIPE:
       Broken pipe
     # ./lib/serverengine/process_manager.rb:336:in `write'
     # ./lib/serverengine/process_manager.rb:336:in `send_command'
     # ./lib/serverengine/process_manager.rb:378:in `start_graceful_stop!'
     # ./lib/serverengine/multi_process_server.rb:117:in `send_stop'
     # ./lib/serverengine/multi_worker_server.rb:35:in `block in stop'
     # ./lib/serverengine/multi_worker_server.rb:34:in `each'
     # ./lib/serverengine/multi_worker_server.rb:34:in `stop'
     # ./spec/multi_spawn_server_spec.rb:71:in `block (4 levels) in <top (required)>'

This error is completely unexpected.

send_command shouldn't raise an error, just as send_signal doesn't.
I'll fix that in #125.

@daipom
Contributor Author

daipom commented Jun 9, 2022

Thank you! That is certainly true!

But I don't know why this error occurs.
Does this mean some processes had somehow already died by the time of send_command?
If it is just a cleanup issue in the test environment, it may be ignorable... or maybe I should look into it a little more carefully...

daipom added 11 commits June 9, 2022 16:11
Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Added a new option: `restart_worker_interval`.
This option can work with the existing option `start_worker_delay`.

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
This explanation was confusing with the new option `restart_worker_interval`,
so I made the difference clear.

Note: The behavior of `start_worker_delay` option doesn't change, this
just makes the explanation clearer.

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
This fix causes workers to remain stopped for a while when they are
scheduled to restart later.

Workers in that state should be included in this count, because the count
should not break when there are workers scheduled to restart.

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Some new tests need a longer timeout.

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Use `timecop` and shorten some delay options to reduce test time.

`$ bundle exec rspec spec/multi_spawn_server_spec.rb -e "keepalive_workers"`

- before: 1m51.6s
- after: 41.6s

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
@ashie ashie force-pushed the add-interval-restarting-worker branch from a25ab41 to bf88f9a Compare June 9, 2022 07:11
Signed-off-by: Takuro Ashie <ashie@clear-code.com>
@ashie
Collaborator

ashie commented Jun 10, 2022

It seems that CI became stable after merging #121, #125, and #126.

@ashie
Collaborator

ashie commented Jun 10, 2022

So I'll merge this.
If you find any other problems, please open a new issue or pull request.

@ashie ashie merged commit dcc591a into treasure-data:master Jun 10, 2022
@daipom
Contributor Author

daipom commented Jun 10, 2022

Thank you so much for the fix and the merge.

I wonder why that solves the restart timeouts,
but it will be good to see how things go from here.

If you find any other problems, please open a new issue or pull request.

Sure! Thank you!

@daipom daipom deleted the add-interval-restarting-worker branch June 10, 2022 02:30
daipom added a commit to daipom/fluentd that referenced this pull request Jun 10, 2022
This fix needs the fix of ServerEngine:

* treasure-data/serverengine#120

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
daipom added a commit to daipom/fluentd that referenced this pull request Jun 14, 2022
This fix needs the fix of ServerEngine:

* treasure-data/serverengine#120

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>