Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new system option "delay_start_worker" #3749

Closed
fujimotos opened this issue May 19, 2022 · 10 comments · Fixed by #3768
Closed

Add a new system option "delay_start_worker" #3749

fujimotos opened this issue May 19, 2022 · 10 comments · Fixed by #3768
Assignees

Comments

@fujimotos
Copy link
Member

fujimotos commented May 19, 2022

Is your feature request related to a problem? Please describe.

When a worker process died for some reason, Fluentd always tries
to restart the worker as soon as possible.

This "always-restart-immdeiately" policy does not always work fine.
For example, think the following cases:

  • The worker process was killed because the host OS is almost running
    out of the available memory at the moment.

  • The worker process was killed because the host OS tries to perform
    operation but Fluentd locked the resource (e.g. file locks).

Describe the solution you'd like

Systemd provides RestartSec option that allows to wait for a few
seconds before restarting a service to solve the same issue.

https://www.freedesktop.org/software/systemd/man/systemd.service.html#RestartSec=

It would be better if Fluentd's supervisor can provide a similar option too.

Describe alternatives you've considered

NA

Additional context

The underlying serverengine has a feature called "delayed_start_worker".
The option described above can be iimplemented on it (we need to tweak the
serverengine as well, though)

https://github.com/treasure-data/serverengine/blob/master/lib/serverengine/multi_worker_server.rb#L132-L144

@daipom daipom self-assigned this May 19, 2022
@daipom
Copy link
Contributor

daipom commented May 25, 2022

The underlying serverengine has a feature called "delayed_start_worker". The option described above can be iimplemented on it (we need to tweak the serverengine as well, though)

https://github.com/treasure-data/serverengine/blob/master/lib/serverengine/multi_worker_server.rb#L132-L144

That parameter delayed_start_worker seems to work to set the interval when multiple workers are started sequentially.

If setting delayed_start_worker: 10, worker 0 is started immediately and worker 1 is started 10 seconds after the start of worker 0.

I expect to add a new option to ServerEngine and extend the logic around this to implement this.

@daipom
Copy link
Contributor

daipom commented May 25, 2022

I created a POC.

daipom/serverengine@65ce4e2

If restarting just one worker, this will work, but if restarting multiple workers at the same time,
restarting each worker is delayed one by one.
This is not a good behavior.

I think a good solution is to let each WorkerMonitor manage its sleeping state.
I will implement it.

@fujimotos
Copy link
Member Author

I think a good solution is to let each WorkerMonitor manage its sleeping state.
I will implement it.

Looking at your patch, what do you think about this approach?

# First, get the base timestamp 
checkstart = Time.now.to_f

@monitors.each_with_index do |m,wid|
  ...
  elsif wid < @num_workers
    # scale up or reboot
    unless @stop
      unless m
        # Now, we do something like this in start_new_worker():
        #    wait = checkstart + restart_worker_delay - Time.now.to_f
        #    sleep wait if wait > 0
        @monitors[wid] = start_new_worker(wid, checkstart)
      else
      ...
  end
end

In this way, the manager will wait restart_worker_delay seconds before
re-launching the first worker, but won't wait for the subsequent workers.

@daipom
Copy link
Contributor

daipom commented May 25, 2022

Thank you for the advice.
Actually, I have already implemented the new POC:

I'll think about it, including the ideas you gave me.

@daipom
Copy link
Contributor

daipom commented May 25, 2022

This poc-2 works well, except that the fix is bigger.

And we need to think about co-existing with the existing start_worker_delay option.
I think this is to prevent multiple workers from starting up at the same time, also when restarting.

@fujimotos
Copy link
Member Author

Good. Let's implement the best idea!

@daipom
Copy link
Contributor

daipom commented May 25, 2022

Thank you!

This poc-2 works well, except that the fix is bigger.

And we need to think about co-existing with the existing delayed_start_worker option. I think this is to prevent multiple workers from starting up at the same time, also when restarting.

I fixed this:

Now the compatibility with the existing option start_worker_delay has been achieved.

This poc-2 would be one possible fix to this issue.

TODO

  • Add test
  • Improve the option-name (The current name is confusing)

@daipom
Copy link
Contributor

daipom commented May 25, 2022

In this way, the manager will wait restart_worker_delay seconds before re-launching the first worker, but won't wait for the subsequent workers.

Indeed, this approach could solve the problem with a simpler fix.

I try this as a poc-3

I think the direction is right, but somehow it's not working the way it's supposed to...

@daipom
Copy link
Contributor

daipom commented May 26, 2022

I created PR for ServerEngine based on the poc-2.

@daipom
Copy link
Contributor

daipom commented Jun 10, 2022

treasure-data/serverengine#120

Merged to ServerEngine.

ashie added a commit that referenced this issue Jun 13, 2022
* Enable to set interval of restarting workers
  See #3749 and #3768
* Fix a bug that fluentd doesn't release its own log file even after
  rotated by external tools
  Fix #2200 and #2436

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
ashie added a commit that referenced this issue Jun 13, 2022
* Enable to set interval of restarting workers
  See #3749 and #3768
* Fix a bug that fluentd doesn't release its own log file even after
  rotated by external tools
  Fix #2200 and #2426

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants