Add a new system option "delay_start_worker" #3749
Comments
That parameter If setting I expect to add a new option to ServerEngine and extend the logic around this to implement this. |
I created a POC. If restarting just one worker, this will work, but if restarting multiple workers at the same time, I think a good solution is to let each WorkerMonitor manage its sleeping state. |
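As a rough illustration of the "each monitor manages its own sleeping state" idea, here is a minimal sketch. The `WorkerSlot` class, its method names, and the `restart_delay` parameter are hypothetical and are not part of ServerEngine; the point is only that the supervisor loop records when a worker died and checks the elapsed time on each pass instead of calling `sleep`.

```ruby
# Hypothetical sketch, not ServerEngine code: each worker slot remembers
# when its worker died, so the supervisor loop never has to block in sleep.
class WorkerSlot
  def initialize(restart_delay)
    @restart_delay = restart_delay # seconds to wait before restarting
    @died_at = nil                 # nil while the worker is considered alive
  end

  # Record the first moment the worker was seen dead (later calls are no-ops).
  def mark_dead(now)
    @died_at ||= now
  end

  # True once the worker is dead and the delay has fully elapsed.
  def restartable?(now)
    !@died_at.nil? && (now - @died_at) >= @restart_delay
  end

  # Reset the state after the worker has been started again.
  def mark_started
    @died_at = nil
  end
end

# Two workers die at the same moment; both become restartable together.
slots = [WorkerSlot.new(2.0), WorkerSlot.new(2.0)]
slots.each { |slot| slot.mark_dead(10.0) }
early = slots.map { |slot| slot.restartable?(11.0) } # one second after death
later = slots.map { |slot| slot.restartable?(12.0) } # two seconds after death
```

Because every slot keeps its own timestamp, restarting several workers at once does not serialize their delays.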
Looking at your patch, what do you think about this approach?
# First, get the base timestamp
checkstart = Time.now.to_f
@monitors.each_with_index do |m, wid|
  ...
  elsif wid < @num_workers
    # scale up or reboot
    unless @stop
      unless m
        # Now, we do something like this in start_new_worker():
        #   wait = checkstart + restart_worker_delay - Time.now.to_f
        #   sleep wait if wait > 0
        @monitors[wid] = start_new_worker(wid, checkstart)
      else
        ...
      end
    end
In this way, the manager will wait |
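The wait computation in the commented lines above can be isolated as a small function. This is only a sketch: `remaining_restart_delay` is a hypothetical helper name, and `restart_worker_delay` is the name used in the comment, not a confirmed option.

```ruby
# Hypothetical helper: seconds left to sleep before starting a worker,
# measured from a base timestamp shared by all workers in one pass.
def remaining_restart_delay(checkstart, restart_worker_delay, now = Time.now.to_f)
  wait = checkstart + restart_worker_delay - now
  wait > 0 ? wait : 0.0 # never return a negative sleep time
end

# With a 3-second delay and a base timestamp of 100.0:
first  = remaining_restart_delay(100.0, 3, 101.0) # 1 second already elapsed
second = remaining_restart_delay(100.0, 3, 104.0) # delay already expired
```

Since every worker is measured against the same `checkstart`, time spent sleeping for the first worker counts toward the delay of the following ones, so the total wait does not grow with the number of workers.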
Thank you for the advice. I'll think about it, including the ideas you gave me. |
This poc-2 works well, except that the fix is bigger. And we need to think about co-existing with the existing |
Good. Let's implement the best idea! |
Thank you!
I fixed this: Now the compatibility with the existing option This poc-2 would be one possible fix to this issue. TODO
|
Indeed, this approach could solve the problem with a simpler fix. I tried this as poc-3. I think the direction is right, but somehow it's not working the way it's supposed to... |
I created a PR for ServerEngine based on poc-2. |
treasure-data/serverengine#120 Merged to ServerEngine. |
Is your feature request related to a problem? Please describe.
When a worker process dies for some reason, Fluentd always tries
to restart the worker as soon as possible.
This "always-restart-immediately" policy does not always work well.
For example, consider the following cases:
The worker process was killed because the host OS was almost
out of available memory at that moment.
The worker process was killed because the host OS tried to perform
an operation but Fluentd had locked the resource (e.g. file locks).
Describe the solution you'd like
Systemd provides the RestartSec option, which waits for a few seconds before restarting a service, to solve the same issue.
https://www.freedesktop.org/software/systemd/man/systemd.service.html#RestartSec=
It would be helpful if Fluentd's supervisor provided a similar option too.
Describe alternatives you've considered
NA
Additional context
The underlying ServerEngine has a feature called "delayed_start_worker".
The option described above can be implemented on top of it (we need to tweak
ServerEngine as well, though).
https://github.com/treasure-data/serverengine/blob/master/lib/serverengine/multi_worker_server.rb#L132-L144