Commit 4635f5e

zero-downtime restart: add initial document
fluent/fluentd#4624 Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
1 parent 47d7af5 commit 4635f5e

6 files changed

+124
-4
lines changed


SUMMARY.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -57,6 +57,7 @@
5757
* [Linux Capability](deployment/linux-capability.md)
5858
* [Command Line Option](deployment/command-line-option.md)
5959
* [Source Only Mode](deployment/source-only-mode.md)
60+
* [Zero-downtime restart](deployment/zero-downtime-restart.md)
6061
* [Container Deployment](container-deployment/README.md)
6162
* [Docker Image](container-deployment/install-by-docker.md)
6263
* [Docker Logging Driver](container-deployment/docker-logging-driver.md)

deployment/rpc.md

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -29,10 +29,16 @@ As evident from the output above, each endpoint returns a JSON object as its res
2929
| :--- | :---: | :---: |
3030
| `/api/processes.interruptWorkers` | [SIGINT](signals.md#sigint-or-sigterm) | Stops the daemon. |
3131
| `/api/processes.killWorkers` | [SIGTERM](signals.md#sigint-or-sigterm) | Stops the daemon. |
32+
| `/api/processes.zeroDowntimeRestart` | [SIGUSR2](signals.md#sigusr2) | Restarts Fluentd with zero downtime. |
3233
| `/api/processes.flushBuffersAndKillWorkers` | [SIGUSR1](signals.md#sigusr1) and [SIGTERM](signals.md#sigint-or-sigterm) | Flushes buffer and stops the daemon. |
3334
| `/api/plugins.flushBuffers` | [SIGUSR1](signals.md#sigusr1) | Flushes the buffered messages. |
34-
| `/api/config.gracefulReload` | [SIGUSR2](signals.md#sigusr2) | Reloads configuration. |
3535
| `/api/config.reload` | [SIGHUP](signals.md#sighup) | Reloads configuration. |
36+
| `/api/config.gracefulReload` | --- | Reloads configuration. |
37+
38+
Appendix:
39+
40+
* `/api/processes.zeroDowntimeRestart`: Supported since v1.18.0 on non-Windows platforms.
41+
* `/api/config.gracefulReload`: This replaces the behavior that `SIGUSR2` had before v1.18.0. Use `/api/processes.zeroDowntimeRestart` or `/api/config.reload` unless you have a specific reason to use graceful reload. See [SIGUSR2](signals.md#sigusr2) for details.
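
As a rough illustration of calling these endpoints from a script, the sketch below builds the request URL and parses the JSON response. It assumes that `rpc_endpoint` is set to `127.0.0.1:24444` in the `<system>` section; the helper names `rpc_uri` and `call_rpc` are hypothetical and not part of Fluentd.

```ruby
require "net/http"
require "uri"
require "json"

# Assumption: <system> contains "rpc_endpoint 127.0.0.1:24444".
DEFAULT_RPC_ENDPOINT = "127.0.0.1:24444"

# Builds the URL for an RPC API path such as
# "/api/processes.zeroDowntimeRestart" (hypothetical helper).
def rpc_uri(api_path, endpoint = DEFAULT_RPC_ENDPOINT)
  URI("http://#{endpoint}#{api_path}")
end

# Sends a GET request to the endpoint and returns the parsed JSON response
# (hypothetical helper). Requires a running Fluentd with RPC enabled.
def call_rpc(api_path, endpoint = DEFAULT_RPC_ENDPOINT)
  JSON.parse(Net::HTTP.get(rpc_uri(api_path, endpoint)))
end
```

With a running Fluentd, `call_rpc("/api/processes.zeroDowntimeRestart")` would request a zero-downtime restart, and `call_rpc("/api/config.reload")` would request the `SIGHUP`-style reload.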
3642

3743
If this article is incorrect or outdated, or omits critical information, please [let us know](https://github.com/fluent/fluentd-docs-gitbook/issues?state=open). [Fluentd](http://www.fluentd.org/) is an open-source project under [Cloud Native Computing Foundation \(CNCF\)](https://cncf.io/). All components are available under the Apache 2 License.
3844

deployment/signals.md

Lines changed: 43 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,15 +18,56 @@ Forces the buffered messages to be flushed and reopens Fluentd's log. Fluentd wi
1818

1919
### SIGUSR2
2020

21+
Since v1.18.0, `SIGUSR2` provides two features, zero-downtime restart and graceful reload, depending on the platform and on which process receives the signal.
22+
23+
Non-Windows:
24+
25+
| process | feature | version |
26+
| :--- | :--- | :--- |
27+
| Supervisor | Zero-downtime restart | v1.18.0 ~ |
28+
| Supervisor | Graceful reload (forwarded to all workers) | v1.9 ~ v1.17 |
29+
| Worker | Graceful reload | v1.9 ~ |
30+
31+
Windows:
32+
33+
| process | feature | version |
34+
| :--- | :--- | :--- |
35+
| Supervisor | Graceful reload (forwarded to all workers) | v1.9 ~ |
36+
| Worker | Graceful reload | v1.9 ~ |
37+
38+
#### Zero-downtime restart
39+
40+
This feature allows a complete restart of Fluentd without bringing down some input plugins, such as `in_udp` or `in_tcp`.
41+
42+
See [Zero-downtime restart](zero-downtime-restart.md) for details.
43+
44+
**Comparison with SIGHUP**
45+
46+
`SIGHUP` reloads the configuration by gracefully restarting the worker process.
47+
48+
This method does not cause socket downtime, so if there is no need to restart the supervisor, `SIGHUP` is a lighter zero-downtime restart method.
49+
50+
**Comparison with Graceful reload**
51+
52+
You can still use the graceful reload feature after v1.18.0 by sending `SIGUSR2` directly to the worker process or by using [RPC](rpc.md).
53+
54+
This allows you to reload without restarting the process, but there are some limitations.
55+
Please use zero-downtime restart or `SIGHUP` unless there is a special reason.
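
As a minimal sketch of triggering this from a script on a non-Windows host, the code below reads a PID from a file and sends `SIGUSR2` to that process. The helper names and the idea of a PID file for the supervisor are assumptions made for illustration; adapt them to how your Fluentd is launched.

```ruby
# Reads a process ID from a PID file (hypothetical helper; the path depends
# on how Fluentd is started in your environment).
def read_pid(pid_file)
  Integer(File.read(pid_file).strip)
end

# Sends SIGUSR2 to the given process. When the target is the supervisor of
# Fluentd v1.18.0 or later on non-Windows, this requests a zero-downtime
# restart; when the target is a worker, it requests a graceful reload.
def send_sigusr2(pid)
  Process.kill("USR2", pid)
end
```

For example, `send_sigusr2(read_pid("/path/to/fluentd.pid"))`, where the path is a placeholder.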
56+
57+
#### Graceful reload
58+
2159
Reloads the configuration file by gracefully re-constructing the data pipeline. Fluentd will try to flush the entire memory buffer at once, but will not retry if the flush fails. Fluentd will not flush the file buffer; the logs are persisted on the disk by default.
2260

23-
This signal has been supported since v1.9.0.
61+
Limitations:
62+
63+
* A change to System Configuration (`<system>`) is ignored.
64+
* Plugins must not rely on class variables, since the process is not restarted.
2465

2566
### SIGHUP
2667

2768
Reloads the configuration file by gracefully restarting the worker process. Fluentd will try to flush the entire memory buffer at once, but will not retry if the flush fails. Fluentd will not flush the file buffer; the logs are persisted on the disk by default.
2869

29-
If you use fluentd v1.9.0 or later, use `SIGUSR2` instead.
70+
This does not cause socket downtime, because the supervisor process keeps the sockets open, as long as they are provided as shared sockets by [server_helper](../plugin-helper-overview/api-plugin-helper-server.md).
3071

3172
### SIGCONT
3273

deployment/zero-downtime-restart.md

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,48 @@
1+
# Zero-downtime restart
2+
3+
This feature allows a complete restart of Fluentd without bringing down some input plugins.
4+
5+
Supported standard input plugins are as follows.
6+
7+
| supported input plugin | version |
8+
| :--- | :--- |
9+
| in_udp | v1.18.0 |
10+
| in_tcp | v1.18.0 |
11+
| in_syslog | v1.18.0 |
12+
13+
If these input plugins are down, client applications may fail to send data.
14+
If a client does not have a resend feature, that data will be lost.
15+
16+
You can use this feature to completely restart Fluentd without losing data for these plugins.
17+
18+
## How to use this feature
19+
20+
You can use this feature in the following ways.
21+
22+
* [Signals - SIGUSR2](signals.md#sigusr2)
23+
* [RPC](rpc.md)
24+
25+
## Mechanism of zero-downtime restart
26+
27+
![zero-downtime restart mechanism](../.gitbook/assets/fluentd-zero-downtime-restart-mechanism.png)
28+
29+
1. The old supervisor receives `SIGUSR2`.
30+
2. Spawn a new supervisor.
31+
3. Take over shared sockets.
32+
4. Launch new workers, and stop old processes in parallel.
33+
* Launch new workers with [Source Only Mode](source-only-mode.md).
34+
* In addition to source-only mode, further limit the plugins that start to only those that support this feature.
35+
* Data received by the new workers is stored in the temporary buffer of source-only mode.
36+
* Send `SIGTERM` to the old supervisor after a `10s` delay.
37+
5. The old supervisor stops and sends `SIGWINCH` to the new one.
38+
6. The new workers start to run fully.
39+
* The temporary buffer of source-only mode starts to load.
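
The socket takeover in step 3 rests on a general operating-system property: a listening socket stays open as long as at least one process still holds it. The sketch below illustrates only that property, using `fork` on a non-Windows system. It is not Fluentd's actual implementation, and `handover_demo` is a hypothetical name.

```ruby
require "socket"

# Illustrates handing a listening socket from an "old" process to a "new" one.
# Simplified: the real mechanism involves a new supervisor and shared sockets.
def handover_demo
  server = TCPServer.new("127.0.0.1", 0) # port 0: let the OS pick a free port
  port = server.addr[1]

  # The "new" process inherits the already-listening socket through fork.
  child = fork do
    conn = server.accept
    conn.write("handled by new process")
    conn.close
    exit!(0)
  end

  # The "old" process drops its copy; the socket stays open in the child.
  server.close

  # A client connecting now is still served, without any refused connection.
  client = TCPSocket.new("127.0.0.1", port)
  reply = client.read
  client.close
  Process.wait(child)
  reply
end
```

Calling `handover_demo` returns the reply written by the child process, showing that the port never stopped accepting connections during the handover.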
40+
41+
You can configure the temporary buffer.
42+
See [Source Only Mode](source-only-mode.md) for details.
43+
44+
## Plugins: how to support this feature
45+
46+
See [How to Write Input Plugin - zero_downtime_restart_ready?](../plugin-development/api-plugin-input.md#zero_downtime_restart_ready).
47+
48+
If this article is incorrect or outdated, or omits critical information, please [let us know](https://github.com/fluent/fluentd-docs-gitbook/issues?state=open). [Fluentd](http://www.fluentd.org/) is an open-source project under [Cloud Native Computing Foundation \(CNCF\)](https://cncf.io/). All components are available under the Apache 2 License.

plugin-development/api-plugin-input.md

Lines changed: 25 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -102,7 +102,31 @@ router.emit(tag, time, {:foo => 'bar'})
102102

103103
## Methods
104104

105-
There are no specific methods for the Input plugins.
105+
### zero_downtime_restart_ready?
106+
107+
To support [Zero-downtime restart](../deployment/zero-downtime-restart.md), you can override this method to return `true`.
108+
109+
```ruby
# Declare that this plugin supports zero-downtime restart.
def zero_downtime_restart_ready?
  true
end
```
114+
115+
To do this, the following condition must be met:
116+
117+
* This plugin can run in parallel with another Fluentd.
118+
119+
This is because there is a period when the old process and the new process run in parallel during a zero-downtime restart.
120+
121+
After addressing the following considerations and ensuring there are no issues, override this method.
122+
Then, the plugin will not experience downtime with zero-downtime restart.
123+
124+
* Handling Files
125+
* When handling files, there is a possibility of conflict.
126+
* In general, input plugins that handle files should not support Zero-downtime restart.
127+
* Handling Sockets
128+
* A socket provided as a shared socket by [server plugin helper](../plugin-helper-overview/api-plugin-helper-server.md) is shared between the old and new processes. So, such a plugin can support Zero-downtime restart.
129+
* When handling sockets on your own, be careful to avoid conflicts.
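
To make the opt-in behavior concrete, the sketch below models how a caller could select only the plugins that return `true` from this method. All classes here are stand-ins written for illustration, not Fluentd's real classes, and the default of `false` in the base class is an assumption.

```ruby
# Stand-in for an input plugin base class (NOT Fluentd's real class).
class SketchInput
  attr_reader :name

  def initialize(name)
    @name = name
  end

  # Assumed default: a plugin does not support the feature unless it opts in.
  def zero_downtime_restart_ready?
    false
  end
end

# A socket-based plugin that can run in parallel with another process.
class SketchSocketInput < SketchInput
  def zero_downtime_restart_ready?
    true
  end
end

# A file-based plugin keeps the default, since two processes could conflict.
class SketchFileInput < SketchInput
end

# Hypothetical selection step: keep only plugins that opted in.
def plugins_ready_for_restart(inputs)
  inputs.select(&:zero_downtime_restart_ready?)
end
```

Given one `SketchSocketInput` and one `SketchFileInput`, `plugins_ready_for_restart` keeps only the socket-based one, which mirrors the advice above that file-handling plugins should generally not opt in.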
106130

107131
## Writing Tests
108132
