Description
We need to refactor the shutdown sequence involving the input plugins and establish a clear shutdown semantic.
Current Design
Currently the base plugin class exposes the following methods related to shutdown:
shutdown
- only the lumberjack input plugin (of all input/filter/output/codec) call this method
- never overridden
- does nothing
finished
- called from many plugins
- never overridden
- does nothing
teardown
- called from many plugins
- overridden by many plugins for the plugin shutdown bookkeeping
finished?
- only the rackspace input plugin uses it
running?
- only the sqs input uses it
terminating?
- a hand full of plugins uses it
Current input plugin shutdown sequence description:
There are basically 3 situations:
1- all input plugins have exited normally, they have somehow reached the end of their input:
- the plugin will exit its
run
method, the plugin thread will exit - the pipeline thread will call the plugin
teardown
method
2- a SIGINT
or SIGTERM
is received and the pipeline needs to terminate the input plugins and the pipeline:
- a) SIG* signal is trapped in the pipeline thread
- b)
Pipeline#shutdown
is called from the signal handler:-
all input plugins threads are injected an exception + wakeup like this:
thread.raise(LogStash::ShutdownSignal)
thread.wakeup
-
all input plugins
teardown
methods are called in sequence -
(b) normally the plugins exit their
run
method and threads at which point the pipeline thread will again call the inputteardown
methods
-
3- an uncatched exception in the input plugin
- the plugin
teardown
method is called from the pipeline thread - the plugin is “restarted” by calling the plugin
run
method again in abegin/retry
Problems
- current semantic is not clear and shutdown handling is inconsistent throughout the plugins
- injecting a rogue exception by doing a raise in the live thread is a very bad idea and can lead to many weird problems. see this post discussing the related problems
- per the previous reason, a
LogStash::ShutdownSignal
exception can happen virtually anywhere, event in theteardown
method teardown
is called twice on the input plugins in aSIGINT
/SIGTERM
situation- thread safety issues are not clearly defined/understood in the current design
Suggested Refactor
Here's my suggestion for a first iteration cleanup. Once cleaned up, we will be able to more easily build upon it and start thinking about decoupling further the plugin execution/lifecycle from the pipeline in further iterations.
We basically need 2 things for the input plugins:
1- signal the plugin it need to stop
2- ask the plugin to do its bookkeeping once it is stopped
For this I suggest that the plugins exposes only 3 methods: stop
, stop?
and close
.
stop
is called from outside the plugin thread (from the pipeline thread) and is to ask the plugin to stop. this method must be thread-safe bacause it will be called from outside the plugin thread.stop?
returns true if the stop method was called and can be used within the plugin to verify if a stop was requested.close
is called once when the plugin is stopped to perform any final bookkeeping.
The way to know a plugin is stopped is simply by waiting the return of the plugin run
method. When the run method exits and the plugin thread exits, this is only when the close
method should be called, once.
The same semantic can be applied to the filter and output plugins which would support 2 shutdown modes:
1- the current mode of inserting a SHUTDOWN
event in the queue so that all inflight events will be processed before the filter and output plugins are closed
2- a new mode using the stop
method for an immediate shutdown in which case the inflight events will be lost when not using queue persistence, in other words, with queue persistence, we will be able to do an immediate shutdown without loosing inflight events.