File Settings Service #88329
Conversation
Implement operator handlers for cluster state and ILM
Hi @grcevski, I've created a changelog YAML for you.
@elasticsearchmachine run elasticsearch-ci/part-2
Also, make sure we check error state version on all error paths, before we update the error state in the cluster state.
Pinging @elastic/es-core-infra (Team:Core/Infra)
Looks pretty good! I have several comments, mostly nits
|| threadName.contains("JVMCI-native") // GraalVM Compiler Thread
|| threadName.contains("readiness-service")
|| threadName.contains("file-settings-watcher")
|| threadName.contains("FileSystemWatchService")) {
We have many other threadpools, but we've never needed to modify this test for each of them. Can we utilize the existing ThreadPools class to manage these threads, or do something so that we don't need to modify this for every system thread that is added?
I agree that this is tedious to modify each time. We need to explicitly add each thread that's not in any threadpool. When I worked on the readiness-service, I was advised not to use the thread pools for threads that will constantly be running, so I followed the same approach here. FileSystemWatchService comes from the JDK, so there's no avoiding adding that one.
Is it OK if I follow up with a new PR to tackle this problem, rather than extend this PR further? I think I should be able to ensure all standalone Elasticsearch threads are registered in a way that this list doesn't have to be manually modified.
A followup is great
@@ -1019,6 +1036,11 @@ protected Node(
            modules.add(b -> b.bind(ReadinessService.class).toInstance(new ReadinessService(clusterService, environment)));
        }

        modules.add(
            b -> b.bind(FileSettingsService.class)
                .toInstance(new FileSettingsService(clusterService, actionModule.getReservedClusterStateService(), environment))
This can be created much higher up and then added to the existing large bind. I know it's ugly. I don't feel strongly, since this method is an enormous mess, but I think it would be nice to follow the existing pattern for now, rather than creating another one (where we start calling modules.add all over again).
// package private for testing
Path operatorSettingsDir() {
    String dirPath = OPERATOR_DIRECTORY.get(environment.settings());
    return environment.configFile().toAbsolutePath().resolve(dirPath);
Why not extract this into a member variable on construction? Then the environment member is not needed.
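A minimal sketch of that suggestion, with hypothetical class and field names: the directory is resolved once in the constructor, so no Environment field has to be kept around afterwards.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch only: resolve the operator settings directory at
// construction time, so the service does not retain the Environment.
// "operator" stands in for the configured directory name.
public class FileSettingsServiceSketch {
    static final String OPERATOR_DIRECTORY = "operator";

    private final Path operatorSettingsDir;

    public FileSettingsServiceSketch(Path configDir) {
        // Resolved once here instead of on every operatorSettingsDir() call.
        this.operatorSettingsDir = configDir.toAbsolutePath().resolve(OPERATOR_DIRECTORY);
    }

    Path operatorSettingsDir() {
        return operatorSettingsDir;
    }
}
```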
Path settingsDir = operatorSettingsDir();

/**
trivial: not a javadoc, so no need for the extra *
}
enableSettingsWatcher(settingsDir);
} else {
    logger.info("operator settings directory [{}] not found, will watch for its creation...", settingsDir);
nit: This seems more like a debug level message?
}
});
} catch (Exception e) {
    logger.error("Error processing operator settings json file", e);
process doesn't throw anything checked, so why is this guarded? There is already a catch outside of this method that should handle logging the exception.
if (logger.isDebugEnabled()) {
    logger.debug("encountered exception watching", e);
}
logger.info("shutting down watcher thread");
This would only happen when an exception occurs, so is this useful as an info message? Why not log the exception at error level, stating the watcher thread is shutting down? Isn't this always an exception case (not how the thread would be shutdown)?
I didn't want to pollute the logs with an exception trace when the watcher is interrupted on the take() call. That's how we interrupt the watcher when we stop watching, e.g. when we are no longer the master node. So what I'll do here is log at info when ClosedWatchServiceException is thrown and log at error on any other exception.
We could have a separate catch for InterruptedException so that any unknown errors are logged at the appropriate level?
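A sketch of the split being discussed, with hypothetical names: the expected shutdown paths (watch service closed, watcher thread interrupted) map to info, while anything unknown maps to error.

```java
import java.nio.file.ClosedWatchServiceException;

// Illustrative sketch only: classify exceptions from the watcher loop so
// the expected shutdown paths log quietly and unknown failures log loudly.
public class WatcherErrors {
    static String logLevelFor(Throwable t) {
        if (t instanceof ClosedWatchServiceException || t instanceof InterruptedException) {
            // Normal shutdown, e.g. the node is no longer the master.
            return "INFO";
        }
        // Unexpected failure in the watcher thread.
        return "ERROR";
    }
}
```

The watcher loop would then catch these cases in separate catch blocks and log at the level this mapping suggests.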
watcherThreadLatch.await();
    }
} catch (IOException | InterruptedException e) {
    logger.info("encountered exception while closing watch service", e);
why is this info instead of error or warn?
I'll switch to warn.
null,
mock(ClusterService.class),
Collections.emptyList()
trivial: all these Collections.emptyList() can be List.of()
@@ -76,4 +80,43 @@ public PluginsAndModules info() {
        List.of()
    );
}

@Override
public <T> List<? extends T> loadServiceProviders(Class<T> service) {
Can this be moved out to a separate PR, with a unit test?
Sounds good, I made a new PR here #88690. Once that's merged I'll rebase and remove this change.
I think I addressed all of the feedback, and I added an integration test to ensure it's all wired up together and working.
|| threadName.contains("JVMCI-native") // GraalVM Compiler Thread
|| threadName.contains("readiness-service")
|| threadName.contains("file-settings-watcher")) {
I will tackle cleaning up this manual registration in a separate PR.
A few more comments. LGTM, no need for any more review if you agree with them.
if (watcherThreadLatch != null) {
    watcherThreadLatch.await();
}
} catch (IOException | InterruptedException e) {
Stop happens on another ES thread, not one controlled here. So the interrupt should be restored for InterruptedException. (perhaps capture InterruptedException in another catch block). Note I think it is ok to not restore above since the thread is owned by this service and the thread will end after the interrupted exception is logged there.
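A sketch of restoring the interrupt status in the stop path, under the assumption described above that stop() runs on an ES-owned thread; names here are illustrative.

```java
import java.util.concurrent.CountDownLatch;

// Illustrative sketch only: when stop() runs on a thread this service does
// not own, an InterruptedException must restore the interrupt flag so the
// owning thread can observe it, rather than being swallowed.
public class StopSketch {
    static void awaitShutdown(CountDownLatch watcherThreadLatch) {
        try {
            watcherThreadLatch.await();
        } catch (InterruptedException e) {
            // Restore the flag for the caller; do not propagate or swallow.
            Thread.currentThread().interrupt();
        }
    }
}
```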
processFileSettings(path, false);
    }
} catch (Exception e) {
    logger.warn("error processing operator settings file", e);
How is this possible, given processFileSettings logs any errors? It seems like the only possible error here is IOException. Could this be narrowed to that exception, and the warning message be distinguished from the one on line 307? They are very similarly worded, yet this one, I don't think, is specific to processing the file; it's just some IO problem that happened while trying to read it.
* Error subclass that is thrown when we encounter a fatal error while applying
* the operator cluster state at Elasticsearch boot time.
*/
public static class OperatorConfigurationError extends Error {
We shouldn't be extending Error (and should be renamed to avoid configuration, perhaps mentioning this is a startup problem in the name). This can extend RuntimeException so it can be propagated easily.
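A minimal sketch of the suggested change; the class name here is hypothetical, chosen to reflect that this is a startup problem rather than a configuration one.

```java
// Illustrative sketch only: an unchecked exception (not an Error), so it
// can be propagated easily while still aborting startup when uncaught.
public class FileSettingsStartupException extends RuntimeException {
    public FileSettingsStartupException(String message, Throwable cause) {
        super(message, cause);
    }
}
```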
if (e != null) {
    // If we encountered an exception trying to apply the operator state at
    // startup time, we throw an error to force Elasticsearch to exit.
    if (onStartup) {
The logic here is a little difficult to reason about since it depends on this boolean (but the error here is then not possible outside of startup). Perhaps a more direct approach would be to pass a Consumer so the two callers can decide how to handle it.
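A sketch of the Consumer-based approach, with hypothetical shapes: each call site supplies its own error policy, so processFileSettings no longer branches on an onStartup flag.

```java
import java.util.function.Consumer;

// Illustrative sketch only: the two callers decide how errors are handled,
// instead of the method branching on a boolean.
public class ErrorPolicySketch {
    static void processFileSettings(Runnable parse, Consumer<Exception> onError) {
        try {
            parse.run();
        } catch (Exception e) {
            // Caller-supplied policy: rethrow as fatal at startup, log later.
            onError.accept(e);
        }
    }
}
```

The startup caller would pass a consumer that rethrows, forcing the node to exit; the watcher caller would pass one that only logs.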
private volatile boolean active = false;
private volatile boolean initialState = true;

public static final Setting<String> OPERATOR_DIRECTORY = Setting.simpleString(
Why is this configurable? I think we could start with it hardcoded? I much prefer it is always the same, so that we know it will always be directly under the config dir.
Sure thing, I'll fix it.
* - any changes to files inside the operator directory if it exists, filtering for settings.json
*/
try {
    this.watchService = PathUtils.getDefaultFileSystem().newWatchService();
Rather than grab the default filesystem, it is probably better to grab it from the config dir?
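A sketch of that suggestion: a Path carries a reference to its own FileSystem, so the WatchService can be created from the config directory's filesystem rather than the default one (helper name is illustrative).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.WatchService;

// Illustrative sketch only: create the WatchService from the config
// directory's own filesystem, so a non-default Path provider still works.
public class WatchServiceSketch {
    static WatchService newWatchServiceFor(Path configDir) throws IOException {
        return configDir.getFileSystem().newWatchService();
    }
}
```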
Watches a directory under the Elasticsearch config directory for changes to a file called settings.json. This is a JSON file describing the settings that the operator wants to set, which will be immutable through the REST API; any changes to these settings are only allowed by modifying the file. By default we watch the path operator/settings.json under the config directory. The name of the operator directory can be changed via settings.
The File Settings Service will notice updates to the file and apply them. The operator directory doesn't have to exist; the file settings service will notice when it's created. The recommended way to change operator file-based settings is to modify the settings.json file in a brand-new location and then symlink it under the Elasticsearch config directory.
Relates to #86224