Skip to content

File Settings Service #88329

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 39 commits into from
Jul 26, 2022
Merged

File Settings Service #88329

merged 39 commits into from
Jul 26, 2022

Conversation

grcevski
Copy link
Contributor

@grcevski grcevski commented Jul 6, 2022

Watches a directory under Elasticsearch config for changes to a file called settings.json. This is a JSON file describing the settings that the operator wants to set that will be immutable through the REST API. Any changes to these settings will only be allowed through modifying the file. As configured by default we watch for a path operator/settings.json under the config directory. The name of the directory operator can be changed via settings.

The File Settings Service will notice updates to the file and perform an update. The operator directory doesn't have to exist, the file setting service will notice when it's created. Recommended way to change operator file based settings is through modifying the settings.json file in a brand new location and then performing a symlink under the Elasticsearch config directory.

Relates to #86224

@grcevski grcevski added >feature :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team v8.4.0 labels Jul 6, 2022
@elasticsearchmachine
Copy link
Collaborator

Hi @grcevski, I've created a changelog YAML for you.

@grcevski grcevski marked this pull request as ready for review July 20, 2022 17:05
@grcevski
Copy link
Contributor Author

@elasticsearchmachine run elasticsearch-ci/part-2

Also, make sure we check error state version
on all error paths, before we update the
error state in the cluster state.
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good! I have several comments, mostly nits

|| threadName.contains("JVMCI-native") // GraalVM Compiler Thread
|| threadName.contains("readiness-service")) {
|| threadName.contains("file-settings-watcher")
|| threadName.contains("FileSystemWatchService")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have many other threadpools, but we've never needed to modify this test for each of them. Can we utilize the existing ThreadPools class to manage these threads, or do something so that we don't need to modify this for every system thread that is added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that's this is tedious to modify each time. We need to explicitly add each thread that's not in any threadpool. When I worked on the readiness-service, I was advised not to use the thread pools for threads that will constantly be running, so I followed the same approach here. FileSystemWatchService comes from the JDK, so there's no avoiding adding that one.

Is it OK if I followed up with a new PR to tackle this problem, rather than extend this PR further? I think I should be able to ensure all standalone Elasticsearch threads are registered in a way that this list doesn't have to be manually modified.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A followup is great

@@ -1019,6 +1036,11 @@ protected Node(
modules.add(b -> b.bind(ReadinessService.class).toInstance(new ReadinessService(clusterService, environment)));
}

modules.add(
b -> b.bind(FileSettingsService.class)
.toInstance(new FileSettingsService(clusterService, actionModule.getReservedClusterStateService(), environment))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be created much higher up and then added to the existing large bind. I know it's ugly. I don't feel strongly, since this method is an enormous mess, but I think it would be nice to follow the existing pattern for now, rather than creating another one (where we start calling modules.add all over again).

// package private for testing
Path operatorSettingsDir() {
String dirPath = OPERATOR_DIRECTORY.get(environment.settings());
return environment.configFile().toAbsolutePath().resolve(dirPath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not extract this into a member variable on construction? Then the environment member is not needed.


Path settingsDir = operatorSettingsDir();

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trivial: not a javadoc, so no need for the extra *

}
enableSettingsWatcher(settingsDir);
} else {
logger.info("operator settings directory [{}] not found, will watch for its creation...", settingsDir);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This seems more like a debug level message?

}
});
} catch (Exception e) {
logger.error("Error processing operator settings json file", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

process doesn't throw anything checked. So why is this guarded? There is already a catch outside of this method that should handled logging the exception?

if (logger.isDebugEnabled()) {
logger.debug("encountered exception watching", e);
}
logger.info("shutting down watcher thread");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would only happen when an exception occurs, so is this useful as an info message? Why not log the exception at error level, stating the watcher thread is shutting down? Isn't this always an exception case (not how the thread would be shutdown)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to pollute the logs with an exception trace when the watching is interrupted on the take() method call. It's how we interrupt the watcher when we stop watching, e.g. we are no longer the master node. So what I'll do here is log info when ClosedWatchServiceException is thrown and log error on any other exception.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have a separate catch for InterruptedException so that any unknown errors are logged at the appropriate level?

watcherThreadLatch.await();
}
} catch (IOException | InterruptedException e) {
logger.info("encountered exception while closing watch service", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this info instead of error or warn?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll switch to warn.

null
null,
mock(ClusterService.class),
Collections.emptyList()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trivial: all these Collections.emptyList() can be List.of()

@@ -76,4 +80,43 @@ public PluginsAndModules info() {
List.of()
);
}

@Override
public <T> List<? extends T> loadServiceProviders(Class<T> service) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be moved out to a separate PR, with a unit test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I made a new PR here #88690. Once that's merged I'll rebase and remove this change.

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:05
@grcevski
Copy link
Contributor Author

I think I addressed all of the feedback and I added an integration test to ensure it all is wired up together and it's working.

|| threadName.contains("JVMCI-native") // GraalVM Compiler Thread
|| threadName.contains("readiness-service")) {
|| threadName.contains("file-settings-watcher")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will tackle cleaning up this manual registration in a separate PR.

@grcevski grcevski requested review from rjernst and ChrisHegarty July 23, 2022 02:07
Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments. LGTM, no need for anymore review if you agree with them.

if (watcherThreadLatch != null) {
watcherThreadLatch.await();
}
} catch (IOException | InterruptedException e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stop happens on another ES thread, not one controlled here. So the interrupt should be restored for InterruptedException. (perhaps capture InterruptedException in another catch block). Note I think it is ok to not restore above since the thread is owned by this service and the thread will end after the interrupted exception is logged there.

processFileSettings(path, false);
}
} catch (Exception e) {
logger.warn("error processing operator settings file", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this possible given processFileSettings logs any errors? It seems like the only possible error here is IOException? Could this be narrowed to that exception, and the warning message be distinguished from that on line 307? They are very similarly worded, yet this one, I don't think is specific to processing the file, it's just some IO problem happened while trying read it.

* Error subclass that is thrown when we encounter a fatal error while applying
* the operator cluster state at Elasticsearch boot time.
*/
public static class OperatorConfigurationError extends Error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't be extending Error (and should be renamed to avoid configuration, perhaps mentioning this is a startup problem in the name). This can extend RuntimeException so it can be propagated easily.

if (e != null) {
// If we encountered an exception trying to apply the operator state at
// startup time, we throw an error to force Elasticsearch to exit.
if (onStartup) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here is a little difficult to reason about since it depends on this boolean (but the error here is then not possible outside of startup). Perhaps a more direct approach would be to pass a Consumer so the two callers can decide how to handle it.

private volatile boolean active = false;
private volatile boolean initialState = true;

public static final Setting<String> OPERATOR_DIRECTORY = Setting.simpleString(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this configurable? I think we could start with it hardcoded? I much prefer it is always the same, so that we know it will always be directly under the config dir.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, I'll fix it.

* - any changes to files inside the operator directory if it exists, filtering for settings.json
*/
try {
this.watchService = PathUtils.getDefaultFileSystem().newWatchService();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than grab the default filesystem, it is probably better to grab it from the config dir?

@grcevski grcevski merged commit f8a8df4 into elastic:main Jul 26, 2022
@grcevski grcevski deleted the operator/file_service branch July 26, 2022 02:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Core Core issues without another label >feature Team:Core/Infra Meta label for core/infra team v8.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants