Reserved cluster state service #88527

grcevski · 2022-07-13T21:35:57Z

This PR adds the reserved cluster state controller, which part of the File based settings PR (#86224).

This is a reusable component that has number of Reserved State Handler components that know how to update parts of the Elasticsearch state. This component is only used by the File Settings Service at the moment, but the intent is for modules and plugins to be able to call it to save settings and entities in the cluster state that cannot be changed by the end user. For example work on using this component to update state from plugins/modules, please see #88341

Relates to #86224.

elasticmachine · 2022-07-13T21:35:59Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

grcevski · 2022-07-13T22:33:42Z

server/src/main/java/org/elasticsearch/ExceptionsHelper.java

+     * @param errorListener a listener function that will be called with every unwrapped throwable we find
+     * @param limit after how many encountered throwables should we stop unwrapping. Prevents stack overflows. 10 is reasonable max.
+     */
+    public static void unwrap(Throwable t, Consumer<Throwable> errorListener, int limit) {


I added this helper because I couldn't find a suitable one here that would return each throwable message. When we parse JSON, the messages of errors are reported on each throwable as it goes down in subsections of the JSON body.

rjernst

Looks pretty good. I left some thoughts.

rjernst · 2022-07-13T21:57:09Z

server/src/main/java/org/elasticsearch/ExceptionsHelper.java

+     * @param errorListener a listener function that will be called with every unwrapped throwable we find
+     * @param limit after how many encountered throwables should we stop unwrapping. Prevents stack overflows. 10 is reasonable max.
+     */
+    public static void unwrap(Throwable t, Consumer<Throwable> errorListener, int limit) {


There are already a lot of random unwrap/tostring type helpers here. Could we avoid adding another?

This seems to be a helper for iterating through exception causes. Yet the calls I see later in this PR only print the message of the causes, not the stack trace. If we are going to preserve the exception chain, then preserving the entire stacktrace seems better, especially since the exception will only be consumed in exceptional cases by cloud (misconfiguration), or through the Elasticsearch log (plugin coding error for packaged entities). If that is the case, we already have a method for printing the stack trace of a Throwable to a String in ExceptionsHelper.

rjernst · 2022-07-13T22:30:23Z

server/src/main/java/org/elasticsearch/reservedstate/service/HandlerDependencyManager.java

+     * @param handlerNames Names of handlers found in the {@link ReservedStateChunk}
+     * @return
+     */
+    static LinkedHashSet<String> orderedStateHandlers(Map<String, ReservedClusterStateHandler<?>> handlers, Set<String> handlerNames) {


This seems to be the only method exposed on this class, and it is static. Can this method be moved to the ReservedClusterStateController, which looks like the only caller?

Sure, I split it off so that I can reduce the size of the controller file. I can bring it back.

rjernst · 2022-07-13T22:33:20Z

...er/src/main/java/org/elasticsearch/reservedstate/service/ReservedClusterStateController.java

+ * <p>
+ * This class contains the logic about validation, ordering and applying of
+ * the cluster state specified in a file or through plugins/modules. Reserved cluster state
+ * cannot be modified throught the REST APIs, only through this controller class.


typo: throught -> through

rjernst · 2022-07-13T22:40:05Z

server/src/main/java/org/elasticsearch/reservedstate/service/ReservedStateChunk.java

+import java.util.Map;
+
+/**
+ * Tuple class containing the cluster state to be saved and reserved and the version info


Not a tuple? Maybe say "A holder for ..."?

rjernst · 2022-07-13T22:44:18Z

...c/test/java/org/elasticsearch/reservedstate/service/ReservedClusterStateControllerTests.java

+    }
+
+    public void testHandlerOrdering() {
+        ReservedClusterStateHandler<Map<String, Object>> oh1 = new ReservedClusterStateHandler<>() {


Maybe create a helper method to dynamically create these (taking name + varargs of deps) so the ordering tests can be much smaller?

rjernst · 2022-07-13T22:56:59Z

...er/src/main/java/org/elasticsearch/reservedstate/service/ReservedStateErrorTaskExecutor.java

+
+    @Override
+    public void clusterStatePublished(ClusterState newClusterState) {
+        logger.info("Wrote new error state in immutable metadata");


How do you anticipate this info message being useful in debugging? (also, immutable -> reserved cluster state?)

I think it's useful to know that we've recorded the error state in the cluster state. Updating the state can fail here, maybe we are not the master anymore, or ES terminated before we got to write the error state. By seeing the message in the logs, we definitely know we wrote what's wrong with the file settings. Thanks for catching the wording bug.

My thoughts are centered on:

it's an info level message. If it's important, and it's related to an error, shouldn't it be at least warn if not error level?

It doesn't have any supporting information. The message alone doesn't convey anything useful, other than "something happened". If a user is looking through their logs and sees this, is that useful to them? Can they do anything about it?

Yeah that's a valid point, it's not useful to the end user as is the confirmation of new cluster state being set. I'll change this to a debug message instead, so we can use it while working on the feature.

rjernst · 2022-07-13T23:06:40Z

...er/src/main/java/org/elasticsearch/reservedstate/service/ReservedClusterStateController.java

+ * the cluster state specified in a file or through plugins/modules. Reserved cluster state
+ * cannot be modified throught the REST APIs, only through this controller class.
+ */
+public class ReservedClusterStateController {


nit: we've typically used the "Service" suffix for other similar "things" in a node, any reason not to use that here in place of Controller?

rjernst · 2022-07-13T23:47:02Z

server/src/main/java/org/elasticsearch/reservedstate/service/ReservedStateUpdateTask.java

+            try {
+                Set<String> existingKeys = keysForHandler(existingMetadata, handlerName);
+                TransformState transformState = handler.transform(reservedState.get(handlerName), new TransformState(state, existingKeys));
+                state = transformState.state();


Each handler is create it's own ClusterState. Can we do this more efficiently? Building cluster state is not cheap. Could we instead create a ClusterState.Builder here (or perhaps just the customs and metadata builders), and that is passed to the transform method? I think the TransformState would not be needed, the method could just return the new set of keys.

I can't fully use ClusterState.Builder without more significant refactor of how the TransportActions we are wrapping for this work. All of the transport actions we use for the reserved state handlers take cluster state in and return a new state back. However, I really like the idea of using the builder instead of the TransformState. This way we can remove that TransformState class, we can start using the builder and then we can slowly improve the transport actions to maybe work with a builder in the future, to improve efficiency.

I think I said that too quickly, I think using the builder at the moment without refactoring all of the transport actions will be more overhead. I can out transform state and use Tuple, but then I sprinkle those v1(), v2() everywhere instead of the named variables.

I'm ok with waiting on this optimization, but I do think it is something we should look into. Creating cluster state is relatively expensive.

I'll follow-up with a PR to optimize this after I check that it's OK to refactor the Transport node actions to use a builder.

rjernst · 2022-07-13T23:52:46Z

server/src/main/java/org/elasticsearch/reservedstate/service/ReservedStateUpdateTask.java

+
+                throw new ReservedStateVersion.IncompatibleVersionException(
+                    format(
+                        "Not updating error state because version [%s] is less or equal to the last state error version [%s]",


Why is this an error? Isn't this part of the idempotent nature of the reserved state design?

I added this because of what might potentially happen when plugins/modules are using the controller. If there are bugs in how they call the controller and they repeatedly send the same erroneous content to be reserved, we'll be spamming the cluster with error cluster state updates. We simply check if the version metadata of the new erroneous content has previously been seen, and if we have seen it, we log an error in the logs, but don't send cluster state update.

What about if the same operator settings file has an update triggered, but for the same version? Wouldn't this cause an internal error, when it should be a no-op?

Separate from this PR, I think we should consider how plugins can use this functionality without directly calling the reserved state service. As you point out, it can be error prone, so perhaps we can redesign the pluggable portion to declaratively specify their resources, and the server is actually what does the registration.

Fair point, I throw a specific exception here that would be caught and handled differently by the caller of the service, but it's not ideal, it should be a no-op and a log message only. I'll fix this.

Separate from this PR, I think we should consider how plugins can use this functionality without directly calling the reserved state service. As you point out, it can be error prone, so perhaps we can redesign the pluggable portion to declaratively specify their resources, and the server is actually what does the registration.

I like this idea, perhaps though SPI we can ask all plugins if they want to declare any state chunks that need to be applied and we call the update service internally. This way no plugins will have access to the update service and we'll apply it using our rules.

rjernst · 2022-07-13T23:55:59Z

server/src/main/java/org/elasticsearch/reservedstate/service/ReservedStateVersion.java

+        PARSER.declareString(ConstructingObjectParser.constructorArg(), COMPATIBILITY);
+    }
+
+    public static ReservedStateVersion parse(XContentParser parser, Void v) {


Could the implementation specific Void here could be hidden from the signature? Remove from the method and pass null to PARSER?

grcevski · 2022-07-14T21:26:45Z

Thanks again for the review @rjernst, I removed that extra exception and changed such that we return the original cluster state reference to avoid unnecessary update.

rjernst

LGTM

rjernst · 2022-07-14T22:00:13Z

server/src/main/java/org/elasticsearch/reservedstate/service/ReservedClusterStateService.java

-                    )
+            logger.warn(
+                () -> format(
+                    "Cluster state version [%s] is not compatible with this Elasticsearch node",


Perhaps this should mention the reserved state name?

grcevski · 2022-07-15T00:31:35Z

@elasticmachine update branch

* upstream/master: (2974 commits) Reserved cluster state service (elastic#88527) Add transport action immutable state checks (elastic#88491) Remove suggest flag from index stats docs (elastic#85479) Polling cluster formation state for master-is-stable health indicator (elastic#88397) Add test execution guide in yamlRestTest asciidoc (elastic#88490) Add troubleshooting guide for corrupt repository (elastic#88391) [Transform] Finetune Schedule to be less noisy on retry and retry slower (elastic#88531) Updatable API keys - auto-update legacy RDs (elastic#88514) Fix typo in TransportForceMergeAction and TransportClearIndicesCacheA… (elastic#88064) Fixed NullPointerException on bulk request (elastic#88358) Avoid needless index metadata builders during reroute (elastic#88506) Set metadata on request in API key noop test (elastic#88507) Fix passing positional args to ES in Docker (elastic#88502) Improve description for task api detailed param (elastic#88493) Support cartesian shape with doc values (elastic#88487) Promote usage of Subjects in Authentication class (elastic#88494) Add CCx 2.0 feature flag (elastic#88451) Reword the watcher 'always' and 'never' condition docs (elastic#86105) Simplify azure discovery installation docs (elastic#88404) Breakup FIPS CI testing jobs ... # Conflicts: # server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java # x-pack/plugin/mapper-aggregate-metric/src/main/java/org/elasticsearch/xpack/aggregatemetric/mapper/AggregateDoubleMetricFieldMapper.java

Nikola Grcevski added 3 commits July 13, 2022 15:04

Immutable cluster state controller

f167049

Merge master

5c9de34

Refactor with smaller files

b6adf1d

grcevski added >non-issue :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team v8.4.0 labels Jul 13, 2022

grcevski commented Jul 13, 2022

View reviewed changes

rjernst reviewed Jul 13, 2022

View reviewed changes

Apply review feedback

74dac89

grcevski requested a review from rjernst July 14, 2022 18:13

Apply more review feedback

a568668

rjernst approved these changes Jul 14, 2022

View reviewed changes

grcevski changed the title ~~Reserved cluster state Controller~~ Reserved cluster state service Jul 14, 2022

Add namespace to log messages

439c42a

Merge branch 'master' into operator/just_controller

cb48f58

grcevski merged commit 62e8ca9 into elastic:master Jul 15, 2022

Reserved cluster state service #88527

Reserved cluster state service #88527

Uh oh!

Conversation

grcevski commented Jul 13, 2022

Uh oh!

elasticmachine commented Jul 13, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rjernst left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

grcevski Jul 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

grcevski Jul 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

grcevski commented Jul 14, 2022

Uh oh!

rjernst left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

grcevski commented Jul 15, 2022

Uh oh!

Uh oh!

grcevski Jul 14, 2022 •

edited

Loading

grcevski Jul 14, 2022 •

edited

Loading