[ML] add new snapshot upgrader API for upgrading older snapshots #64665
Conversation
This new API provides a way for users to upgrade their own anomaly job model snapshots. To upgrade a snapshot the following is done:
- Open a native process given the job id and the desired snapshot id
- load the snapshot to the process
- write the snapshot again from the native task (now updated via the native process)

closes elastic#64154
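For orientation, this is roughly how the endpoint is invoked (a sketch using the low-level REST client; the job and snapshot ids are the example values from this thread, and `wait_for_completion` behaviour is discussed in the review below):

```java
import java.io.IOException;

import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class UpgradeSnapshotExample {
    // Sketch: upgrade one old model snapshot and block until the upgrade finishes.
    public static Response upgradeSnapshot(RestClient restClient) throws IOException {
        Request request = new Request("POST",
            "/_ml/anomaly_detectors/largish-kibana-sample-data/model_snapshots/1541587919/_upgrade");
        // Without wait_for_completion=true the call returns as soon as the task starts.
        request.addParameter("wait_for_completion", "true");
        return restClient.performRequest(request);
    }
}
```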
Pinging @elastic/ml-core (:ml)
There was one thing in this change that was concerning to me.
The model size stats seem to simply disappear when upgrading the snapshot. This is probably because `CResourceMonitor::createMemoryUsageReport` simply generates the report based on the current job usage (which is effectively nothing).
This strikes me as strange, as revert actually does change the model size stats for the job...
Here were my test results:
BEFORE

```json
"model_size_stats" : {
  "job_id" : "largish-kibana-sample-data",
  "result_type" : "model_size_stats",
  "model_bytes" : 4972992,
  "peak_model_bytes" : 2495692,
  "model_bytes_exceeded" : 0,
  "model_bytes_memory_limit" : 524288000,
  "total_by_field_count" : 130,
  "total_over_field_count" : 0,
  "total_partition_field_count" : 129,
  "bucket_allocation_failures_count" : 0,
  "memory_status" : "ok",
  "categorized_doc_count" : 0,
  "total_category_count" : 0,
  "frequent_category_count" : 0,
  "rare_category_count" : 0,
  "dead_category_count" : 0,
  "failed_category_count" : 0,
  "categorization_status" : "ok",
  "log_time" : 1604588763336,
  "timestamp" : 1607272200000
},
```
AFTER

```json
"model_size_stats": {
  "job_id": "largish-kibana-sample-data",
  "result_type": "model_size_stats",
  "model_bytes": 11954,
  "peak_model_bytes": 0,
  "model_bytes_exceeded": 0,
  "model_bytes_memory_limit": 524288000,
  "total_by_field_count": 130,
  "total_over_field_count": 0,
  "total_partition_field_count": 129,
  "bucket_allocation_failures_count": 0,
  "memory_status": "ok",
  "categorized_doc_count": 0,
  "total_category_count": 0,
  "frequent_category_count": 0,
  "rare_category_count": 0,
  "dead_category_count": 0,
  "failed_category_count": 0,
  "categorization_status": "ok",
  "log_time": 1604591566986,
  "timestamp": 1607272200000
},
```
There might need to be changes on the C++ side so that model size stats are not regenerated on snapshot save (especially when foreground persistence is used).
@droberts195 let me know what you think.
```java
String snapshotId = "1541587919";

createModelSnapshot(jobId, snapshotId, Version.V_7_0_0);
//TODO add a true state from the past somehow
```
I am not 100% sure how to do this. Right now this test effectively just checks that the parameters are parsed and sent, as the resulting error indicates that we at least tried to load the model snapshot.
The solution might be to get an actual snapshot and then manually update the doc so that the `min_version` is old.
> The solution might be to get an actual snapshot and then manually update the doc so that the `min_version` is old.
Yes, in terms of testing the infrastructure that would be a good way. Run a simple job like farequote, update the model snapshot document after the job is closed to have a `min_version` from the previous major, then upgrade it. Not sure this needs to be done in the HLRC tests though; for such a complex test the native multi-node tests seem like the single place to do it. I am happy to leave this test as-is, just testing the parameter passing.

In terms of testing the actual upgrade, it could be done in the BWC tests. We could have a BWC test (Java, not YAML) that does nothing when the old cluster is on the same major, but when the old cluster is on a different major it opens/runs/closes a job in the old cluster, then upgrades its model snapshot in the fully upgraded cluster (and does nothing in the mixed cluster).
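A hedged sketch of that document tweak (the results-index alias, the snapshot document id format, and the helper name are assumptions for illustration, not taken from this PR):

```java
import java.io.IOException;

import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class AgeSnapshotForTest {
    // After the job is closed, rewrite the snapshot doc so it looks like it was
    // written by the previous major. Index alias and doc id format are assumptions.
    public static void ageSnapshot(RestHighLevelClient client, String jobId, String snapshotId)
            throws IOException {
        UpdateRequest update = new UpdateRequest(
            ".ml-anomalies-" + jobId,                 // assumed results index alias
            jobId + "_model_snapshot_" + snapshotId)  // assumed snapshot doc id
            .doc("min_version", "6.4.0");
        client.update(update, RequestOptions.DEFAULT);
    }
}
```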
yeah, I have a BWC test class covering this case.
```java
    super(NAME, Response::new);
}

public static class Request extends MasterNodeRequest<Request> implements ToXContentObject {
```
This is a master node action as we always want the latest cluster state information.
```java
public enum SnapshotUpgradeState implements Writeable {

    READING_NEW_STATE, STOPPED, FAILED;
```
I didn't opt for a "writing_old_state" or an "opened" state as neither really conveyed any information. If the state is `null`, we know that either the task is not assigned to a node, or it is assigned and still loading the old snapshot.
Once we are in `reading_new_state`, that indicates we have reached the point of no return, and any failure from that state indicates a corrupted job model snapshot.
`reading_new_state` is very much from the perspective of the Java code rather than the end user. As an end user who doesn't even know that the code is split between Java and C++, I would have thought `writing_new_state` makes more sense. Or `saving_new_state` would be a compromise that makes sense to both end users and Java developers.

I would also introduce a `reading_old_state` or `loading_old_state` enum value that can be used in stats and API responses instead of `null`. We went through that cycle with job states. Initially there was no `opening` state, because a `null` task state basically meant that. But then we found it was nicer to have a specific enum value for it and translate `null` to that enum value in some places. Even if it's not used anywhere initially, adding it to the enum from the outset will avoid BWC code later.
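A minimal sketch of that suggestion (the value names and the `fromNullable` helper are illustrative, not code from this PR):

```java
import java.io.IOException;

import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.io.stream.Writeable;

public enum SnapshotUpgradeState implements Writeable {
    // LOADING_OLD_STATE stands in for the "task assigned but nothing reported
    // yet" case that a null task state represents today.
    LOADING_OLD_STATE, SAVING_NEW_STATE, STOPPED, FAILED;

    // Translate the null task state so stats and API responses never surface null.
    public static SnapshotUpgradeState fromNullable(SnapshotUpgradeState state) {
        return state == null ? LOADING_OLD_STATE : state;
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeEnum(this);
    }
}
```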
```diff
@@ -678,8 +684,8 @@ protected Clock getClock() {
         }
     } else {
         mlController = new DummyController();
-        autodetectProcessFactory = (job, autodetectParams, executorService, onProcessCrash) ->
-            new BlackHoleAutodetectProcess(job.getId(), onProcessCrash);
+        autodetectProcessFactory = (pipelineId, job, autodetectParams, executorService, onProcessCrash) ->
```
`pipelineId` is for renaming the resulting file pipeline. See the comments below for further explanation.
```java
if (state.nodes().getMaxNodeVersion().after(state.nodes().getMinNodeVersion())) {
    listener.onFailure(ExceptionsHelper.conflictStatusException(
        "Cannot upgrade job [{}] snapshot [{}] as not all nodes are on version {}. All nodes must be the same version",
```
This just eliminates the small edge case of requiring job node assignment to take node version into account. These processes are short-lived, and restricting the cluster to not be a mixed cluster is a sane limitation, especially since this API is meant to be used right before upgrading to the next major version.
```java
public AutodetectProcess createAutodetectProcess(String pipelineId,
                                                 Job job,
```
One of the requirements was that this upgrade be possible WHILE the referenced job is running. Consequently, the snapshot upgrade task and the job task COULD be assigned to the same node. If the pipeline ID was not given directly, this would cause a file name conflict.

Admittedly, there is already this "unique pipeline flag", but that is a `long` value. I thought it would be nice to include the snapshot ID directly in the pipeline name. It makes it very easy to investigate snapshot upgrader issues in the resulting logs by looking for `<job_id>-<snapshot_id>`.
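Roughly the naming idea (a sketch; the exact concatenation is an assumption):

```java
// Embedding both ids keeps the upgrader's named pipes distinct from those of a
// concurrently running job task on the same node, and makes log lines easy to grep.
String pipelineId = jobId + "-" + snapshotId;
```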
```java
 * <p>
 * This is a single purpose result processor and only handles snapshot writes
 */
public class JobSnapshotUpgraderResultProcessor {
```
I created a new processor here as I didn't want to chance ANY other result being written back. This just protects us from inadvertently updating the job results/state when we didn't mean to.
I possibly could have had an `AbstractResultProcessor` class, but the shared code was so small that it didn't really seem worth it.
```java
if (persistentTask == null) {
    isCompleted = true;
    return true;
}
```
In all my testing, the task is only `null` when it has been removed from cluster state. Since this predicate runs AFTER we have confirmed the task has been added to state (via the start task API), it is safe to assume that `null` means removal and thus completion.
```java
if (SnapshotUpgradeState.READING_NEW_STATE.equals(jobState)) {
    deleteSnapshotAndFailTask(task, params.getJobId(), params.getSnapshotId());
    return;
```
If we are assigned to a new node while in `reading_new_state`, the snapshot could be corrupted since the files are being overwritten one by one.
Consequently, we audit, log, and then delete the snapshot, as it is unusable anyway.
It MIGHT be better to add a flag to the snapshot that says "bad snapshot". But then the way to recover would be to delete the job model state and restore from an Elasticsearch snapshot... Up for debate.
```java
List<ModelSnapshot> snapshots = getModelSnapshots(job.getId(), snapshot.getSnapshotId()).snapshots();
assertThat(snapshots, hasSize(1));
assertThat(snapshot.getLatestRecordTimeStamp(), equalTo(snapshots.get(0).getLatestRecordTimeStamp()));

// Does the snapshot still work?
assertThat(hlrc.getJobStats(new GetJobStatsRequest(JOB_ID), RequestOptions.DEFAULT)
        .jobStats()
        .get(0)
        .getDataCounts().getLatestRecordTimeStamp(),
    greaterThan(snapshot.getLatestRecordTimeStamp()));
```
After backport I want to add a mixed cluster test (to make sure the mixed-node error throws), and I want to verify that the `min_version` is updated on the new snapshot.
Right now, since 8.x does not support upgrades from 6.x, that is not possible here. But in 7.x it will be good to test that `min_version` gets adjusted.
After this is merged and backported, another API should be added to check the stats of snapshot upgrades, and the deprecation API should be changed to report when a snapshot is too old to run in the next major.
I agree there need to be changes on the C++ side. @edsavage please could you investigate those two things.
Thanks for writing this extremely complicated yet also tedious functionality.
There's a lot of code so I haven't reviewed it all in detail, but have left an initial set of comments. The biggest one is that we need to think about how to avoid excessive complexity in the Kibana migration assistant that will have to use this code eventually.
```java
UpgradeJobModelSnapshotResponse response = client.machineLearning().upgradeJobSnapshot(request, RequestOptions.DEFAULT);
// end::upgrade-job-model-snapshot-execute
} catch (ElasticsearchException ex) {
    // TODO have a true snapshot in the past to upgrade?
```
As above, this will be complex and expensive, and I am not sure that is justified for the docs tests. We can do it once as part of the native multi-node tests, but burning that CPU many times in a full CI run seems unjustified.
yeah, I have a BWC test class covering this case.
In that case I think you should remove the TODO from here and instead have a comment to say that this is just checking syntax because actual upgrade is covered elsewhere.
```asciidoc
--------------------------------------------------
<1> The job that owns the snapshot
<2> The snapshot id to upgrade
<3> The time out of the request
```
As an end user I would be interested in what this means if `wait_for_completion=false`, and what the default is.
```java
try {
    return PARSER.parse(parser, null);
} catch (IOException e) {
    throw new RuntimeException(e);
```
Did you consider `UncheckedIOException`?
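For comparison, the suggested alternative might look like this (a sketch mirroring the quoted snippet; `PARSER`, `Request`, and the method name stand in for the surrounding class):

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public static Request fromXContent(XContentParser parser) {
    try {
        return PARSER.parse(parser, null);
    } catch (IOException e) {
        // Keeps the IO nature of the failure visible to callers and in stack
        // traces, unlike a bare RuntimeException.
        throw new UncheckedIOException(e);
    }
}
```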
```java
if (response.result.getMinVersion().major >= UPGRADE_FROM_MAJOR) {
    listener.onFailure(ExceptionsHelper.conflictStatusException(
        "Cannot upgrade job [{}] snapshot [{}] as it is already compatible with current major version {}",
        request.getJobId(),
        request.getSnapshotId(),
        UPGRADE_FROM_MAJOR));
    return;
}
```
I would do the check on the exact version rather than just the major. Although it shouldn't be necessary, upgrading the format from e.g. 7.0 format to 7.11 format might be a useful piece of functionality to have in the future to work around some other bug.
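A sketch of that exact-version comparison, mirroring the quoted snippet (the error message and the use of `Version.CURRENT` as the reference point are illustrative):

```java
// Illustrative: reject only snapshots already written in the current version's
// format, so that e.g. a 7.0-format snapshot could still be re-written by 7.11.
if (response.result.getMinVersion().onOrAfter(Version.CURRENT)) {
    listener.onFailure(ExceptionsHelper.conflictStatusException(
        "Cannot upgrade job [{}] snapshot [{}] as it was already written by version {}",
        request.getJobId(),
        request.getSnapshotId(),
        response.result.getMinVersion()));
    return;
}
```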
```java
listener.onFailure(ExceptionsHelper.conflictStatusException(
    "Cannot upgrade snapshot [{}] for job [{}] as it is the current primary job snapshot",
    request.getSnapshotId(),
    request.getJobId()
));
```
This means extra complication for the Kibana upgrade assistant though. For every model snapshot that exists that is too old it will now have to recommend one of two possible courses of action, depending on whether the snapshot is the active one or not. Opening and closing a job normally without sending it any data doesn't rewrite the snapshot, so the user would also have to feed some data to actually change the active model snapshot of the job. So I think this check should be altered to only ban upgrading the active snapshot if the job is open.
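A sketch of the altered check (names mirror the quoted snippet; `jobState` is an assumed local holding the job's current task state):

```java
// Only ban upgrading the active snapshot while the job is open; for a closed
// job the active snapshot can safely be re-written in place.
if (jobState == JobState.OPENED) {
    listener.onFailure(ExceptionsHelper.conflictStatusException(
        "Cannot upgrade snapshot [{}] for job [{}] as it is the current snapshot of an open job",
        request.getSnapshotId(),
        request.getJobId()));
    return;
}
```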
```java
    return fieldIndexes;
}

void writeHeader() throws IOException {
```
I am not surprised you had to write a header. You could probably get away with writing one with just the control field (field name `.`). But it's not particularly important, so I'm happy to leave what's here.
LGTM
I noticed a few more minor things but am happy to merge this once they're resolved.
```java
@@ -244,4 +244,15 @@ public void writeStartBackgroundPersistMessage() throws IOException {
        fillCommandBuffer();
        lengthEncodedWriter.flush();
    }

    public void writeStartBackgroundPersistMessage(long snapshotTimestamp, String snapshotId, String description) throws IOException {
```
Please add a Javadoc comment to say whether `snapshotTimestamp` is in epoch millis or epoch seconds. Also, it might be worth adding `Seconds` or `Millis` to the variable name to make it ultra clear which it is for future maintainers.
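Something along these lines (a sketch of the suggestion; the rename and Javadoc wording are illustrative, and the body is elided):

```java
/**
 * Writes a control message asking the process to persist state in the
 * background, even if it is unchanged.
 *
 * @param snapshotTimestampSeconds the snapshot timestamp in epoch SECONDS;
 *                                 the C++ process expects seconds, not millis
 */
public void writeStartBackgroundPersistMessage(long snapshotTimestampSeconds,
                                               String snapshotId,
                                               String description) throws IOException {
    // ... body unchanged from the PR ...
}
```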
```java
@@ -36,6 +36,15 @@
 */
void persistState() throws IOException;

/**
 * Ask the process to persist state, even if it is unchanged.
 * @param snapshotTimestamp The snapshot timestamp
```
Please add whether this is epoch seconds or epoch millis.
```java
// C++ is expecting the timestamp to be in seconds, not Milliseconds
params.modelSnapshot().getTimestamp().getTime()/1000,
```
Since Java rarely uses epoch seconds, it's probably better to move the `/1000` closer to the point of passing the information to the C++ process, e.g. in `AutodetectControlMsgWriter`.
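That is, keep millis in the Java layer and convert at the boundary (a sketch; the body is elided):

```java
// In AutodetectControlMsgWriter (illustrative): callers pass idiomatic epoch
// millis, and the conversion to the seconds the C++ side expects happens at
// the last possible moment.
public void writeStartBackgroundPersistMessage(long snapshotTimestampMillis,
                                               String snapshotId,
                                               String description) throws IOException {
    long snapshotTimestampSeconds = snapshotTimestampMillis / 1000;
    // ... write the control message using snapshotTimestampSeconds ...
}
```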
```java
        bulkResultsPersister.executeRequest();
    }
} catch (Exception e) {
    LOGGER.warn(new ParameterizedMessage("[{}] Error persisting autodetect results", jobId), e);
```
```suggestion
LOGGER.warn(new ParameterizedMessage("[{}] Error persisting model snapshot [{}] upgrade results", jobId, snapshotId), e);
```
```java
// that it would have been better to close jobs before shutting down,
// but we now fully expect jobs to move between nodes without doing
// all their graceful close activities.
LOGGER.warn("[{}] some results not processed due to the process being killed", jobId);
```
LOGGER.warn("[{}] some results not processed due to the process being killed", jobId); | |
LOGGER.warn("[{}] some model snapshot [{}] upgrade results not processed due to the process being killed", jobId, snapshotId); |
LOGGER.warn("[{}] some results not processed due to the process being killed", jobId); | ||
} else if (process.isProcessAliveAfterWaiting() == false) { | ||
// Don't log the stack trace to not shadow the root cause. | ||
LOGGER.warn("[{}] some results not processed due to the termination of autodetect", jobId); |
LOGGER.warn("[{}] some results not processed due to the termination of autodetect", jobId); | |
LOGGER.warn("[{}] some model snapshot [{}] upgrade results not processed due to the termination of autodetect", jobId, snapshotId); |
```java
} else {
    // We should only get here if the iterator throws in which
    // case parsing the autodetect output has failed.
    LOGGER.error(new ParameterizedMessage("[{}] error parsing autodetect output", jobId), e);
```
```suggestion
LOGGER.error(new ParameterizedMessage("[{}] error parsing autodetect output during model snapshot [{}] upgrade", jobId, snapshotId), e);
```
```java
if (isAlive() == false) {
    throw e;
}
LOGGER.warn(new ParameterizedMessage("[{}] Error processing autodetect result", jobId), e);
```
```suggestion
LOGGER.warn(new ParameterizedMessage("[{}] Error processing autodetect result during model snapshot [{}] upgrade", jobId, snapshotId), e);
```
```java
private void logUnexpectedResult(String resultType) {
    LOGGER.info("[{}] [{}] unexpected result read [{}]", jobId, snapshotId, resultType);
}
```
Consider adding `assert resultType == null` or something else that will detect if this happens during our integration tests.
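A sketch of that suggestion; `assert false` here is equivalent in effect, tripping whenever the method is called with assertions enabled (as in integration tests):

```java
private void logUnexpectedResult(String resultType) {
    // Integration tests run with -ea, so this fails the test outright;
    // in production it degrades to the log line below.
    assert false : "[" + jobId + "] [" + snapshotId + "] unexpected result read [" + resultType + "]";
    LOGGER.info("[{}] [{}] unexpected result read [{}]", jobId, snapshotId, resultType);
}
```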
[ML] add new snapshot upgrader API for upgrading older snapshots (elastic#64665)

This new API provides a way for users to upgrade their own anomaly job model snapshots. To upgrade a snapshot the following is done:
- Open a native process given the job id and the desired snapshot id
- load the snapshot to the process
- write the snapshot again from the native task (now updated via the native process)

relates elastic#64154
When a persist control message with arguments is received by the anomaly detector, it doesn't go through the standard chain of persistence calls: it unconditionally rewrites the state (even if no data has been seen) and includes only the anomaly detector state, omitting the categorizer state. Because of this, the memory usage was not being recalculated prior to persisting the state as would normally happen. This PR rectifies that omission. Fixes one of the problems detailed in elastic/elasticsearch#64665 (review)
[ML] add new snapshot upgrader API for upgrading older snapshots (#64665) (#65010)

This new API provides a way for users to upgrade their own anomaly job model snapshots. To upgrade a snapshot the following is done:
- Open a native process given the job id and the desired snapshot id
- load the snapshot to the process
- write the snapshot again from the native task (now updated via the native process)

relates #64154
When a persist control message with arguments is received by the anomaly detector, it doesn't go through the standard chain of persistence calls: it unconditionally rewrites the state (even if no data has been seen) and includes only the anomaly detector state, omitting the categorizer state. Because of this, the memory usage was not being recalculated prior to persisting the state as would normally happen. This PR rectifies that omission. Fixes one of the problems detailed in elastic/elasticsearch#64665 (review)

Backport of elastic#1585