Description
Anomaly detector jobs and datafeed configurations are currently stored in the cluster state. The initial design decision was based on the need to distribute work across the cluster. Now that persistent tasks exist, and with a little hindsight, it is not necessary to store config in the cluster state if the persistent task parameters contain enough information to open a job and start a datafeed.
Placing config in the cluster state has caused a number of issues:
- Jobs cannot be easily transferred to another cluster
- User-defined searches in the datafeed may use deprecated features that are removed in the next version. After upgrading, reading the cluster state can then fail because parsing the removed feature throws an exception
- If you restore a snapshotted cluster state you restore all jobs that existed at the time the snapshot was taken
- Jobs do not recover if the cluster state is lost [ML] Jobs do not recover if the cluster state is deleted #30088
- Adding new types of job or changing the validation is difficult or impossible in a backwards compatible manner [ML] Validate existing cluster state differently to newly submitted configs #30084 [ML] Opening a job with duplicate detectors fail #30070
Proposal
Starting in 6.last, new job and datafeed configurations will be stored in a new internal index, .ml-config. Jobs created prior to 6.last will be migrated to index documents and removed from the cluster state, with the goal of removing all ML data from the cluster state. 7.x will retain the ability to read jobs from the cluster state to support full cluster upgrades and restoring a snapshot containing a global cluster state. The preferred solution is to automatically migrate extant cluster state jobs on allocation. In the 8 series all the code to handle ML config in the cluster state will be dropped.
Work Plan
Work will be done on the feature branches feature/feature-jindex-6x and feature/feature-jindex-master with regular merges. The first stage is the changes required to create and run a job with its configuration stored in an index. Once that is stable and passing the testing gate the migration of existing jobs on upgrade will be tackled.
Phase 1: Run Jobs with their configuration defined in an index
- Add the .ml-config index template with job and datafeed mappings [ML] Job and datafeed mappings with index template #32719
- Add class to handle read/write/update/deletes on config documents [ML] Datafeed config CRUD operations #32854 [ML] Job config document CRUD operations #32738
- Wildcard expansion for getting jobs and datafeeds e.g. _all, foo-*. This replaces the MlMetaData Group or Job Lookup functionality [ML] Datafeed config CRUD operations #32854 [ML] Job config document CRUD operations #32738
- Change JobManager to operate on migrated configs [ML] Change JobManager to work with Job config in index #33064
- Change DatafeedManager to operate on migrated configs [ML] Change Datafeed actions to read config from the config index #33273
- The OpenJobPersistentTasksExecutor validate and selectLeastLoadedMlNode methods require information that can no longer be read from the cluster state [ML] Investigate alternative methods for sharing job memory usage information #34084
- DatafeedNodeSelector requires information that can no longer be read from the cluster state [ML] Job in index: Datafeed node selector #34218
- Close job and TransportFinalizeJobExecutionAction must be changed so they don't update the cluster state [ML] Close job defined in index #34217 [ML] Adjust finalize job action to work with documents #34226
- AbstractExpiredJobDataRemover must read migrated job configurations [ML] Job in Index: Convert job data remover to work with index configs #34532
- Delete job and implement the equivalent of markJobAsDeleting [ML] Delete job document #34595
- Get job and datafeed stats [ML] Job in index: Get datafeed and job stats from index #34645
- All the endpoints [ML] Job in Index: Stop and preview datafeed #34605 [ML] Job in index: delete filter action #34642 [ML] Job in Index: Convert get calendar events to index docs #34710
- All the tests (excluding rolling upgrade) [ML] Job in Index: Enable integ tests #34851
- Replace Version.CURRENT with the release version [ML] JIndex: Replace Version.CURRENT in streaming functions #36118
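The wildcard expansion item above (_all, foo-*) can be illustrated with a minimal sketch. This is not the Elasticsearch implementation: the class JobIdExpander and its expand method are hypothetical names, and the sketch matches against an in-memory collection of known IDs, whereas the real code would have to search the .ml-config index. Treating an empty expression as _all is also an assumption.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Elasticsearch.
final class JobIdExpander {

    private JobIdExpander() {}

    /**
     * Expands a comma-separated expression such as "_all" or "foo-*,bar"
     * against the known IDs and returns the matches in sorted order.
     * Assumption: a null or blank expression means "_all".
     */
    static List<String> expand(String expression, Collection<String> knownIds) {
        TreeSet<String> matches = new TreeSet<>();
        String effective = (expression == null || expression.trim().isEmpty()) ? "_all" : expression;
        for (String rawToken : effective.split(",")) {
            String token = rawToken.trim();
            if (token.isEmpty()) {
                continue; // ignore empty tokens such as in "a,,b"
            }
            Pattern pattern = toPattern(token.equals("_all") ? "*" : token);
            for (String id : knownIds) {
                if (pattern.matcher(id).matches()) {
                    matches.add(id);
                }
            }
        }
        return new ArrayList<>(matches);
    }

    /** Converts a token with '*' wildcards into a regex, quoting all other characters. */
    private static Pattern toPattern(String token) {
        String[] literals = token.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(literals[i]));
        }
        return Pattern.compile(regex.toString());
    }
}
```

IDs that match no token are silently dropped in this sketch; whether a non-wildcard ID with no match should raise a "missing job" error is a separate decision that the sketch does not model.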
Phase 1a: 6.6 & 6.7 Jobs can be defined in the clusterstate or an index document
- All reads of config must check both index and clusterstate [ML] Job In Index: Enable GET APIS in mixed state #35344 [ML] Job in index: Restore ability to update cluster state jobs #35539 [ML] Job in index: Enable delete actions for clusterstate config #35590 [ML] Job in index: Enable get and update actions for clusterstate jobs #35598
- Upgrade tests [ML] Jindex: Rolling upgrade tests #35700 [ML] Full cluster restart tests for migration #36593
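While jobs can live in either place, every read has to consult both sources. The following is a rough sketch of that lookup, not the actual code: ConfigLookup is a hypothetical class, configs are represented as plain strings keyed by job ID, and the two maps stand in for the cluster state and the .ml-config index. Which source wins when an ID exists in both is an assumption here (cluster state).

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of Elasticsearch.
final class ConfigLookup {

    private ConfigLookup() {}

    /** Finds one config, checking the cluster state first and the index second. */
    static Optional<String> find(String jobId,
                                 Map<String, String> clusterStateConfigs,
                                 Map<String, String> indexConfigs) {
        String config = clusterStateConfigs.get(jobId);
        if (config == null) {
            config = indexConfigs.get(jobId);
        }
        return Optional.ofNullable(config);
    }

    /** Returns all configs from both sources; cluster state entries overwrite index entries. */
    static Map<String, String> merge(Map<String, String> clusterStateConfigs,
                                     Map<String, String> indexConfigs) {
        Map<String, String> merged = new TreeMap<>(indexConfigs);
        merged.putAll(clusterStateConfigs); // assumed precedence: cluster state wins
        return merged;
    }
}
```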
Phase 2: Migrate Job and Datafeed Configuration
- Where config exists in both index and clusterstate prefer clusterstate (originally index) [ML] Jindex: Prefer index config documents to cluster state config #35940 [ML] Prefer cluster state config to index documents #36014
- Create a migration class for jobs and datafeeds [ML] Job In Index: Migrate config from the clusterstate #35834
- Automatically migrate closed jobs and datafeeds [ML] Job In Index: Migrate config from the clusterstate #35834 [ML] JIndex: Prevent updates to migrating configs and upgrade tests #36425 [ML] JIindex: Limit the size of bulk migrations #36481
- Migrate open jobs and datafeeds once they become unallocated [ML] Migrate unallocated jobs and datafeeds #37430 [ML] Migrate unallocated jobs and datafeeds #37536
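Limiting the size of bulk migrations amounts to splitting the eligible configs into bounded batches and migrating one batch at a time. A minimal sketch of the splitting step follows; MigrationBatches and partition are hypothetical names, and the batch size is supplied by the caller because the issue does not state the limit used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Elasticsearch.
final class MigrationBatches {

    private MigrationBatches() {}

    /** Splits the configs into consecutive batches of at most maxBatchSize elements. */
    static <T> List<List<T>> partition(List<T> configs, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < configs.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, configs.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(configs.subList(start, end)));
        }
        return batches;
    }
}
```

Each batch would then be written to the index in one bulk request and removed from the cluster state only after that write succeeds; that step is not shown here.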
Issues
- Alternative method to share job memory usage. Blocker [ML] Investigate alternative methods for sharing job memory usage information #34084 [ML] Reimplement established model memory #35263
- Parsing deprecated and removed features in Datafeed Config Blocker [ML] Address parsing deprecated and removed features in Datafeed Config #34858
- Backup MlMetadata before starting migration [ML] Snapshot MlMetadata before migration #36422
- Repeatedly migrate config in batches [FEATURE][ML] Split in batches and migrate all jobs and datafeeds #36716
- Create .ml-config index in case autocreate is disabled [ML] Create the ml-config index #36608
- Transient setting override [FEATURE][ML] Add cluster setting to enable/disable config migration #36700
- Check .ml-config is yellow [ML] Create the ml-config index #36608
- CI failure MlDistributedFailureIT.testLoseDedicatedMasterNode [CI][ML] MlDistributedFailureIT.testLoseDedicatedMasterNode randomly fails on feature-jindex-master branch #36760
- CI failure RestoreModelSnapshotIT [CI] [ML] RestoreModelSnapshotIT failures #36849
- CI failure MlMigrationFullClusterRestartIT [CI] FullClusterRestartIT.testSnapshotRestore breaks other tests by wiping out persistent tasks #36816
- CI failure 60_ml_config_migration [CI] Failures on 6.x for 60_ml_config_migration/Test old cluster jobs and datafeeds and delete them #36810
- CI failure model mem limit [CI][ML] Rolling upgrade failure in '30_ml_jobs_crud/Test model memory limit is updated' #36961
- NPE in MachineLearningFeatureSet.addJobsUsage NullPointerException in MachineLearningFeatureSet$Retriever.addJobsUsage ml-cpp#351
- UnusedStateRemover will remove all state for jobs not in cluster state Blocker [ML] UnusedStateRemover will remove all state for jobs not in cluster state #37109
- MachineLearningLicensingTests [CI] IndicesQueryCache.close assertion failure in internal cluster tests #37117
Nice to haves
- TransportJobTaskAction no longer throws missing job [ML] Job in Index Feature: TransportJobTaskAction no longer detects unknown jobs #34747
- Get Jobs response is limited by search size [ML] Handle large numbers of jobs in the GET jobs response #34864 [ML] Create the ml-config index #36608
- Consider changing where config is validated [ML] Consider validating jobs outside of the builder #34899
- Use custom all field for configs [ML] Add custom all field to config index #36445
- Don't write empty lists of jobs and datafeeds in MlMetadata.toXContent [ML] Hide empty config lists in MlMetadata post migration #36421
- Index aliases for .ml-config
- OpenJobParams (p task params) contain the full job when only certain fields are required
- Ensure ordering of updates to the autodetect process