[ML] Improve logging for reindex with semantic_text fields #134219

@prwhelan

Description

Use Case: Users can run into issues when reindexing data from an original index into a new index whose mapping includes embedding / semantic_text fields: the reindex starts, but no documents ever appear in the destination index.

In this example a user wants to reindex a source index of ~5k documents. The mappings of the two indices differ; in particular, the source index has an embedding field (this is an ESS deployment, so we can see it):

        "embedding_field": {
          "type": "semantic_text",
          "inference_id": ".elser-2-elasticsearch"
        }

The reindex is started with:

POST _reindex?wait_for_completion=false
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" }
}

The reindex task keeps running without any errors, but also without any progress; GET _tasks/<task id> will show the task.
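
As a sketch of what that check looks like (the task ID placeholder is whatever the reindex call above returned), the task can be found either by ID or by listing running reindex tasks:

GET _tasks?detailed=true&actions=*reindex

GET _tasks/<task id>

In the stuck case the created and updated counters in the task's status typically stay at 0 while total matches the source document count, so the task looks alive but makes no progress.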

ES logs will show a WARN:

[<time>][WARN ][org.elasticsearch.xpack.ml.inference.adaptiveallocations.AdaptiveAllocationsScalerService] [<instance>] adaptive allocations scaler: scaling [.elser-2-elasticsearch] to [4] allocations failed.
org.elasticsearch.ElasticsearchStatusException: Could not update deployment because there are not enough resources to provide all requested allocations
	at org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentClusterService.increaseNumberOfAllocations(TrainedModelAssignmentClusterService.java:994) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentClusterService.lambda$updateAssignment$18(TrainedModelAssignmentClusterService.java:956) ~[?:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:956) ~[elasticsearch-8.17.3.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1575) ~[?:?]
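
As an additional diagnostic sketch (not part of the original report), the "not enough resources" error can be cross-checked against the ML memory stats API, which reports per node how much memory ML can use and how much is already taken; this can then be compared with the model's required_native_memory_bytes shown below:

GET _ml/memory/_stats?human=true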

The ML node has 4GB of RAM:

GET _ml/trained_models/_stats
{
  "count": 3,
  "trained_model_stats": [
    {
      "model_id": ".elser_model_2",
      "model_size_stats": {
        "model_size_bytes": 438123914,
        "required_native_memory_bytes": 2101346304
      },
      "pipeline_count": 0,
      "inference_stats": {
        "failure_count": 0,
        "inference_count": 0,
        "cache_miss_count": 0,
        "missing_all_fields_count": 0,
        "timestamp": 1756456905350
      },
      "deployment_stats": {
        "deployment_id": ".elser-2-elasticsearch",
        "model_id": ".elser_model_2",
        "threads_per_allocation": 1,
        "number_of_allocations": 0,
        "adaptive_allocations": {
          "enabled": true,
          "min_number_of_allocations": 0,
          "max_number_of_allocations": 32
        },
        "queue_capacity": 10000,
        "state": "started",
        "allocation_status": {
          "allocation_count": 0,
          "target_allocation_count": 0,
          "state": "fully_allocated"
        },
        "cache_size": "417.8mb",
        "priority": "normal",
        "start_time": 1751635728892,
        "peak_throughput_per_minute": 0,
        "nodes": []
      }
    },
    {
      "model_id": ".elser_model_2_linux-x86_64",
      "model_size_stats": {
        "model_size_bytes": 274756282,
        "required_native_memory_bytes": 2101346304
      },
      "pipeline_count": 1,
      "ingest": {
        "total": {
          "count": 0,
          "time_in_millis": 0,
          "current": 0,
          "failed": 0
        },
        "pipelines": {
          ".kibana-observability-ai-assistant-kb-ingest-pipeline": {
            "count": 0,
            "time_in_millis": 0,
            "current": 0,
            "failed": 0,
            "ingested_as_first_pipeline_in_bytes": 0,
            "produced_as_first_pipeline_in_bytes": 0,
            "processors": [
              {
                "inference": {
                  "type": "inference",
                  "stats": {
                    "count": 0,
                    "time_in_millis": 0,
                    "current": 0,
                    "failed": 0
                  }
                }
              }
            ]
          }
        }
      },
      "inference_stats": {
        "failure_count": 0,
        "inference_count": 0,
        "cache_miss_count": 0,
        "missing_all_fields_count": 0,
        "timestamp": 1756456905350
      },
      "deployment_stats": {
        "deployment_id": "my-elser-endpoint",
        "model_id": ".elser_model_2_linux-x86_64",
        "threads_per_allocation": 1,
        "number_of_allocations": 1,
        "queue_capacity": 10000,
        "state": "started",
        "allocation_status": {
          "allocation_count": 1,
          "target_allocation_count": 1,
          "state": "fully_allocated"
        },
        "cache_size": "262mb",
        "priority": "normal",
        "start_time": 1750851473245,
        "peak_throughput_per_minute": 0,
        "nodes": [
        ]
      }
    },
    {
      "model_id": "lang_ident_model_1",
      "model_size_stats": {
        "model_size_bytes": 1053992,
        "required_native_memory_bytes": 0
      },
      "pipeline_count": 0
    }
  ]
}

So .elser-2-elasticsearch is not allocated (number_of_allocations is 0 and the nodes list is empty). It is not obvious to the user that ML node autoscaling must be enabled for the deployment to scale up and handle the reindex.
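
As a hedged illustration of a possible mitigation (not something this issue proposes), once there is enough ML capacity the deployment's adaptive allocations can be adjusted via the update trained model deployment API so the endpoint keeps at least one allocation warm. The deployment ID and max value below are taken from the stats above; this assumes the default .elser-2-elasticsearch deployment accepts updates through this API and that the cluster version supports the adaptive_allocations object in it:

POST _ml/trained_models/.elser-2-elasticsearch/deployment/_update
{
  "adaptive_allocations": {
    "enabled": true,
    "min_number_of_allocations": 1,
    "max_number_of_allocations": 32
  }
}

On ESS the more fundamental fix is enabling ML autoscaling on the deployment itself, which is a Cloud deployment setting rather than an Elasticsearch API call; without it (or without a bigger ML node) the scaler will keep hitting the "not enough resources" error shown above.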

Metadata

Assignees: No one assigned

Labels: :ml (Machine learning), >bug, Team:ML (Meta label for the ML team)
