ES|QL cross-cluster searches honor the skip_unavailable cluster setting #112886

Description

UPDATE: This issue generated a lot of discussion about the right approach. A new issue has been created that summarizes the proposed approaches to handling skip_unavailable in ES|QL.

Overview

The skip_unavailable remote cluster setting is intended to let ES admins specify whether a cross-cluster search should fail or return partial data when errors occur on a remote cluster during the search.

For _search, if skip_unavailable is true, a cross-cluster search:

  • Skips the remote cluster if its nodes are unavailable during the search. The response’s _clusters.skipped value contains a count of any skipped clusters and the _clusters.details section of the response will show a skipped status.
  • Treats errors returned by the remote cluster, such as unavailable shards, no matching indices, etc., as non-fatal. The search will continue and return results from the other clusters.
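
For reference, skip_unavailable is configured per remote cluster. A minimal example of setting it through the cluster settings API (using the remote1 alias from the examples below):

PUT /_cluster/settings
{
  "persistent": {
    "cluster.remote.remote1.skip_unavailable": true
  }
}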

ES|QL cross-cluster searches should also respect this setting, but we need to define exactly how it should work.

Proposed Implementation in ES|QL. Phase 1: field-caps and enrich policy-resolve APIs

To start, support for skip_unavailable should be implemented in both the field-caps and enrich policy-resolve APIs, both of which are called during the "pre-analysis" phase of ES|QL processing.
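
For context, the field-caps portion of pre-analysis is roughly equivalent to a cross-cluster field capabilities request like the one below (the index names are illustrative); the handling proposed here kicks in when that call cannot reach a remote cluster:

GET /index*,remote1:index*,remote2:index*/_field_caps?fields=*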

When a remote cluster cannot be connected to during the field-caps or enrich policy-resolve steps:

  • if skip_unavailable=true (the default setting) for the remote cluster, the cluster will be marked as SKIPPED in the EsqlExecutionInfo metadata object for that search, reported as skipped in the _clusters.details metadata section of the ES|QL response, and a failure reason will be provided (see examples section below).
  • if skip_unavailable=false, then a 500 HTTP status code is returned along with a single top-level error, as _search does.

If the index expression provided does not match any indices on a cluster, how should we handle that? I propose that we follow the pattern in _search:

  • if skip_unavailable=true (the default setting) for the remote cluster, the cluster will be marked as SKIPPED along with an "index_not_found" failure message
  • if skip_unavailable=false for the remote cluster, the cluster will be marked as SKIPPED along with an "index_not_found" failure message ONLY IF the index expression was specified with a wildcard (lenient handling) - see example below
  • if skip_unavailable=false for the remote cluster, and a concrete index was specified by the client (no wildcards), the error is fatal and an HTTP 404 status will be returned with an "index_not_found" failure message (see example below)

An additional consideration is how to treat the local cluster. It does not have an explicit skip_unavailable setting. Since it is the coordinating cluster, it will never be unavailable, but we need to decide how to handle the case when the index expression provided matches no indices - is it a fatal error or should we just mark the local cluster as skipped?

I propose that we treat the local cluster as if skip_unavailable=true in this case (see the illustrative sketch after this list), for three reasons:

  1. users cannot change the skip_unavailable setting for the local cluster
  2. skip_unavailable=true is the default for remote clusters, so it should be the default for local also
  3. it is consistent with how ES|QL currently behaves. (Right now, as long as one cluster has matching indices, the search will proceed and return data from the clusters with matching indices.)
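
To illustrate the proposal above: if the local index expression matches nothing but a remote does match, the local cluster entry in _clusters.details might look roughly like the sketch below. This is illustrative only; the nomatch* index name is made up and the failure format is an assumption modeled on the skipped remote entries shown later in this issue.

// illustrative sketch, not actual output
"(local)": {
  "status": "skipped",
  "indices": "nomatch*",
  "took": 0,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "failures": [
    {
      "shard": -1,
      "index": null,
      "reason": {
        "type": "index_not_found_exception",
        "reason": "no such index [nomatch*]",
        "index_uuid": "_na_",
        "index": "nomatch*"
      }
    }
  ]
}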

Reference

I have documented how _search and ES|QL currently behave with respect to indices not matching here: https://gist.github.com/quux00/a1256fd43947421e3a6993f982d065e8

Examples of proposed ES|QL response outputs

In these examples:

remote1 has skip_unavailable=true
remote2 has skip_unavailable=false

Fatal error (404) when index not found and wildcard NOT used in remote2
POST /_query/async
{
  "query": "FROM *,remote1:x,remote2:x|\n STATS count(*) by authors.last_name | LIMIT 4"
}

// response
{
  "error": {
    "root_cause": [
      {
        "type": "index_not_found_exception",
        "reason": "no such index [x] and No matching index for [x] was found on [remote2] cluster (which has skip_unavailable=false)",
        "index_uuid": "_na_",
        "index": "x"
      }
    ],
    "type": "index_not_found_exception",
    "reason": "no such index [x] and No matching index for [x] was found on [remote2] cluster (which has skip_unavailable=false)",
    "index_uuid": "_na_",
    "index": "x"
  },
  "status": 404
}
Skipped clusters when index not found and wildcard used in remote2
POST /_query/async?drop_null_columns
{
  "query": "FROM *,remote1:x,remote2:x*|\n STATS count(*) by authors.last_name | LIMIT 4"
}

// response
  "_clusters": {
    "total": 3,
    "successful": 1,
    "running": 0,
    "skipped": 2,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "*",
        "took": 50,
        "_shards": {
          "total": 21,
          "successful": 21,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote2": {
        "status": "skipped",
        "indices": "x*",
        "took": 0,
        "_shards": {
          "total": 0,
          "successful": 0,
          "skipped": 0,
          "failed": 0
        },
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "index_not_found_exception",
              "reason": "no such index [x*]",
              "index_uuid": "_na_",
              "index": "x*"
            }
          }
        ]
      },
      "remote1": {
        "status": "skipped",
        "indices": "x",
        "took": 0,
        "_shards": {
          "total": 0,
          "successful": 0,
          "skipped": 0,
          "failed": 0
        },
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "index_not_found_exception",
              "reason": "no such index [x]",
              "index_uuid": "_na_",
              "index": "x"
            }
          }
        ]
      }
    }
  }
Skipped cluster remote1 when not available
  "_clusters": {
    "total": 3,
    "successful": 2,
    "running": 0,
    "skipped": 1,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "*",
        "took": 181,
        "_shards": {
          "total": 21,
          "successful": 21,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote2": {
        "status": "successful",
        "indices": "*",
        "took": 180,
        "_shards": {
          "total": 12,
          "successful": 12,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote1": {
        "status": "skipped",
        "indices": "*",
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "connect_transport_exception",
              "reason": "Unable to connect to [remote1]"
            }
          }
        ]
      }
    }
  }
Fatal error since remote2 not available
{
  "error": {
    "root_cause": [
      {
        "type": "uncategorized_execution_exception",
        "reason": "Failed execution"
      }
    ],
    "type": "connect_transport_exception",
    "reason": "[][127.0.0.1:9302] connect_exception",
    "caused_by": {
      "type": "uncategorized_execution_exception",
      "reason": "Failed execution",
      "caused_by": {
        "type": "execution_exception",
        "reason": "io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9302",
        "caused_by": {
          "type": "annotated_connect_exception",
          "reason": "Connection refused: /127.0.0.1:9302",
          "caused_by": {
            "type": "connect_exception",
            "reason": "Connection refused"
          }
        }
      }
    }
  },
  "status": 500
}

Proposed Implementation in ES|QL. Phase 2: trapping errors during ES|QL operations (after planning)

To be fully compliant with the skip_unavailable model, we will also need to add error handling during ES|QL processing itself. If shard errors (or other fatal errors) occur during ES|QL processing on a remote cluster that is marked skip_unavailable=true, we will need to trap those errors, avoid returning a 4xx/5xx error to the user, and instead mark the cluster as either skipped or partial (depending on whether we can use the partial data that came back) in the EsqlExecutionInfo, along with failure info, as we do in _search.
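
For illustration, a remote marked skip_unavailable=true that hits shard failures mid-execution but still returns usable data might be reported as partial in _clusters.details, along the lines of the sketch below. The values, index name, and failure shape are made up, modeled on the _search partial-results format.

// illustrative sketch, not actual output
"remote1": {
  "status": "partial",
  "indices": "*",
  "took": 93,
  "_shards": {
    "total": 12,
    "successful": 10,
    "skipped": 0,
    "failed": 2
  },
  "failures": [
    {
      "shard": 3,
      "index": "remote1:logs",
      "reason": {
        "type": "exception",
        "reason": "error during ES|QL execution on a data node"
      }
    }
  ]
}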

Since ES|QL currently treats failures during ES|QL processing as fatal, I do not know how hard adding this feature will be. I would like feedback from the ES|QL team on how feasible this is and how it could be done.
