
Definition of how ES|QL cross-cluster searches will honor the skip_unavailable setting #114531

Description

Overview

The skip_unavailable remote cluster setting is intended to allow ES admins to specify whether a cross-cluster search should fail or return partial data in the face of errors on a remote cluster during a cross-cluster search.

For _search, if skip_unavailable is true, a cross-cluster search:

  • Skips the remote cluster if its nodes are unavailable during the search. The response’s _clusters.skipped value contains a count of any skipped clusters and the _clusters.details section of the response will show a skipped status.
  • Errors returned by the remote cluster, such as unavailable shards, no matching indices, etc. are not fatal. The search will continue and return results from other clusters.
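For context, a minimal sketch of a cross-cluster _search request (the cluster alias remote1 and index pattern web* are illustrative). If remote1 is marked skip_unavailable=true and is unreachable, the response still comes back with a 2xx status, _clusters.skipped is incremented, and _clusters.details shows remote1 with a skipped status:

GET /web*,remote1:web*/_search
{
  "query": {
    "match_all": {}
  }
}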

ES|QL cross-cluster searches should also respect this setting, but we need to define exactly how it will work, aligned with the ES|QL philosophy of operation.

The skip_unavailable setting only applies to remote clusters. The local, querying cluster has no such setting and the behavior of errors on the local cluster will act the same as they do currently (modulo bug fixes).
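For reference, skip_unavailable is configured per remote via the cluster settings API; a minimal sketch, where remote1 is a hypothetical cluster alias:

PUT /_cluster/settings
{
  "persistent": {
    "cluster.remote.remote1.skip_unavailable": true
  }
}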

Terminology Note: I use the term "fatal" to describe errors that cause processing to stop and then return a top level error message response and 4xx or 5xx HTTP status. Most errors in ES|QL are currently fatal, by this definition. With the addition of handling of skip_unavailable=true clusters, some errors will no longer be fatal - they will be trapped and reported in response metadata and partial data will be returned from other clusters with a "successful" HTTP status of 2xx.

Note on issue history: this summary issue is based upon the discussion we had in a previous issue.

Three separate tickets/PRs

We will implement skip_unavailable handling in ES|QL CCS in three parts:

  1. Handling truly "unavailable" clusters - remotes that cannot be contacted via field-caps or enrich policy resolve API during planning.
  2. Dealing with errors due to not finding concrete indices specified by the user in the ES|QL FROM clause.
  3. Handling errors during ESQL query time (after the planning stage). Errors from skip_unavailable=true clusters will be trapped and no longer be fatal.

Ticket 1: Unavailable clusters during planning

As part of ES|QL planning of a cross-cluster search, a field-caps call is done to each cluster and, if an ENRICH command is present, the enrich policy-resolve API is called on each remote. If a remote cluster cannot be connected to in these calls, the outcome depends on the skip_unavailable setting.

For skip_unavailable=false clusters, the error is fatal and will immediately be propagated back to the client as a top-level error message with a 500 HTTP status response code, as shown in the example below:

Response when a skip_unavailable=false cluster could not be connected to:
{
  "error": {
    "root_cause": [
      {
        "type": "uncategorized_execution_exception",
        "reason": "Failed execution"
      }
    ],
    "type": "connect_transport_exception",
    "reason": "[][127.0.0.1:9302] connect_exception",
    "caused_by": {
      "type": "uncategorized_execution_exception",
      "reason": "Failed execution",
      "caused_by": {
        "type": "execution_exception",
        "reason": "io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9302",
        "caused_by": {
          "type": "annotated_connect_exception",
          "reason": "Connection refused: /127.0.0.1:9302",
          "caused_by": {
            "type": "connect_exception",
            "reason": "Connection refused"
          }
        }
      }
    }
  },
  "status": 500
}

For skip_unavailable=true clusters, the error is not fatal. The error will be trapped and recorded in the EsqlExecutionInfo object for the query, marking the cluster as SKIPPED. If the user requested ccs_metadata to be included, the cluster status and connection failure will be present in the _clusters.details section of the response, as shown in the example below:

Response when a skip_unavailable=true cluster could not be connected to:
  "_clusters": {
    "total": 3,
    "successful": 2,
    "running": 0,
    "skipped": 1,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "web*",
        "took": 17,
        "_shards": {
          "total": 2,
          "successful": 2,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote2": {
        "status": "successful",
        "indices": "web*",
        "took": 15,
        "_shards": {
          "total": 2,
          "successful": 2,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote1": {
        "status": "skipped",
        "indices": "web*",
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "connect_transport_exception",
              "reason": "Unable to connect to [remote1]"
            }
          }
        ]
      }
    }
  }

If no clusters can be contacted and they are all marked as skip_unavailable=true, no error will be returned. Instead, a 200 HTTP status will be returned with no columns and no values. If the include_ccs_metadata: true setting was included on the query, the errors will be listed in the _clusters metadata section. (Note: this is also how the _search endpoint works for CCS.)
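A minimal sketch of an ES|QL cross-cluster query requesting CCS metadata in the response (index names, cluster aliases, and the grouping field are illustrative); this assumes include_ccs_metadata is passed as a top-level field of the request body, as referenced above:

POST /_query
{
  "query": "FROM web*,remote1:web*,remote2:web* | STATS count(*) BY status",
  "include_ccs_metadata": true
}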


Ticket 2: Missing concrete indices

The principle that guides this ticket is:

If a user requests a concrete index that is not found, the query should fail with a standard exception and HTTP status code, unless the query was run against a remote cluster with the setting skip_unavailable=true.

By "concrete index", I mean a index expression that has no wildcard. Wildcards are treated more leniently.

To be consistent with existing ES|QL functionality, the HTTP status code for this error will be 400 (unlike _search, which uses 404). We will need an ESQL-specific IndexNotFoundException, since the one used by _search extends ResourceNotFoundException, which returns a 404 status code.

With the exception of some inconsistencies that need to be fixed, error handling around concrete (and wildcarded) indices will be the same as it currently is in ES|QL for the local cluster and for remote clusters marked with skip_unavailable=false.

For clusters marked with skip_unavailable=true, when a user provides a concrete index that is not found on that cluster it will be a non-fatal error. Three examples below illustrate the proposed behavior.

Use case 1: User queries for two concrete indices, one of which is present on the skip_unavailable=true cluster and one of which is not:

POST /_query
{
  "query": "FROM logs,remote1:nomatch,remote1:logs|\n STATS count(*) BY foo"
}

The query will execute against logs on both the local cluster and remote1, and a non-fatal error will be recorded in the CCS metadata against the missing "nomatch" index.

Response for Use Case 1:
  "_clusters": {
    "total": 2,
    "successful": 2,
    "running": 0,
    "skipped": 0,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "logs",
        "took": 17,
        "_shards": {
          "total": 2,
          "successful": 2,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote1": {
        "status": "successful",
        "indices": "nomatch,logs",
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "index_not_found_exception",
              "reason": "[nomatch] index is not found"
            }
          }
        ]
      }
    }
  }

Use case 2: User queries for a concrete index which is present on local but not the skip_unavailable=true remote:

POST /_query
{
  "query": "FROM logs,remote1:nomatch|\n STATS count(*) BY foo"
}

The query will execute against logs on the local cluster. Since no indices are matched on remote1, it is marked as SKIPPED and a non-fatal error is recorded in the CCS metadata against the missing "nomatch" index.

Response for Use Case 2:
  "_clusters": {
    "total": 2,
    "successful": 1,
    "running": 0,
    "skipped": 1,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "logs",
        "took": 17,
        "_shards": {
          "total": 2,
          "successful": 2,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote1": {
        "status": "skipped",
        "indices": "nomatch",
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "index_not_found_exception",
              "reason": "[nomatch] index is not found"
            }
          }
        ]
      }
    }
  }

Use case 3: User queries for indices on two skip_unavailable=true remotes where no indices are matched:

POST /_query
{
  "query": "FROM remote1:nomatch,remote2:nomatch|\n STATS count(*) BY foo"
}

If either remote1 or remote2 has the setting skip_unavailable=false, an error would be returned. If both are marked with skip_unavailable=true, the query will return an HTTP status of 200 with no data, and the CCS metadata (if requested) will indicate that both clusters were skipped and include an index_not_found failure message, as shown in the example below.

Response for Use Case 3:
{
  "took": 10,
  "columns": [],
  "values": [],
  "_clusters": {
    "total": 2,
    "successful": 0,
    "running": 0,
    "skipped": 2,
    "partial": 0,
    "failed": 0,
    "details": {
      "remote2": {
        "status": "skipped",
        "indices": "nomatch",
        "took": 0,
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "index_not_found_exception",
              "reason": "[nomatch] index is not found"
            }
          }
      },
      "remote1": {
        "status": "skipped",
        "indices": "nomatch",
        "took": 0,
        "_shards": {
          "total": 0,
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "index_not_found_exception",
              "reason": "[nomatch] index is not found"
            }
          }
        }
      }
    }
  }
}

Ticket 3: Trapping errors during ES|QL operations (after planning) from skip_unavailable=true clusters

To be fully compliant with the skip_unavailable model, we will also need to add error handling during ES|QL processing. If shard errors (or other fatal errors) occur during ES|QL processing on a remote cluster and that cluster is marked as skip_unavailable=true, we will trap those errors, avoid returning a 4xx/5xx error to the user, and instead mark the cluster as partial in the EsqlExecutionInfo, along with failure info, as we do in _search.
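A hypothetical sketch of how such a cluster might then appear in the _clusters.details metadata, modeled on the skipped examples above and on the partial-results reporting used by _search; all names, counts, and the failure reason shown here are illustrative:

      "remote1": {
        "status": "partial",
        "indices": "logs",
        "took": 21,
        "_shards": {
          "total": 4,
          "successful": 3,
          "skipped": 0,
          "failed": 1
        },
        "failures": [
          {
            "shard": 0,
            "index": "remote1:logs",
            "reason": {
              "type": "exception",
              "reason": "<error trapped during ES|QL execution on remote1>"
            }
          }
        ]
      }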
