ES|QL cross-cluster searches honor the skip_unavailable cluster setting #112886

Description

UPDATE: This issue generated a lot of discussion about the right approach. A new issue has been created that summarizes the proposed approaches to handling skip_unavailable in ES|QL.

Overview

The skip_unavailable remote cluster setting is intended to let ES admins specify whether a cross-cluster search should fail or return partial data when errors occur on a remote cluster during the search.

For _search, if skip_unavailable is true, a cross-cluster search:

  • Skips the remote cluster if its nodes are unavailable during the search. The response’s _clusters.skipped value contains a count of any skipped clusters and the _clusters.details section of the response will show a skipped status.
  • Treats errors returned by the remote cluster, such as unavailable shards, no matching indices, etc., as non-fatal. The search will continue and return results from the other clusters.
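
For reference, skip_unavailable is configured per remote cluster. A minimal example of setting it through the cluster settings API (using the remote1 alias from the examples below):

PUT /_cluster/settings
{
  "persistent": {
    "cluster.remote.remote1.skip_unavailable": true
  }
}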

ES|QL cross-cluster searches should also respect this setting, but we need to define exactly how it should work.

Proposed Implementation in ES|QL. Phase 1: field-caps and enrich policy-resolve APIs

To start, support for skip_unavailable should be implemented in both the field-caps and enrich policy-resolve APIs, both of which are called during the "pre-analysis" phase of ES|QL processing.
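
For context, the field-caps portion of pre-analysis is roughly equivalent to a cross-cluster field capabilities request like the one below (the index names are illustrative); the handling proposed here kicks in when that call cannot reach a remote cluster:

GET /index*,remote1:index*,remote2:index*/_field_caps?fields=*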

When a remote cluster cannot be connected to during the field-caps or enrich policy-resolve steps:

  • if skip_unavailable=true (the default setting) for the remote cluster, the cluster will be marked as SKIPPED in the EsqlExecutionInfo metadata object for that search, reported as skipped in the _clusters.details metadata section of the ES|QL response, and a failure reason will be provided (see examples section below).
  • if skip_unavailable=false, then a 500 HTTP status code is returned along with a single top-level error, as _search does.

If the index expression provided does not match any indices on a cluster, how should we handle that? I propose that we follow the pattern in _search:

  • if skip_unavailable=true (the default setting) for the remote cluster, the cluster will be marked as SKIPPED along with an "index_not_found" failure message
  • if skip_unavailable=false for the remote cluster, the cluster will be marked as SKIPPED along with an "index_not_found" failure message ONLY IF the index expression was specified with a wildcard (lenient handling) - see example below
  • if skip_unavailable=false for the remote cluster, and a concrete index was specified by the client (no wildcards), the error is fatal and an HTTP 404 status will be returned with an "index_not_found" failure message (see example below)

An additional consideration is how to treat the local cluster. It does not have an explicit skip_unavailable setting. Since it is the coordinating cluster, it will never be unavailable, but we need to decide how to handle the case when the index expression provided matches no indices - is it a fatal error or should we just mark the local cluster as skipped?

I propose that we treat the local cluster as if skip_unavailable=true in this case (see the illustrative sketch after this list), for three reasons:

  1. users cannot change the skip_unavailable setting for the local cluster
  2. skip_unavailable=true is the default for remote clusters, so it should be the default for local also
  3. it is consistent with how ES|QL currently behaves. (Right now, as long as one cluster has matching indices, the search will proceed and return data from the clusters with matching indices.)
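
To illustrate the proposal above: if the local index expression matches nothing but a remote does match, the local cluster entry in _clusters.details might look roughly like the sketch below. This is illustrative only; the nomatch* index name is made up and the failure format is an assumption modeled on the skipped remote entries shown later in this issue.

// illustrative sketch, not actual output
"(local)": {
  "status": "skipped",
  "indices": "nomatch*",
  "took": 0,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "failures": [
    {
      "shard": -1,
      "index": null,
      "reason": {
        "type": "index_not_found_exception",
        "reason": "no such index [nomatch*]",
        "index_uuid": "_na_",
        "index": "nomatch*"
      }
    }
  ]
}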

Reference

I have documented how _search and ES|QL currently behave with respect to indices not matching here: https://gist.github.com/quux00/a1256fd43947421e3a6993f982d065e8

Examples of proposed ES|QL response outputs

In these examples:

remote1 has skip_unavailable=true
remote2 has skip_unavailable=false

Fatal error (404) when index not found and wildcard NOT used in remote2
POST /_query/async
{
  "query": "FROM *,remote1:x,remote2:x|\n STATS count(*) by authors.last_name | LIMIT 4"
}

// response
{
  "error": {
    "root_cause": [
      {
        "type": "index_not_found_exception",
        "reason": "no such index [x] and No matching index for [x] was found on [remote2] cluster (which has skip_unavailable=false)",
        "index_uuid": "_na_",
        "index": "x"
      }
    ],
    "type": "index_not_found_exception",
    "reason": "no such index [x] and No matching index for [x] was found on [remote2] cluster (which has skip_unavailable=false)",
    "index_uuid": "_na_",
    "index": "x"
  },
  "status": 404
}
Skipped clusters when index not found and wildcard used in remote2
POST /_query/async?drop_null_columns
{
  "query": "FROM *,remote1:x,remote2:x*|\n STATS count(*) by authors.last_name | LIMIT 4"
}

// response
  "_clusters": {
    "total": 3,
    "successful": 1,
    "running": 0,
    "skipped": 2,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "*",
        "took": 50,
        "_shards": {
          "total": 21,
          "successful": 21,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote2": {
        "status": "skipped",
        "indices": "x*",
        "took": 0,
        "_shards": {
          "total": 0,
          "successful": 0,
          "skipped": 0,
          "failed": 0
        },
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "index_not_found_exception",
              "reason": "no such index [x*]",
              "index_uuid": "_na_",
              "index": "x*"
            }
          }
        ]
      },
      "remote1": {
        "status": "skipped",
        "indices": "x",
        "took": 0,
        "_shards": {
          "total": 0,
          "successful": 0,
          "skipped": 0,
          "failed": 0
        },
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "index_not_found_exception",
              "reason": "no such index [x]",
              "index_uuid": "_na_",
              "index": "x"
            }
          }
        ]
      }
    }
  }
Skipped cluster remote1 when not available
  "_clusters": {
    "total": 3,
    "successful": 2,
    "running": 0,
    "skipped": 1,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "*",
        "took": 181,
        "_shards": {
          "total": 21,
          "successful": 21,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote2": {
        "status": "successful",
        "indices": "*",
        "took": 180,
        "_shards": {
          "total": 12,
          "successful": 12,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote1": {
        "status": "skipped",
        "indices": "*",
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "connect_transport_exception",
              "reason": "Unable to connect to [remote1]"
            }
          }
        ]
      }
    }
  }
Fatal error since remote2 not available
{
  "error": {
    "root_cause": [
      {
        "type": "uncategorized_execution_exception",
        "reason": "Failed execution"
      }
    ],
    "type": "connect_transport_exception",
    "reason": "[][127.0.0.1:9302] connect_exception",
    "caused_by": {
      "type": "uncategorized_execution_exception",
      "reason": "Failed execution",
      "caused_by": {
        "type": "execution_exception",
        "reason": "io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9302",
        "caused_by": {
          "type": "annotated_connect_exception",
          "reason": "Connection refused: /127.0.0.1:9302",
          "caused_by": {
            "type": "connect_exception",
            "reason": "Connection refused"
          }
        }
      }
    }
  },
  "status": 500
}

Proposed Implementation in ES|QL. Phase 2: trapping errors during ES|QL operations (after planning)

To be fully compliant with the skip_unavailable model, we will also need to add error handling during ES|QL processing itself. If shard errors (or other fatal errors) occur during ES|QL processing on a remote cluster that is marked skip_unavailable=true, we will need to trap those errors, avoid returning a 4xx/5xx error to the user, and instead mark the cluster as either skipped or partial (depending on whether we can use the partial data that came back) in the EsqlExecutionInfo, along with failure info, as we do in _search.
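
For illustration, a remote marked skip_unavailable=true that hits shard failures mid-execution but still returns usable data might be reported as partial in _clusters.details, along the lines of the sketch below. The values, index name, and failure shape are made up, modeled on the _search partial-results format.

// illustrative sketch, not actual output
"remote1": {
  "status": "partial",
  "indices": "*",
  "took": 93,
  "_shards": {
    "total": 12,
    "successful": 10,
    "skipped": 0,
    "failed": 2
  },
  "failures": [
    {
      "shard": 3,
      "index": "remote1:logs",
      "reason": {
        "type": "exception",
        "reason": "error during ES|QL execution on a data node"
      }
    }
  ]
}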

Since ES|QL currently treats failures during ES|QL processing as fatal, I do not know how hard adding this feature will be. I would like feedback from the ES|QL team on how feasible this is and how it could be done.
