
Definition of how ES|QL cross-cluster searches will honor the skip_unavailable setting #114531

Description

Overview

The skip_unavailable remote cluster setting is intended to allow ES admins to specify whether a cross-cluster search should fail or return partial data in the face of errors on a remote cluster during a cross-cluster search.

For _search, if skip_unavailable is true, a cross-cluster search:

  • Skips the remote cluster if its nodes are unavailable during the search. The response’s _clusters.skipped value contains a count of any skipped clusters and the _clusters.details section of the response will show a skipped status.
  • Errors returned by the remote cluster, such as unavailable shards, no matching indices, etc. are not fatal. The search will continue and return results from other clusters.
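For context, a minimal sketch of a cross-cluster _search request (the cluster alias remote1 and index pattern web* are illustrative). If remote1 is marked skip_unavailable=true and is unreachable, the response still comes back with a 2xx status, _clusters.skipped is incremented, and _clusters.details shows remote1 with a skipped status:

GET /web*,remote1:web*/_search
{
  "query": {
    "match_all": {}
  }
}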

ES|QL cross-cluster searches should also respect this setting, but we need to define exactly how it will work, aligned with the ES|QL philosophy of operation.

The skip_unavailable setting only applies to remote clusters. The local, querying cluster has no such setting and the behavior of errors on the local cluster will act the same as they do currently (modulo bug fixes).
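For reference, skip_unavailable is configured per remote via the cluster settings API; a minimal sketch, where remote1 is a hypothetical cluster alias:

PUT /_cluster/settings
{
  "persistent": {
    "cluster.remote.remote1.skip_unavailable": true
  }
}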

Terminology Note: I use the term "fatal" to describe errors that cause processing to stop and then return a top level error message response and 4xx or 5xx HTTP status. Most errors in ES|QL are currently fatal, by this definition. With the addition of handling of skip_unavailable=true clusters, some errors will no longer be fatal - they will be trapped and reported in response metadata and partial data will be returned from other clusters with a "successful" HTTP status of 2xx.

Note on issue history: this summary issue is based upon the discussion we had in a previous issue.

Three separate tickets/PRs

We will implement skip_unavailable handling in ES|QL CCS in three parts:

  1. Handling truly "unavailable" clusters - remotes that cannot be contacted via field-caps or enrich policy resolve API during planning.
  2. Dealing with errors due to not finding concrete indices specified by the user in the ES|QL FROM clause.
  3. Handling errors during ESQL query time (after the planning stage). Errors from skip_unavailable=true clusters will be trapped and no longer be fatal.

Ticket 1: Unavailable clusters during planning

As part of ES|QL planning of a cross-cluster search, a field-caps call is done to each cluster and, if an ENRICH command is present, the enrich policy-resolve API is called on each remote. If a remote cluster cannot be connected to in these calls, the outcome depends on the skip_unavailable setting.

For skip_unavailable=false clusters, the error is fatal and will immediately be propagated back to the client as a top-level error message with a 500 HTTP status response code, as shown in the example below:

Response when a skip_unavailable=false cluster could not be connected to:
{
  "error": {
    "root_cause": [
      {
        "type": "uncategorized_execution_exception",
        "reason": "Failed execution"
      }
    ],
    "type": "connect_transport_exception",
    "reason": "[][127.0.0.1:9302] connect_exception",
    "caused_by": {
      "type": "uncategorized_execution_exception",
      "reason": "Failed execution",
      "caused_by": {
        "type": "execution_exception",
        "reason": "io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9302",
        "caused_by": {
          "type": "annotated_connect_exception",
          "reason": "Connection refused: /127.0.0.1:9302",
          "caused_by": {
            "type": "connect_exception",
            "reason": "Connection refused"
          }
        }
      }
    }
  },
  "status": 500
}

For skip_unavailable=true clusters, the error is not fatal. The error will be trapped and recorded in the EsqlExecutionInfo object for the query, marking the cluster as SKIPPED. If the user requested ccs_metadata to be included, the cluster status and connection failure will be present in the _clusters.details section of the response, as shown in the example below:

Response when a skip_unavailable=true cluster could not be connected to:
  "_clusters": {
    "total": 3,
    "successful": 2,
    "running": 0,
    "skipped": 1,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "web*",
        "took": 17,
        "_shards": {
          "total": 2,
          "successful": 2,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote2": {
        "status": "successful",
        "indices": "web*",
        "took": 15,
        "_shards": {
          "total": 2,
          "successful": 2,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote1": {
        "status": "skipped",
        "indices": "web*",
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "connect_transport_exception",
              "reason": "Unable to connect to [remote1]"
            }
          }
        ]
      }
    }
  }

If no clusters can be contacted and they are all marked as skip_unavailable=true, no error will be returned. Instead, a 200 HTTP status will be returned with no columns and no values. If the include_ccs_metadata: true setting was included on the query, the errors will be listed in the _clusters metadata section. (Note: this is also how the _search endpoint works for CCS.)
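A minimal sketch of an ES|QL cross-cluster query requesting CCS metadata in the response (index names, cluster aliases, and the grouping field are illustrative); this assumes include_ccs_metadata is passed as a top-level field of the request body, as referenced above:

POST /_query
{
  "query": "FROM web*,remote1:web*,remote2:web* | STATS count(*) BY status",
  "include_ccs_metadata": true
}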


Ticket 2: Missing concrete indices

The principle that guides this ticket is:

If a user requests a concrete index that is not found, the query should fail with a standard exception and HTTP status code, unless the query was run against a remote cluster with the setting skip_unavailable=true.

By "concrete index", I mean a index expression that has no wildcard. Wildcards are treated more leniently.

To be consistent with existing ES|QL functionality, the HTTP status code for this error will be 400 (unlike _search, which uses 404). We will need an ESQL-specific IndexNotFoundException, since the one used by _search extends ResourceNotFoundException, which returns a 404 status code.

With the exception of some inconsistencies that need to be fixed, error handling around concrete (and wildcarded) indices will be the same as it currently is in ES|QL for the local cluster and for remote clusters marked with skip_unavailable=false.

For clusters marked with skip_unavailable=true, when a user provides a concrete index that is not found on that cluster it will be a non-fatal error. Three examples below illustrate the proposed behavior.

Use case 1: User queries for two concrete indices, one of which is present on the skip_unavailable=true cluster and one of which is not:

POST /_query
{
  "query": "FROM logs,remote1:nomatch,remote1:logs|\n STATS count(*) BY foo"
}

The query will execute against logs on both the local cluster and remote1, and a non-fatal error will be recorded in the CCS metadata against the missing "nomatch" index.

Response for Use Case 1:
  "_clusters": {
    "total": 2,
    "successful": 2,
    "running": 0,
    "skipped": 0,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "logs",
        "took": 17,
        "_shards": {
          "total": 2,
          "successful": 2,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote1": {
        "status": "successful",
        "indices": "nomatch,logs",
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "index_not_found_exception",
              "reason": "[nomatch] index is not found"
            }
          }
        ]
      }
    }
  }

Use case 2: User queries for a concrete index which is present on local but not the skip_unavailable=true remote:

POST /_query
{
  "query": "FROM logs,remote1:nomatch|\n STATS count(*) BY foo"
}

The query will execute against logs on the local cluster. Since no indices are matched on remote1, it is marked as SKIPPED and a non-fatal error is recorded in the CCS metadata against the missing "nomatch" index.

Response for Use Case 2:
  "_clusters": {
    "total": 2,
    "successful": 1,
    "running": 0,
    "skipped": 1,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "logs",
        "took": 17,
        "_shards": {
          "total": 2,
          "successful": 2,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote1": {
        "status": "skipped",
        "indices": "nomatch",
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "index_not_found_exception",
              "reason": "[nomatch] index is not found"
            }
          }
        ]
      }
    }
  }

Use case 3: User queries for indices on two skip_unavailable=true remotes where no indices are matched:

POST /_query
{
  "query": "FROM remote1:nomatch,remote2:nomatch|\n STATS count(*) BY foo"
}

If either remote1 or remote2 has the setting skip_unavailable=false, an error would be returned. If both are marked with skip_unavailable=true, the query will return an HTTP status of 200 with no data, and the CCS metadata (if requested) will indicate that both clusters were skipped and include an index_not_found failure message, as shown in the example below.

Response for Use Case 3:
{
  "took": 10,
  "columns": [],
  "values": [],
  "_clusters": {
    "total": 2,
    "successful": 0,
    "running": 0,
    "skipped": 2,
    "partial": 0,
    "failed": 0,
    "details": {
      "remote2": {
        "status": "skipped",
        "indices": "nomatch",
        "took": 0,
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "index_not_found_exception",
              "reason": "[nomatch] index is not found"
            }
          }
      },
      "remote1": {
        "status": "skipped",
        "indices": "nomatch",
        "took": 0,
        "_shards": {
          "total": 0,
        "failures": [
          {
            "shard": -1,
            "index": null,
            "reason": {
              "type": "index_not_found_exception",
              "reason": "[nomatch] index is not found"
            }
          }
        }
      }
    }
  }
}

Ticket 3: Trapping errors during ES|QL operations (after planning) from skip_unavailable=true clusters

To be fully compliant with the skip_unavailable model, we will also need to add error handling during ES|QL processing. If shard errors (or other fatal errors) occur during ES|QL processing on a remote cluster and that cluster is marked as skip_unavailable=true, we will trap those errors, avoid returning a 4xx/5xx error to the user, and instead mark the cluster as partial in the EsqlExecutionInfo, along with failure info, as we do in _search.
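A hypothetical sketch of how such a cluster might then appear in the _clusters.details metadata, modeled on the skipped examples above and on the partial-results reporting used by _search; all names, counts, and the failure reason shown here are illustrative:

      "remote1": {
        "status": "partial",
        "indices": "logs",
        "took": 21,
        "_shards": {
          "total": 4,
          "successful": 3,
          "skipped": 0,
          "failed": 1
        },
        "failures": [
          {
            "shard": 0,
            "index": "remote1:logs",
            "reason": {
              "type": "exception",
              "reason": "<error trapped during ES|QL execution on remote1>"
            }
          }
        ]
      }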
