Optimize mappings/storage/query of datasets #1764

Description

Elastic integration packages load assets such as mappings to ensure the data is indexed in the best way, as well as dashboards, visualisations and other assets to consume the data. Currently there is no testing in place that validates whether all the indexed fields are needed and whether they are indexed in the best way. The following is a potential proposal for how some additional validation could be added.

Side note: More details around the topic can be found in the presentation from @3kt (Slides, Recording).

Compare what is mapped and ingested vs used

Each dataset maps the fields that are used in the fields.yml files. With the introduction of the dynamic ecs@mappings template in Elasticsearch, only the fields which are not part of ECS have to be mapped. Let's quickly introduce the relevant APIs based on the metrics-system.cpu-default data stream.

  • Mappings: To find out what fields are mapped in an index, the GET metrics-system.cpu-default/_mapping API can be used.
  • Disk usage per field: After ingestion, to find out how much storage each field uses, use the analyze index disk usage API: POST metrics-system.cpu-default/_disk_usage?run_expensive_tasks=true
  • Query usage: As soon as some queries / visualisations have been run, to get details on which fields are queried and how, use the field usage stats API: GET metrics-system.cpu-default/_field_usage_stats
  • Disk usage total: For completeness, there is also the stats API that gives overall disk usage: GET metrics-system.cpu-default/_stats

With the first three APIs, we can see for a single field what is mapped, how much storage it uses and how it is queried.
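
To make this scriptable, the three APIs can be called from any HTTP client. Below is a minimal sketch in Python using requests, assuming an unsecured local cluster on localhost:9200 (authentication, TLS and error handling are omitted):

import requests

ES_URL = "http://localhost:9200"            # assumption: local, unsecured cluster
DATA_STREAM = "metrics-system.cpu-default"

# Mappings: which fields are mapped and how
mappings = requests.get(f"{ES_URL}/{DATA_STREAM}/_mapping").json()

# Disk usage per field (expensive: analyzes the shards on disk)
disk_usage = requests.post(
    f"{ES_URL}/{DATA_STREAM}/_disk_usage",
    params={"run_expensive_tasks": "true"},
).json()

# Field usage stats: which fields have been accessed by queries / aggregations
field_usage = requests.get(f"{ES_URL}/{DATA_STREAM}/_field_usage_stats").json()

The later snippets in this issue reuse the mappings, disk_usage and field_usage variables from this block.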

Example

As our example, we use the metrics-system.memory-default data stream and the field system.memory.actual.used.pct. This is used in the System Metrics Overview dashboard:

(Screenshot: System Metrics Overview dashboard)

First we want to see how the field is mapped, so we run GET metrics-system.memory-default/_mapping. The result is as follows:

...
"system": {
  "properties": {
    "memory": {
      "properties": {
        "actual": {
          "properties": {
            "free": {
              "type": "long",
              "meta": {
                "unit": "byte"
              },
              "time_series_metric": "gauge"
            },
            "used": {
              "properties": {
                "bytes": {
                  "type": "long",
                  "meta": {
                    "unit": "byte"
                  },
                  "time_series_metric": "gauge"
                },
                "pct": {
                  "type": "scaled_float",
                  "meta": {
                    "unit": "percent"
                  },
                  "scaling_factor": 1000,
                  "time_series_metric": "gauge"
                }
              }
            }
          }
        },
...

The field is mapped as scaled_float with "time_series_metric": "gauge".
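
Since the mapping response is nested by object properties, comparing it with the other APIs is easier after flattening it into leaf field paths. A small sketch, reusing the mappings variable from the first snippet (with DATA_STREAM set to metrics-system.memory-default; the helper name flatten_mapping is just for illustration):

def flatten_mapping(properties, prefix=""):
    """Yield (field_path, field_type) for every leaf field of a mapping."""
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "properties" in spec:
            # object field: recurse into its children
            yield from flatten_mapping(spec["properties"], prefix=f"{path}.")
        else:
            yield path, spec.get("type", "object")

# GET <data stream>/_mapping returns one entry per backing index of the data stream
for index, body in mappings.items():
    for field, field_type in flatten_mapping(body["mappings"]["properties"]):
        print(index, field, field_type)

Now let's have a look at the disk usage.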

POST metrics-system.memory-default/_disk_usage?run_expensive_tasks=true

The following info is returned:

"system.memory.actual.used.pct": {
  "total": "818b",
  "total_in_bytes": 818,
  "inverted_index": {
    "total": "0b",
    "total_in_bytes": 0
  },
  "stored_fields": "0b",
  "stored_fields_in_bytes": 0,
  "doc_values": "818b",
  "doc_values_in_bytes": 818,
  "points": "0b",
  "points_in_bytes": 0,
  "norms": "0b",
  "norms_in_bytes": 0,
  "term_vectors": "0b",
  "term_vectors_in_bytes": 0,
  "knn_vectors": "0b",
  "knn_vectors_in_bytes": 0
},

As we can see above, all the storage is used by doc_values, which is expected as the field is set as a time_series_metric. No inverted_index exists.
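
Per-field entries like the one above sit under a fields object in the per-index part of the _disk_usage response, so ranking the most expensive fields only takes a few lines. A rough sketch, reusing the disk_usage variable from the first snippet:

# disk_usage is the response of POST <data stream>/_disk_usage?run_expensive_tasks=true
field_bytes = {}
for index, body in disk_usage.items():
    if index == "_shards":                    # skip the shards header of the response
        continue
    for field, usage in body["fields"].items():
        field_bytes[field] = field_bytes.get(field, 0) + usage["total_in_bytes"]

# top 20 fields by total bytes on disk
for field, size in sorted(field_bytes.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print(f"{size:>10}  {field}")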

Last, we check if the field is also queried as expected:

GET metrics-system.memory-default/_field_usage_stats

The relevant part of the response:
"system.memory.actual.used.pct": {
  "any": 3,
  "inverted_index": {
    "terms": 0,
    "postings": 0,
    "term_frequencies": 0,
    "positions": 0,
    "offsets": 0,
    "payloads": 0,
    "proximity": 0
  },
  "stored_fields": 0,
  "doc_values": 3,
  "points": 0,
  "norms": 0,
  "term_vectors": 0,
  "knn_vectors": 0
}

The result is good news: the field is actually queried, and only its doc_values are accessed.
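
The field usage stats response is reported per shard of each backing index, so collecting the set of fields that have been accessed at least once looks roughly like this (again reusing the field_usage variable from the first snippet):

# field_usage is the response of GET <data stream>/_field_usage_stats
used_fields = set()
for index, body in field_usage.items():
    if index == "_shards":
        continue
    for shard in body["shards"]:
        for field, stats in shard["stats"]["fields"].items():
            if stats["any"] > 0:              # accessed at least once by a query / aggregation
                used_fields.add(field)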

Next steps

If we look at other fields in metrics-system.memory-default, we can find fields which are mapped and ingested, but not used in any queries / visualisations. The basic idea I have is that the following could be done for an integration when running a test:

  • Load assets
  • Run queries (dashboards, visualisations). The API to extract them ([PoC] Dashboard queries API kibana#173416) from @drewdaemon could help here
  • Compare the results to check whether all fields are used and mapped correctly. If not, potentially remove fields or change mapping properties (see the sketch after this list)
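
Putting the earlier snippets together, such a test could flag fields that are mapped and take up disk space but were never touched by the dashboard queries. A rough sketch, reusing mappings, flatten_mapping, field_bytes and used_fields from above (how the result feeds back into fields.yml is left open):

# Fields that are mapped for the data stream
mapped_fields = set()
for index, body in mappings.items():
    mapped_fields |= {field for field, _ in flatten_mapping(body["mappings"]["properties"])}

# Mapped but never accessed by any query / visualisation, biggest on disk first
unused = mapped_fields - used_fields
for field in sorted(unused, key=lambda f: field_bytes.get(f, 0), reverse=True):
    print(f"{field_bytes.get(field, 0):>10} bytes  {field}  (mapped but never queried)")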
