Formalize the concept of data tiers in Elasticsearch #60848

Closed
@dakrone

Description

Users can currently split their deployments into tiers based on things like node attributes, and manually move data between the tiers within ILM. We'd like to take this one step further and formalize the concept of data tiers within Elasticsearch.

Context

So why formalize tiers into Elasticsearch (and beyond)? There are a number of advantages to doing this.

  • By formalizing this inside of Elasticsearch itself we shift from descriptive best practices to prescriptive best practices. Instead of a million ways to configure hot/warm/cold, we prescribe our preferred solution.
  • This allows us to be consistent in our documentation for on-prem as well as on Cloud. We don’t need to make up attributes that may differ, because we can refer to the actual role names and configuration.
  • This solution allows us to tell a story not only in our documentation, but also in our out-of-the-box configuration. The idea of data having a lifecycle becomes concrete, rather than an abstraction built on general-purpose constructs.
  • A data stream already encapsulates some of the lifecycle of data, in that we prevent certain actions on the write index and allow them only on non-write indices in the stream. This would only be strengthened by having tiers available as a first class feature.
  • A better out-of-the-box experience for users working with time-series data.
  • A user now has less to configure in their ILM policy and templates, as data can shift tiers automatically.
  • Since we have a distinction between tiers, we have the freedom to be more aggressive with our default ILM policies. For example, we can start to include policies that automatically freeze indices on a frozen tier, or use searchable snapshots by default, because tiers are now a first class idea.
  • Autoscaling can be tier-aware. Rather than having to scale based on a node attribute and not knowing whether data is even respecting that attribute by default (since we don’t respect attribute-based allocation by default), autoscaling can differentiate between the different tiers, scaling only a specific part up or down as needed.

Minimum Viable Product

There is a set of features that we’d like to provide in the MVP for formalizing data tiers. This includes functionality for the tiering itself as well as its use within other parts of ES (like ILM). While the features can be expanded at a later time, this is a good starting place.

Add tiers to Elasticsearch

The first step will be adding tiers to Elasticsearch itself. We can add the following roles to Elasticsearch:

  • data_hot
  • data_warm
  • data_cold
  • data_frozen

These roles are not mutually exclusive. When a user doesn’t specify any of these roles, but does specify the “data” role (or uses the default node role which includes “data”), we will treat the node as if it has all of the data_* roles.
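The defaulting rule above could be sketched roughly as follows. This is an illustrative Python model, not Elasticsearch code; the function name `effective_data_tiers` is hypothetical. How a node that lists both “data” and an explicit tier role is treated is not stated above, so the sketch assumes the explicit tier roles win.

```python
# Hypothetical sketch of the role-defaulting rule; not actual Elasticsearch code.
TIER_ROLES = ("data_hot", "data_warm", "data_cold", "data_frozen")


def effective_data_tiers(node_roles):
    """Return the set of tier roles a node should be treated as having."""
    roles = set(node_roles)
    explicit = roles.intersection(TIER_ROLES)
    if explicit:
        # Tier roles are not mutually exclusive, so keep every one listed.
        # Assumption: explicit tier roles take precedence over plain "data".
        return explicit
    if "data" in roles:
        # A plain "data" node is treated as if it had all data_* roles.
        return set(TIER_ROLES)
    # Nodes with no data role at all (e.g. dedicated masters) hold no tier.
    return set()
```

For example, `effective_data_tiers(["master", "data"])` yields all four tiers, while `effective_data_tiers(["master", "data_hot", "ingest"])` yields only `data_hot`.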

Not only do we need to make these tiers available for setting, we also need to make them accessible for allocation. We currently have a set of built-in attributes that users can specify in our allocation APIs: _name, _host_ip, _publish_ip, _ip, _host, and _id. I propose that we add another: _tier. This new attribute could be used manually for both cluster-level and index-level allocation, as well as within ILM. This way we avoid having to introduce a new set of allocation deciders specifically for moving data between tiers, and we already have the infrastructure for include, exclude, and require for a given set of _tier attributes.
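To illustrate how include, exclude, and require could apply to the proposed _tier attribute, here is a small Python model. It assumes _tier would follow the same semantics as the existing attribute filters (include matches at least one value, require matches all, exclude matches none); the function name `tier_filter_allows` is hypothetical and this is not the actual allocation decider.

```python
# Hypothetical model of attribute-style filtering applied to _tier.
def tier_filter_allows(node_tiers, include=None, require=None, exclude=None):
    """Return True if a node with the given tiers passes all three filters."""
    node_tiers = set(node_tiers)
    # require: the node must have every listed tier.
    if require and not set(require) <= node_tiers:
        return False
    # include: the node must have at least one listed tier.
    if include and not node_tiers & set(include):
        return False
    # exclude: the node must have none of the listed tiers.
    if exclude and node_tiers & set(exclude):
        return False
    return True
```

With this model, a `data_hot` node passes `include=["data_hot", "data_warm"]`, while a node that is both hot and warm is rejected by `exclude=["data_warm"]`.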

An example configuration for this would include the following in elasticsearch.yml:

node.roles: ["master", "data_hot", "ingest"]

One of the first uses of the new tiers will be ILM. Currently ILM has a lifecycle that includes the hot, warm, and cold phases and their actions. Making ILM aware of our tiers is a two-step process: adding the frozen tier as a new phase, and then making ILM perform the automatic migration.

Adding a “frozen” phase to ILM

Adding a frozen phase also includes adding a set of actions that are allowed as well as the parsing for the phase itself. The “frozen” phase will occur after the “cold” phase but before the “delete” phase. The list of allowed actions for the frozen phase in their execution order will be:

  • set_priority
  • unfollow
  • allocate
  • freeze
  • searchable_snapshot
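The phase placement and action ordering described above can be modelled like this. It is a sketch only: `PHASE_ORDER`, `FROZEN_ACTIONS`, and `order_frozen_actions` are hypothetical names, and the real ILM parsing would look different.

```python
# Hypothetical model of the proposed frozen phase; not actual ILM code.
PHASE_ORDER = ["hot", "warm", "cold", "frozen", "delete"]

# Allowed frozen-phase actions, listed in their execution order.
FROZEN_ACTIONS = [
    "set_priority",
    "unfollow",
    "allocate",
    "freeze",
    "searchable_snapshot",
]


def order_frozen_actions(requested):
    """Validate the requested actions and return them in execution order."""
    unknown = [a for a in requested if a not in FROZEN_ACTIONS]
    if unknown:
        raise ValueError(f"actions not allowed in the frozen phase: {unknown}")
    # Walk the canonical list so the result follows the execution order,
    # regardless of the order the user wrote the actions in.
    return [a for a in FROZEN_ACTIONS if a in requested]
```

A policy that lists `freeze` before `set_priority` would still run `set_priority` first, and an action outside the list is rejected when the phase is parsed.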

Migrating data between tiers automatically

Currently ILM doesn’t migrate any data between tiers automatically, which has tripped up users in the past (they expect it to move the data, but it doesn’t). The plan is to make ILM automatically move data to the tier corresponding to the ILM phase, unless the phase already has an allocate action that sets allocation rules (not just a replica change).
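The condition for skipping the automatic move could be expressed as below. This sketch assumes the allocate action carries its allocation rules under the include, exclude, and require keys and its replica change under number_of_replicas; `should_auto_migrate` is a hypothetical name.

```python
# Hypothetical check for whether ILM should move an index automatically.
def should_auto_migrate(phase_actions):
    """Return True unless the phase has an allocate action with allocation rules."""
    allocate = phase_actions.get("allocate")
    if allocate is None:
        # No allocate action at all: migrate to the phase's tier.
        return True
    # A replica-only allocate action does not count as custom allocation.
    has_rules = any(allocate.get(key) for key in ("include", "exclude", "require"))
    return not has_rules
```

So a phase with `{"allocate": {"number_of_replicas": 1}}` would still be migrated, while one with `{"allocate": {"require": {"box_type": "warm"}}}` would be left to the user's own rules.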

This migration should be implemented as an injected step (similar to the way we inject the “unfollow” step in phases) that runs as the first step in a phase. That way the user can monitor it through the existing ILM explain API, and it can be re-run when a user moves back to a phase. The injected step should fail fast if no nodes corresponding to the given phase are available in the cluster, and then be retried the next time the ILM policy is executed.
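A rough model of the injected step's fail-fast behaviour follows. All names here (`PHASE_TO_TIER`, `NoTierNodesError`, `migrate_step`) are hypothetical, and the setting key uses the _tier attribute proposed earlier, which is itself an assumption about the final name.

```python
# Hypothetical model of the injected migration step; not actual ILM code.
PHASE_TO_TIER = {
    "hot": "data_hot",
    "warm": "data_warm",
    "cold": "data_cold",
    "frozen": "data_frozen",
}


class NoTierNodesError(Exception):
    """Raised when the cluster has no node in the target tier."""


def migrate_step(phase, cluster_node_tiers, index_settings):
    """Return updated index settings, or fail fast if the tier has no nodes.

    cluster_node_tiers is a list with one set of effective tiers per node.
    """
    tier = PHASE_TO_TIER[phase]
    if not any(tier in tiers for tiers in cluster_node_tiers):
        # Fail fast; the caller is expected to retry on the next ILM run.
        raise NoTierNodesError(f"no nodes available for tier {tier}")
    updated = dict(index_settings)  # leave the caller's settings untouched
    updated["index.routing.allocation.include._tier"] = tier
    return updated
```

Raising an error rather than waiting keeps the failure visible through the explain API, and the step stays safe to re-run because it only rewrites a single setting.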

We should add a way to opt out of this automatic migration, rather than requiring a user to have a custom allocation as the only way to opt out.

Allocate new indices on hot nodes

In addition to making tiers something a user manages, we want new data to automatically be allocated to “hot” nodes by default. This will not affect the out-of-the-box case where each node is of type “data”, because those are considered hot nodes.

This should be implemented as default settings for the index that set:

{
  "index.routing.allocation.include._tier": "data_hot"
}

Because these are only defaults applied to a brand new index, a user can easily override them in their template, or manually when creating the index. These are the same settings that ILM will update when migrating between phases.
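The override behaviour can be pictured as a simple layered merge. This is a sketch under assumptions: the setting key uses the _tier attribute proposed earlier, the names `DEFAULT_NEW_INDEX_SETTINGS` and `resolve_new_index_settings` are hypothetical, and the precedence shown (create request over template over defaults) is assumed rather than specified here.

```python
# Hypothetical model of default tier settings for a new index.
DEFAULT_NEW_INDEX_SETTINGS = {
    "index.routing.allocation.include._tier": "data_hot",
}


def resolve_new_index_settings(template_settings=None, request_settings=None):
    """Merge defaults, template settings, and create-request settings."""
    resolved = dict(DEFAULT_NEW_INDEX_SETTINGS)
    # Later layers override earlier ones: template first, then the request.
    resolved.update(template_settings or {})
    resolved.update(request_settings or {})
    return resolved
```

With no overrides the index lands on hot nodes; a template that sets the same key to `data_warm` would win over the default without any special handling.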
