Description
Goal
The goal of this effort is to provide access to older Elasticsearch data, for compliance or regulatory reasons, the occasional lookback or investigation, or to rehydrate parts of it. Access to the data is expected to be very infrequent, and can therefore happen with limited performance and query capabilities. Running old versions of Elasticsearch to access the old data is not practical as it would require running outdated and unsupported software.
A non-goal of this effort is to fully solve the major version upgrade problem. However, "Snapshots as simple archives" is an important first step towards longer-term data retention and access. It will allow some users to refrain from upgrading their archived data, and refraining from upgrading is probably the simplest upgrade option.
Solution
Snapshots have long been used for backup purposes. With this new feature, they can also be used for archival purposes. Elasticsearch will have the ability to access older snapshot repositories and the data therein. In addition, some basic query and aggregation capabilities are available, and the data can be reindexed into newer Elasticsearch clusters without having the old cluster present. This provides the guarantee that the data put into Elasticsearch (and stored in snapshots) does not have an EOL, but can be accessed for a long time into the future (even if at reduced speed). The data can either be restored with read-only access, or it can be accessed via searchable snapshots so that the archived data won't even need to fully reside on local disks for access.
Phases
Phase 0: Prototype
- Basic implementation showing feasibility (Allow reading _source from older snapshots #77542)
Phase 1: MVP (target release: 8.3)
Allow Elasticsearch 8 nodes to access snapshot repositories written by previous Elasticsearch versions going back to Elasticsearch 5.0. Allow restoring indices from snapshots in the old repository into the Elasticsearch 8 cluster as well as mounting them as searchable snapshots. Allow basic query and aggregation capabilities based on postings / doc values as well as runtime fields on these indices.
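As a rough illustration of the two access paths described above (restore vs. mount), the sketch below builds the request line and body for the restore snapshot API and the searchable snapshot mount API. The helper names and the repository, snapshot and index names are made up for this example; nothing is sent over the network.

```python
def restore_request(repository, snapshot, indices):
    """Build a restore request for indices stored in an older repository."""
    path = f"/_snapshot/{repository}/{snapshot}/_restore"
    # The restore API accepts a comma-separated list of index names.
    body = {"indices": ",".join(indices)}
    return "POST", path, body


def mount_request(repository, snapshot, index, storage="full_copy"):
    """Build a mount request that exposes one index as a searchable snapshot."""
    # storage is either "full_copy" or "shared_cache" (partially cached on disk).
    path = f"/_snapshot/{repository}/{snapshot}/_mount?storage={storage}"
    body = {"index": index}
    return "POST", path, body
```

For example, `mount_request("old_repo", "snap_2017", "logs-2017", storage="shared_cache")` describes mounting a (hypothetical) 5.x index without keeping a full local copy of it.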
Supported field types
Old mappings are imported as much "as-is" as possible into Elasticsearch 8, but only provide regular query / aggregation capabilities on a select subset of fields:
- Numeric types
- boolean type
- ip type
- geo_point type
- date types: the date format setting on date fields is supported in so far as it behaves similarly across these versions. In case it does not, the setting can be updated on legacy indices by the user if need be.
- keyword type: the normalizer setting on keyword fields is supported in so far as it behaves similarly across these versions. In case it does not, the setting can be updated on legacy indices if need be.
- text type: scoring capabilities are limited, and all queries return constant scores that are equal to 1.0. The analyzer settings on text fields are supported in so far as they behave similarly across these versions. In case they do not, they can be updated on legacy indices if need be.
- Multi-fields
- Field aliases
- object fields
- some basic metadata fields, e.g. _type for querying Elasticsearch 5 indices
- runtime fields
- _source field
Elasticsearch 5 indices with mappings that have multiple mapping types are collapsed together on a best-effort basis before they are imported.
In case the auto-import of mappings does not work, or the new version can't make sense of the mapping, it falls back to a lightweight import of the mapping where the original mapping is stored in the _meta section of the imported index's mapping, and relies on the user to put the relevant mapping parts manually in place.
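To make the "collapse on a best-effort basis, but keep the original around" idea more concrete, here is a minimal sketch. It is not the actual Elasticsearch import logic: the first-definition-wins rule, the function name and the `legacy_mappings` key under `_meta` are assumptions chosen for illustration.

```python
def collapse_types(legacy_mappings):
    """Merge the properties of all mapping types into one typeless mapping.

    Returns (mapping, conflicts). The first definition of a field wins
    (types are visited in sorted order); fields that a later type defines
    differently are reported in `conflicts` for manual follow-up.
    """
    merged = {}
    conflicts = []
    for type_name in sorted(legacy_mappings):
        properties = legacy_mappings[type_name].get("properties", {})
        for field, definition in properties.items():
            if field not in merged:
                merged[field] = definition
            elif merged[field] != definition and field not in conflicts:
                conflicts.append(field)
    mapping = {
        "properties": merged,
        # Hypothetical key: keep the untouched original so that a user can
        # put the relevant mapping parts in place by hand.
        "_meta": {"legacy_mappings": legacy_mappings},
    }
    return mapping, conflicts
```

A field that appears with different definitions in two types ends up in `conflicts`, which is the kind of case where the user would fall back to the copy stored under `_meta`.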
Supported APIs
Archive indices are read-only, and provide data access via the search and field capabilities APIs. They support neither the Get API nor any write APIs.
Archive indices allow running queries as well as aggregations in so far as they are supported by the given field type (see above).
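A search against an archive index might look like the body built below: a range query on a date field plus a terms aggregation on a keyword runtime field. The function and field names are placeholders. The runtime field is declared without a script, in which case Elasticsearch reads the value of the same-named field from _source at search time.

```python
def archive_search_body(date_field, start, end, runtime_keyword_field):
    """Build a search body that only relies on doc values and _source."""
    return {
        "size": 0,  # only the aggregation result is of interest here
        "runtime_mappings": {
            # No script: the value is loaded from _source under the same name.
            runtime_keyword_field: {"type": "keyword"},
        },
        "query": {"range": {date_field: {"gte": start, "lt": end}}},
        "aggs": {"top_values": {"terms": {"field": runtime_keyword_field}}},
    }
```

Runtime fields are evaluated per document at search time, so this trades speed for flexibility, in line with the limited performance expectations stated in the goal.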
Due to _source access, the data can also be reindexed to a new index that has full compatibility with the current Elasticsearch version.
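Rehydrating all or part of an archive index could then be expressed as a reindex request. The sketch below only builds the request; the helper name and index names are illustrative, and the optional query restricts the copy to a subset of documents.

```python
def reindex_request(archive_index, dest_index, query=None):
    """Build a reindex request that copies _source into a regular index."""
    source = {"index": archive_index}
    if query is not None:
        # Only documents matching the query are copied.
        source["query"] = query
    body = {"source": source, "dest": {"index": dest_index}}
    return "POST", "/_reindex", body
```

For instance, passing `query={"range": {"@timestamp": {"gte": "2017-06-01"}}}` would copy only the more recent part of an old index into the new one.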
List of tasks:
- Allow listing repositories with snapshots down to ES 5.0 (Allow listing older repositories #78244, Remove extra repo flag to access archive indices #84222)
- Add recovery infrastructure hook to snapshot restore code to allow restoring older indices (Add recovery infrastructure hook to work with older Lucene indices #81056)
- Add Lucene 6 (ES 5) and Lucene 7 (ES 6) codec support (Add codec support for Lucene 6 and 7 versions #81258)
- Add support for peer recovery (Make peer recovery work with archive data #81522)
- Adapt versioning logic for handling index metadata from older versions (minimumIndexCompatibilityVersion etc.) (Introduce index.version.compatibility setting #83264)
- Handle loading older settings (e.g. only import supported settings / allow overrides) -> handle later
- Make archive indices read-only and don't allow removing write block (Use write block on archive indices #85102)
- Add license checks (License checks for archive tier #83894)
- Add feature usage stats (Feature usage actions for archive #83931)
- Add basic mapping support
  - copy existing mapping to _meta section (Copy old mappings to _meta section #83041)
  - auto-map some of it as doc-value-only fields or runtime fields (Handle legacy mappings with placeholder fields #85059)
  - support valueFetcher in PlaceHolderFieldMapper so that a search retrieving multiple fields (like "fields": ["*"]) does not fail (Allow field retrieval on placeholder fields #86289)
- Add support for properly blending out (soft-)deleted documents: support for doc values (Add doc values support for ES 5 and ES 6 #82207)
- Add support for source-only repositories (Check older source-only repos are supported #82213)
- Add support for running doc-value-based queries on the following fields (Doc-value-only fields #52728), extended BWC testing (Test doc-value-based searches on older indices #83844)
  - numeric (Allow docvalues-only search on number types #82409)
  - date (Allow doc-values only search on date types #82602)
  - keyword (Allow doc-values only search on keyword fields #82846, Implement all queries on doc-values only keyword fields #83404)
  - geo_point (Allow doc-values only search on geo_point fields #83395)
  - ip (Allow doc-values only search on ip fields #82929)
  - boolean (Allow doc-values only search on boolean fields #82925)
  - text (Add text field support to archive indices #86591)
- Support older postings formats (Support older postings formats #85303)
- Add support for handling multiple mapping types (ES 5)
  - importing mapping (Copy old mappings to _meta section #83041)
  - running queries on _type and returning _type under search results (Provide access to _type in 5.x indices #83195, Avoid duplicate _type fields in v7 compat layer #83239)
- disable GET API (throw error) (Disable get API on legacy indices #86594)
- avoid searching archive indices if data does not match timestamp-based date range query via basic codec support for points (only metadata) (Add points metadata support for archive indices #86655)
- Add reminder to add BWC codecs on major version upgrade (Add reminder to add BWC codecs on major version upgrade #86844)
- Add documentation and mark as release highlight (Docs for snapshots as simple archives #86261)
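The task about skipping archive indices for timestamp-based range queries boils down to comparing the query range with the minimum and maximum timestamp recorded for an index. The sketch below shows that comparison in isolation; the class and function names are hypothetical and the bounds are assumed to be available as epoch milliseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimestampBounds:
    """Smallest and largest timestamp stored in an index (epoch millis)."""
    min_millis: int
    max_millis: int


def can_match(bounds, gte=None, lt=None):
    """Return False when no document can fall into the range [gte, lt)."""
    if gte is not None and bounds.max_millis < gte:
        return False  # all documents are older than the range
    if lt is not None and bounds.min_millis >= lt:
        return False  # all documents are newer than the range
    return True


def indices_to_search(index_bounds, gte=None, lt=None):
    """Keep only the indices whose bounds overlap the query range."""
    return [name for name, b in index_bounds.items() if can_match(b, gte, lt)]
```

Only the bounds are needed for this check, which is why metadata-only support for points is sufficient for this purpose.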
Phase 2: Cluster management & ILM integration
Phase 1 still requires users to take extra steps during a major version upgrade: snapshot the data that can't make it to the next major version, delete it from the cluster, do the upgrade, and finally restore / mount the data again as legacy indices. The goal of phase 2 is to automate some of this, making it easier for users to go through a major version upgrade. Some steps could include providing an ILM integration so that indices can be transitioned to an "archive" where they will be limited to doc-values / source-only access, as well as allowing users to upgrade to the next major version by auto-converting indices to archival.
Phase 2 won't be worked on immediately and is captured in #87291
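The first manual step of the upgrade procedure, identifying the data that "can't make it to the next major version", can be sketched as follows. It assumes the usual compatibility rule that a major version reads indices created in the previous major version or later; the function name and the input format (index name mapped to the major version it was created in) are made up for this example.

```python
def partition_for_upgrade(index_created_major, target_major):
    """Split indices into those that survive the upgrade and those to archive.

    index_created_major maps an index name to the major version that created it.
    """
    min_supported = target_major - 1
    keep, archive = [], []
    for name, major in sorted(index_created_major.items()):
        if major >= min_supported:
            keep.append(name)
        else:
            # Must be snapshotted and deleted before upgrading, then
            # restored or mounted as an archive index afterwards.
            archive.append(name)
    return keep, archive
```

With a target of version 8, an index created in 7.x is kept, while indices created in 5.x or 6.x land in the archive list.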