Proposal Amendment: UTF-8 migration #30

Merged Dec 5, 2023 (10 commits)
proposals/2023-11-13-utf8-migration.md: 174 additions, 0 deletions

# Amendment: Enabling Smooth Migration of UTF-8 Metric and Label names

* **Owners:**
* `<@author: owen.williams@grafana.com>`

* **Implementation Status:** N/A

* **Related Issues and PRs:**
* [GH Issue](https://github.com/prometheus/prometheus/issues/12630)
* [PR](https://github.com/grafana/mimir-prometheus/pull/476) (TODO: needs to be rebased on upstream prom)

* **Other docs or links:**
* [Parent Proposal](https://github.com/prometheus/proposals/blob/main/proposals/2023-08-21-utf8.md)
* [Background Discussion / Justification](https://docs.google.com/document/d/1yFj5QSd1AgCYecZ9EJ8f2t4OgF2KBZgJYVde-uzVEtI/edit). Please read this document first for more information on the chosen solution.

> TL;DR: This is an amendment to the existing UTF-8 proposal that provides more detail in the backwards compatibility and migration scenarios.

## Why

## Goals

* Allow queries to transparently read data from blocks generated by combinations of old and new versions of tsdb and scraping clients.
* Minimize edge cases where behavior is undefined or suboptimal or risks bad results.

### Audience

This amendment is for users who plan to migrate existing Prometheus deployments to support UTF-8 metric and label names and who want to ensure continuity in query behavior throughout the upgrade process.

## Non-Goals

We do not promise smooth accommodation of every edge case, especially pathological ones (see Name Collisions below).
In those instances, users may not be able to turn on UTF-8 support, or may need to rename metrics or labels.

## How

Given a query for a UTF-8 metric or label name, the tsdb will look for that name in on-disk blocks whether those blocks were written in native UTF-8 or in any of the supported name-escaping patterns.
Those series will be located even when a single block has one metric written in more than one way.
The tsdb will differentiate those blocks based on entries in meta.json and new command-line flags.

### Mixed-Format Scenarios

We must consider edge cases in which a blocks database has persisted metrics or labels that may have been written by different client versions. There are multiple ways this can (and will) happen:

* A newer client persists names to an older Prometheus version. In this case, names would be escaped with any of the available escaping methods. If Prometheus is upgraded, newer blocks will be written in UTF-8.
* A newer Prometheus receives names from an older client, which is later upgraded. In this case, older names might be escaped using the replace-with-underscores method, and newer names will be UTF-8. This will often happen when Prometheus is receiving OpenTelemetry metrics.
* A newer Prometheus receives names from a mix of new and old clients, in which case the same block could contain escaped and UTF-8 data representing the same intended names.

At query time, there will be a problem: some data may have been written in UTF-8 and other data in an escaped format.
The query code will not know which encoding to look for.
In order to ensure consistent querying, the backwards-compatibility design must account for these scenarios, making trade-offs when needed.

All of these situations can be summarized as follows:

1. **Old Data** -- Data written with old Prometheus code: all names are guaranteed not to be UTF-8.
Review comment (Member):

This sounds like it includes the case where an old Prometheus has ingested from new producers (and names might include escaped names).

Or is this referring to old Prometheus and old producers, so even escaping in names can be ruled out?

Could you clarify?

Review comment (Member):

After having read through all the below, I would say what's meant here is "if at all, we will have escaped names, so we would never query for UTF-8 names, but only try out the specified escaping schemas".

Still, I think the difference to the mixed data case should be made clearer.

In principle, there might be a scenario where the user knows for sure that they won't have any escaping at all, for example if they had a pure Prometheus stack so far (no OTel etc.), but they would like to use UTF-8 names, so they still have a period of mixed versions deployed (new and old producers, new and old ingesters). I guess it's fine to not implement a specific optimization for that use case (which would be that we don't need a broad search for the old blocks at all), but it would be good if that case is described and that we disregard it as an informed decision.

2. **Mixed Data** -- Data written with new Prometheus code by one or more old clients (and possibly new clients as well): no guarantees; some names could be escaped, others not.
3. **New Data** -- Data written with new Prometheus code by new clients: all names are guaranteed to be UTF-8-compatible.

### Time Scope

The issue of mixed-format blocks will persist for the retention period of the tsdb.
For some deployments this means only 14 days, for others it may be on the order of years of persisted old data.

### Proposed Solution

For queries to return correct data we must differentiate the three cases above, and to do that we first propose to bump the version number in the tsdb meta.json file.
On a per-block basis, the query code can check the version number and know if the data was written with an old version of the Prometheus code.
Review comment (Member):

I am not sure how this is useful. What query will do if it sees old block? When UTF-8 is queried? I think we need to still do this "migration broad search" for those, no? As we don't know what escape method clients used.

Is it as an optimization to immediately say no result on those blocks for non escaped UTF-8 lookup?

Review comment (Member, Author):

Yes I suppose the only real benefit is removing the UTF-8 query from the list of possibilities. The other escapings may all be possible. Do you think that maybe we don't need a version number bump at all and can just use the flag/date-based logic?

Review comment (Member):

Yea, I am fine with both.

It feels like small change to bump version and might unlock some optimizations, so I am fine keeping this up, just wanted to clarify.

Review comment (Member):

So yeah, I guess the only optimization is that we never have to run the original UTF-8 query on blocks with an older version. Not sure if that is worth the effort of versioning the blocks. But maybe, as @bwplotka said, the effort is actually quite low, and we might see additional uses of the version number later.

This helps distinguish the first case.

Second, we will add two new flags. Together they describe which escaping formats may be present and the range of dates affected by mixed blocks, and they are used to distinguish the second case from the third.

* `-promql.utf8_broad_lookup.escape_formats`: This flag tells the PromQL engine which escaping methods might previously have been used to escape UTF-8 characters. When a metric or label name in a query contains UTF-8 characters, the engine transparently repeats the series lookup once for each listed escaping format. Available values will be a short enum representing underscores, `U__`, or dots-only escaping.
* `-promql.utf8_migration.until=<date-time>`: This flag indicates the latest date-time (inclusive) for blocks that may contain mixed data. Any data written after this moment is exclusively UTF-8.
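
The per-block decision implied by the version bump and the `until` flag can be sketched as follows. This is a minimal illustration in Python, not the Prometheus implementation: `UTF8_BLOCK_VERSION`, `BlockKind`, `classify_block`, and `lookup_plan` are hypothetical names, the version value `2` is a placeholder, and treating an unset `until` as "migration still in progress" is an assumption.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Optional, Tuple

# Assumed placeholder: the meta.json version written by UTF-8-aware code.
UTF8_BLOCK_VERSION = 2


class BlockKind(Enum):
    OLD = "old"      # written by pre-UTF-8 Prometheus code
    MIXED = "mixed"  # new Prometheus code, old clients may have written to it
    NEW = "new"      # new Prometheus code, new clients only


def classify_block(version: int, min_time: datetime,
                   migration_until: Optional[datetime]) -> BlockKind:
    """Map a block to one of the three cases (old, mixed, new)."""
    if version < UTF8_BLOCK_VERSION:
        return BlockKind.OLD
    # Assumption: while the until flag is unset, old clients may still be
    # sending data, so every new-version block is treated as mixed.
    if migration_until is None or min_time <= migration_until:
        return BlockKind.MIXED
    return BlockKind.NEW


def lookup_plan(kind: BlockKind) -> Tuple[bool, bool]:
    """Return (query_utf8_name, query_escaped_names) for a block."""
    if kind is BlockKind.OLD:
        return (False, True)   # UTF-8 names cannot exist in old blocks
    if kind is BlockKind.MIXED:
        return (True, True)    # broad lookup: try every spelling
    return (True, False)       # only native UTF-8 names are present


until = datetime(2024, 1, 1, tzinfo=timezone.utc)
kind = classify_block(2, datetime(2023, 12, 1, tzinfo=timezone.utc), until)
print(kind, lookup_plan(kind))
```

The sketch compares the block's minimum timestamp against `until`, since a block that starts before the cutoff may still hold samples from old clients.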

#### Migration Timeline

A Prometheus migration to UTF-8 will follow this timeline:

1. Prometheus is upgraded and UTF-8 support enabled. The `-promql.utf8_broad_lookup.escape_formats` flag is set immediately, enabling the multi-lookup behavior and listing the possible escaping schemes.
2. Clients are gradually upgraded to UTF-8.
3. `-promql.utf8_migration.until` is set to the last date-time when a non-UTF-8 client sent data.
4. Wait for the retention period to elapse such that the migration-until date is expired (could be years).
5. The migration is complete. Remove `-promql.utf8_broad_lookup.escape_formats` and `-promql.utf8_migration.until` as they are no longer needed.
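
As a rough aid for step 4, the earliest moment at which both flags can be dropped can be estimated from the `until` value and the retention period. The helper names below are hypothetical, and the `max_block_span` margin is an assumption: a block that straddles the cutoff is only deleted once its newest sample has aged out, so the real date may be somewhat later than `until` plus retention.

```python
from datetime import datetime, timedelta, timezone


def earliest_flag_removal(migration_until: datetime,
                          retention: timedelta,
                          max_block_span: timedelta = timedelta(0)) -> datetime:
    """Estimate when no block can still hold data from before the cutoff."""
    return migration_until + retention + max_block_span


def migration_complete(now: datetime,
                       migration_until: datetime,
                       retention: timedelta,
                       max_block_span: timedelta = timedelta(0)) -> bool:
    """True once every possibly-mixed block has aged out of retention."""
    return now > earliest_flag_removal(migration_until, retention, max_block_span)


until = datetime(2024, 3, 1, tzinfo=timezone.utc)
print(earliest_flag_removal(until, timedelta(days=14)))
```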

### Querying Mixed Blocks

The last major challenge is correctly returning data for queries of blocks that contain mixed data.
For the mixed-format scenarios, at query time, we will look for **all possible** escapings of a name in order to locate the correct data.
We propose to do this by expanding a lookup for a UTF-8 metric or label name into a limited set of possible escapings:

1. **UTF-8**
2. **underscore-replaced**: All unsupported characters are converted to underscores.
3. **U__ escaping**: As described in the UTF-8 proposal, strings with invalid characters can be escaped by prepending `U__` and replacing all invalid characters with `_[UTF8 value]_`.
4. **[Datadog proxy](https://github.com/grafana/mimir-proxies/blob/main/pkg/datadog/ddprom/naming.go#L30-L34) escaping pattern**: "`.`" becomes "`_dot_`" and "`_`" becomes "`__`".

In PromQL, the expansion would look something like this under the hood:

User-generated query:

`{"my.utf8.metric", "my.label"="value"}`

Expanded queries:

* `{"my.utf8.metric", "my.label"="value"}`
* `{"my_utf8_metric", "my_label"="value"}`
* `{"U__my_2E_utf8_2E_metric", "U__my_2E_label"="value"}`
* `{"my_dot_utf8_dot_metric", "my_dot_label"="value"}`
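
The name expansion can be sketched as below. This is an illustrative Python version based only on the descriptions in this section, not the actual Prometheus code. The function names and the format keys (`"underscores"`, `"U__"`, `"dots"`) are hypothetical, and the set of "valid" characters is simplified to the legacy metric-name character set (`a-z`, `A-Z`, `0-9`, `_`, `:`), ignoring rules such as the restriction on a leading digit.

```python
import re
from typing import Callable, Dict, List

# Simplifying assumption: anything outside this set counts as "invalid".
_INVALID = re.compile(r"[^a-zA-Z0-9_:]")


def escape_underscores(name: str) -> str:
    """Replace every unsupported character with an underscore."""
    return _INVALID.sub("_", name)


def escape_u(name: str) -> str:
    """Prefix with U__ and replace invalid characters with _<hex code point>_."""
    if not _INVALID.search(name):
        return name  # nothing to escape, name is stored as-is
    return "U__" + _INVALID.sub(lambda m: "_%X_" % ord(m.group()), name)


def escape_datadog(name: str) -> str:
    """Datadog proxy pattern: '_' becomes '__', then '.' becomes '_dot_'."""
    return name.replace("_", "__").replace(".", "_dot_")


# Hypothetical keys standing in for the escape_formats enum values.
ESCAPERS: Dict[str, Callable[[str], str]] = {
    "underscores": escape_underscores,
    "U__": escape_u,
    "dots": escape_datadog,
}


def expand_name(name: str, formats: List[str]) -> List[str]:
    """Return the UTF-8 name followed by each distinct escaped spelling."""
    candidates = [name]
    for fmt in formats:
        escaped = ESCAPERS[fmt](name)
        if escaped not in candidates:
            candidates.append(escaped)
    return candidates


print(expand_name("my.utf8.metric", ["underscores", "U__", "dots"]))
# ['my.utf8.metric', 'my_utf8_metric', 'U__my_2E_utf8_2E_metric', 'my_dot_utf8_dot_metric']
```

Deduplicating the candidates means a name that needs no escaping under a given scheme does not trigger a redundant lookup for that scheme.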

The escape_formats flag mentioned above enables the behavior and specifies which of the escaping schemes might be in use.
If an administrator knows that no metrics will use the `U__` pattern, it can be safely skipped.
Hypothetically, if additional replacement patterns are found, they could be easily added to the list of possible configuration options as a minor update.

Redundant lookups will increase query time, but the hope is that index lookups are fast enough that the penalty will be small.
We will do performance testing to identify possible issues.

### Regex lookups

If the user is querying for metrics using a regex lookup for the `__name__` label, attempting to rewrite that query to account for other name encodings would be overly complex and error-prone.
Therefore we will not try to rewrite the regex to account for multiple escaping methods and the regex will be passed through as-is.
Users will need to write custom regex queries to account for metric name changes during the transition period in this case.
Since regex queries on metric names are relatively rare and the domain of advanced users, we feel this is an acceptable approach.

### Name Collisions

In most cases, we do not anticipate bad query results due to name collisions in the case where names are escaped by an old client using the underscore method.
This is because collisions would occur at write time, when the colliding names are written to the database.
Any problems with collisions will occur well before a migration to UTF-8 support takes place.
Therefore, query behavior in the presence of name collisions caused by underscore replacement is left undefined.

Hypothetically, there could be collisions in the following situation:

1. A database has incoming names generated by an old client that escapes names with underscores.
2. That database also has incoming names written in UTF-8 by a new client.
3. There is a UTF-8 name that collides with a similar name sent by the old client.

For example, an old client is sending "service.name", and that is getting escaped to "service_name" by that client at write time.
And then, a newer client is sending "service/name" as native UTF-8.
The error occurs when the user tries to query for "service/name": because an old client was writing to the same blocks as the new one, the query will be expanded to look for "service_name" and will accidentally grab the metrics meant for "service.name".
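
An administrator could check a list of known names for this kind of collision before enabling the broad lookup. The following is a small sketch under the same simplified character set as the underscore method described earlier; `escape_underscores` and `find_collisions` are hypothetical helpers, not part of Prometheus.

```python
import re
from typing import Dict, Iterable, List


def escape_underscores(name: str) -> str:
    """Replace every character outside the legacy set with an underscore."""
    return re.sub(r"[^a-zA-Z0-9_:]", "_", name)


def find_collisions(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group names by escaped form and keep groups with more than one member."""
    by_escaped: Dict[str, List[str]] = {}
    for name in names:
        by_escaped.setdefault(escape_underscores(name), []).append(name)
    return {esc: group for esc, group in by_escaped.items() if len(group) > 1}


print(find_collisions(["service.name", "service/name", "http.requests"]))
# {'service_name': ['service.name', 'service/name']}
```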

The short answer to avoiding this scenario is **don't do that**. Specifically: while any old clients are present, avoid constructing metrics or labels that could cause collisions; if that is unavoidable, don't mix old and new clients together.

As long as all the clients are new, users do not need to worry about collisions -- "service.name" and "service/name" will be stored separately and the queries will never have to be expanded to include the escaped "service_name" possibility.

This situation seems contrived enough that we are comfortable not supporting it.

## Discarded Approaches

### Record the oldest client version used to write data

A previous draft suggested recording the oldest client version used to ingest data in order to determine which blocks might have mixed data.
This approach was overly complicated and would require a lot of plumbing to make it work.
There were also potential issues with block compaction and trying to make sure that the metadata is merged correctly.
Ultimately we decided that having administrators declare transition dates was an easier approach.

### Rewrite Old Data

We could have required that users rewrite their tsdb blocks to "upgrade" them to UTF-8 and undo the escaping.
This approach seems tedious, difficult, and dangerous -- what if something goes wrong during rewriting?
Requiring massive data rewrites is not a reasonable ask of users.

### Lookup Table / Per-Name Config

We considered recording a lookup table or per-name configuration that would describe how UTF-8 metrics and labels might be stored in old data blocks.
This approach would be faster than doing query expansion, but would create extra operational overhead -- lookup tables would have to be correct and exhaustive.

Because names are stored in the index, query expansion is not expensive enough to justify the extra operational overhead.

### No Migration -- Write Both Versions

We very briefly considered having the tsdb write every variant of a name (native UTF-8 and escaped), as long as the user configured it that way.
That way queries for both the native UTF-8 name and the escaped name would succeed.
When the migration was complete, users could turn off double-writing and only write UTF-8.

This approach would cause an explosion of on-disk usage.
As disk is one of the most expensive resources, this approach was quickly discarded.