prometheus · beorn7 · Dec 5, 2023 · Nov 15, 2023 · Nov 17, 2023 · Nov 21, 2023
diff --git a/proposals/2023-11-13-utf8-migration.md b/proposals/2023-11-13-utf8-migration.md
@@ -0,0 +1,174 @@
+# Amendment: Enabling Smooth Migration of UTF-8 Metric and Label names
+
+* **Owners:**
+  * `<@author: owen.williams@grafana.com>`
+
+* **Implementation Status:** N/A
+
+* **Related Issues and PRs:**
+  * [GH Issue](https://github.com/prometheus/prometheus/issues/12630)
+  * [PR](https://github.com/grafana/mimir-prometheus/pull/476) (TODO: needs to be rebased on upstream prom)
+
+* **Other docs or links:**
+  * [Parent Proposal](https://github.com/prometheus/proposals/blob/main/proposals/2023-08-21-utf8.md)
+  * [Background Discussion / Justification](https://docs.google.com/document/d/1yFj5QSd1AgCYecZ9EJ8f2t4OgF2KBZgJYVde-uzVEtI/edit). Please read this document first for more information on the chosen solution.
+
+> TL;DR: This is an amendment to the existing UTF-8 proposal that provides more detail on backwards compatibility and migration scenarios.
+
+## Why
+
+## Goals
+
+* Allow queries to transparently read data from blocks generated by combinations of old and new versions of tsdb and scraping clients.
+* Minimize edge cases where behavior is undefined or suboptimal or risks bad results.
+
+### Audience
+
+This amendment is for users who plan to migrate existing Prometheus deployments to support UTF-8 metric and label names and who want query behavior to stay consistent throughout the upgrade.
+
+## Non-Goals
+
+We do not promise smooth accommodation of every edge case, especially pathological ones (see Name Collisions below).
+In those instances, users may not be able to turn on UTF-8 support, or may need to rename metrics or labels.
+
+## How
+
+Given a query for a UTF-8 metric or label name, the tsdb will look for that name in on-disk blocks regardless of whether those blocks were written in native UTF-8 or with any of the supported name-escaping patterns.
+Those series will be located even when a single block contains the same metric written in more than one way.
+The tsdb will differentiate those blocks based on the version recorded in meta.json and two new flags.
+
+### Mixed-Format Scenarios
+
+We must consider edge cases in which a blocks database has persisted metrics or labels that may have been written by different client versions. There are multiple ways this can (and will) happen:
+
+* A newer client writes names to an older Prometheus version. In this case, names would be escaped with any of the available escaping methods. If Prometheus is later upgraded, newer blocks will be written in UTF-8.
+* A newer Prometheus receives names from an older client, which is later upgraded. In this case, older names might be escaped using the replace-with-underscores method, and newer names will be UTF-8. This will often happen when Prometheus is receiving OpenTelemetry metrics.
+* A newer Prometheus receives names from a mix of new and old clients, in which case the same block could contain escaped and UTF-8 data representing the same intended names.
+
+At query time this creates a problem: some data may have been written in UTF-8 and other data in an escaped format.
+The query code will not know which encoding to look for.
+In order to ensure consistent querying, the backwards-compatibility design must account for these scenarios, making trade-offs when needed.
+
+All of these situations can be summarized as follows:
+
+1. **Old Data** -- Data written with old Prometheus code: all names are guaranteed not to be UTF-8.
+2. **Mixed Data** -- Data written with new Prometheus code by one or more old clients (and possibly new clients as well): No guarantees, some names could be escaped, others not.
+3. **New Data** -- Data written with new Prometheus code by new clients: all names are guaranteed to be UTF-8-compatible.
+
+### Time Scope
+
+The issue of mixed-format blocks will persist for the retention period of the tsdb.
+For some deployments this means only 14 days, for others it may be on the order of years of persisted old data.
+
+### Proposed Solution
+
+For queries to return correct data we must differentiate the three cases above, and to do that we first propose to bump the version number in the tsdb meta.json file.
+On a per-block basis, the query code can check the version number and know if the data was written with an old version of the Prometheus code.
+This helps distinguish the first case.
+
+Second, we will add two new flags. Together they declare which escaping formats to look for and the range of dates affected by mixed blocks, which distinguishes the second case from the third.
+
+* `-promql.utf8_broad_lookup.escape_formats`: This flag tells the PromQL engine which escaping methods might previously have been used to escape UTF-8 characters. When a metric or label name contains UTF-8 characters, the engine transparently repeats the series lookup once for each listed escaping format. Available values will be a short enum representing underscores, U__, or dots-only escaping.
+* `-promql.utf8_migration.until=<date-time>`: This flag indicates the latest date-time (inclusive) for blocks that may contain mixed data. Any data written after this moment is exclusively UTF-8.
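
The version check and the `until` flag combine into a three-way decision per block. The following is a minimal Python sketch, not the tsdb implementation. It assumes the bumped meta.json version is 2 and that a block counts as mixed when its earliest sample falls at or before the `until` moment. The function name, the version constant, and the handling of an unset flag are illustrative assumptions.

```python
from datetime import datetime, timezone

# Assumption: the proposal only says the meta.json version will be bumped;
# the concrete value 2 is illustrative.
UTF8_META_VERSION = 2


def classify_block(meta_version, block_min_time, migration_until):
    """Return "old", "mixed", or "new" for a single block.

    meta_version    -- "version" field read from the block's meta.json
    block_min_time  -- timestamp of the earliest sample in the block
    migration_until -- value of -promql.utf8_migration.until, or None if unset
    """
    if meta_version < UTF8_META_VERSION:
        # Case 1: written by pre-UTF-8 Prometheus code, so no UTF-8 names.
        return "old"
    # Assumption: while the flag is unset the migration is still in progress,
    # so every new-format block has to be treated as possibly mixed.
    if migration_until is None or block_min_time <= migration_until:
        # Case 2: the block holds data from the mixed period (inclusive bound).
        return "mixed"
    # Case 3: all data in the block was written after the declared cutoff.
    return "new"


until = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(classify_block(2, datetime(2024, 1, 2, tzinfo=timezone.utc), until))
```

Only blocks classified as mixed need the expanded lookups. Old blocks need only the escaped names, and new blocks need only the UTF-8 name.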
+
+#### Migration Timeline
+
+A Prometheus migration to UTF-8 will follow this timeline:
+
+1. Prometheus is upgraded and UTF-8 support enabled. The `-promql.utf8_broad_lookup.escape_formats` flag is set immediately, listing the possible escaping schemes and enabling the multi-lookup behavior.
+2. Clients are gradually upgraded to UTF-8.
+3. `-promql.utf8_migration.until` is set to the last date-time when a non-UTF-8 client sent data.
+4. Wait for the retention period to elapse so that all data from before the migration-until date has expired (this could take years).
+5. The migration is complete. Remove `-promql.utf8_broad_lookup.escape_formats` and `-promql.utf8_migration.until` as they are no longer needed.
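
Step 4 is simple date arithmetic: the flags stay until the retention window has moved past the migration-until moment. A small Python sketch, with a hypothetical helper name and example dates chosen only for illustration:

```python
from datetime import datetime, timedelta, timezone


def earliest_flag_removal(migration_until, retention):
    """Earliest moment at which no retained data predates migration_until."""
    return migration_until + retention


until = datetime(2024, 3, 1, tzinfo=timezone.utc)
print(earliest_flag_removal(until, timedelta(days=14)))   # short retention
print(earliest_flag_removal(until, timedelta(days=730)))  # two-year retention
```

Because retention is applied to whole blocks, the real removal date may lag this estimate somewhat. Treat the result as a lower bound.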
+
+### Querying Mixed Blocks
+
+The last major challenge is correctly returning data for queries of blocks that contain mixed data.
+For the mixed-format scenarios, at query time, we will look for **all possible** escapings of a name in order to locate the correct data.
+We propose to do this by expanding a lookup for a UTF-8 metric or label name into a limited set of possible escapings:
+
+1. **UTF-8**
+2. **underscore-replaced**: All unsupported characters are converted to underscores.
+3. **U__ escaping**: As described in the UTF-8 proposal, strings with invalid characters can be escaped by prepending `U__` and replacing each invalid character with `_[UTF8 value]_`.
+4. **[Datadog proxy](https://github.com/grafana/mimir-proxies/blob/main/pkg/datadog/ddprom/naming.go#L30-L34) escaping pattern**: "`.`" becomes "`_dot_`" and "`_`" becomes "`__`".
+
+In PromQL, the expansion would look something like this under the hood:
+
+User-generated query:
+
+`{"my.utf8.metric", "my.label"="value"}`
+
+Expanded queries:
+
+* `{"my.utf8.metric", "my.label"="value"}`
+* `{"my_utf8_metric", "my_label"="value"}`
+* `{"U__my_2E_utf8_2E_metric", "U__my_2E_label"="value"}`
+* `{"my_dot_utf8_dot_metric", "my_dot_label"="value"}`
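
The expansion can be sketched as follows. This is a minimal Python illustration, not the PromQL engine's implementation. The enum values `underscores`, `U__`, and `dots` are hypothetical stand-ins for the flag's values. One legacy character set is used for both metric and label names. The `U__` escaper implements only the rule stated in this document; the parent proposal defines the complete scheme.

```python
import re

# Simplification: one legacy character set for metric and label names
# (legacy label names do not actually allow ":").
_INVALID = re.compile(r"[^a-zA-Z0-9_:]")


def escape_underscores(name):
    """Replace every unsupported character with an underscore."""
    return _INVALID.sub("_", name)


def escape_u(name):
    """Prefix with U__ and replace each invalid character with _HEX_.

    Only the rule described in this document is implemented; names that are
    already valid are returned unchanged.
    """
    if not _INVALID.search(name):
        return name
    return "U__" + _INVALID.sub(lambda m: "_%X_" % ord(m.group()), name)


def escape_datadog(name):
    """Datadog proxy pattern: "_" becomes "__", then "." becomes "_dot_"."""
    # Underscores are doubled first so the ones inside "_dot_" stay single.
    return name.replace("_", "__").replace(".", "_dot_")


# Hypothetical enum values for -promql.utf8_broad_lookup.escape_formats.
ESCAPERS = {
    "underscores": escape_underscores,
    "U__": escape_u,
    "dots": escape_datadog,
}


def expand_name(name, escape_formats):
    """Return the UTF-8 name plus one candidate per enabled escaping format."""
    candidates = [name]
    for fmt in escape_formats:
        escaped = ESCAPERS[fmt](name)
        if escaped not in candidates:  # skip duplicates to avoid extra lookups
            candidates.append(escaped)
    return candidates


for candidate in expand_name("my.utf8.metric", ["underscores", "U__", "dots"]):
    print(candidate)
```

A name such as `up`, which needs no escaping under any scheme, collapses to a single candidate, so no redundant lookups are issued for it.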
+
+The escape_formats flag mentioned above enables the behavior and specifies which of the escaping schemes might be in use.
+If an administrator knows that no metrics use the `U__` pattern, that scheme can safely be left out of the list.
+Hypothetically, if additional replacement patterns are found, they could be easily added to the list of possible configuration options as a minor update.
+
+Redundant lookups will increase query time, but the hope is that index lookups are fast enough that the penalty will be small.
+We will do performance testing to identify possible issues.
+
+### Regex lookups
+
+If the user is querying for metrics using a regex lookup for the `__name__` label, attempting to rewrite that query to account for other name encodings would be overly complex and error-prone.
+Therefore we will not rewrite the regex to account for multiple escaping methods; it will be passed through as-is.
+In this case, users will need to write custom regex queries that account for metric name changes during the transition period.
+Since regex queries on metric names are relatively rare and the domain of advanced users, we feel this is an acceptable approach.
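
As an illustration of such a hand-written pattern, the sketch below checks one candidate regex against a few names. The pattern and the names are made up for this example. Python's `re.fullmatch` stands in for PromQL's fully anchored RE2 matching, which behaves the same way for a pattern this simple.

```python
import re

# Hand-written pattern covering the UTF-8 form and the underscore-escaped
# form of names that start with "my.utf8.".
pattern = r"my[._]utf8[._].*"

names = [
    "my.utf8.metric",           # native UTF-8
    "my_utf8_metric",           # underscore-replaced
    "my.utf8.other",            # another metric in the same family
    "U__my_2E_utf8_2E_metric",  # U__ escaping: not covered by this pattern
    "unrelated_metric",
]

# PromQL regex matchers are fully anchored, hence fullmatch rather than search.
matched = [name for name in names if re.fullmatch(pattern, name)]
print(matched)
```

The `U__` and `_dot_` forms would each need their own alternative in the pattern. That per-scheme growth is why automatic rewriting was ruled out.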
+
+### Name Collisions
+
+In most cases, we do not anticipate bad query results from name collisions when names are escaped by an old client using the underscore method.
+This is because such collisions occur at write time, when the colliding names are written to the database.
+Any problems they cause will therefore appear well before a migration to UTF-8 support takes place.
+For that reason, this proposal leaves behavior under collisions caused by underscore replacement undefined.
+
+Hypothetically, there could be collisions in the following situation:
+
+1. A database has incoming names generated by an old client that escapes names with underscores.
+2. That database also has incoming names written in UTF-8 by a new client.
+3. There is a UTF-8 name that collides with a similar name sent by the old client.
+
+For example, an old client sends "service.name", which that client escapes to "service_name" at write time.
+Meanwhile, a newer client sends "service/name" as native UTF-8.
+The error occurs when the user tries to query for "service/name": because an old client was writing to the same blocks as the new one, the query will be expanded to look for "service_name" and will accidentally grab the metrics meant for "service.name".
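
The collision can be reproduced in a few lines of Python. The escaper below is a simplified stand-in for an old client's underscore replacement, and the variable names are illustrative.

```python
import re

# Simplified legacy character set; anything else becomes "_".
_INVALID = re.compile(r"[^a-zA-Z0-9_:]")


def escape_underscores(name):
    """Underscore replacement as an old client would apply it at write time."""
    return _INVALID.sub("_", name)


# What the old client actually stored for "service.name".
stored_by_old_client = escape_underscores("service.name")

# What a query for the new client's "service/name" expands to in a mixed block.
expanded_query_names = {"service/name", escape_underscores("service/name")}

# True: the expanded query also selects the old client's series.
collides = stored_by_old_client in expanded_query_names
print(stored_by_old_client, collides)
```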
+
+The short answer to avoiding this scenario is **don't do that**. Specifically: if any old clients are present, avoid constructing metrics or labels that could collide; if that is unavoidable, don't mix old and new clients.
+
+As long as all the clients are new, users do not need to worry about collisions -- "service.name" and "service/name" will be stored separately and the queries will never have to be expanded to include the escaped "service_name" possibility.
+
+This situation seems contrived enough that we are comfortable not supporting it.
+
+## Discarded Approaches
+
+### Record the oldest client version used to write data
+
+A previous draft suggested recording the oldest client version used to ingest data in order to determine which blocks might have mixed data.
+This approach was overly complicated and would have required a lot of plumbing to make it work.
+There were also potential issues with block compaction and trying to make sure that the metadata is merged correctly.
+Ultimately we decided that having administrators declare transition dates was an easier approach.
+
+### Rewrite Old Data
+
+We could have required that users rewrite their tsdb blocks to "upgrade" them to UTF-8 and undo the escaping.
+This approach seems tedious, difficult, and dangerous -- what if something goes wrong during rewriting?
+Requiring massive data rewrites is not a reasonable ask of users.
+
+### Lookup Table / Per-Name Config
+
+We considered recording a lookup table or per-name configuration that would describe how UTF-8 metrics and labels might be stored in old data blocks.
+This approach would be faster than doing query expansion, but would create extra operational overhead -- lookup tables would have to be correct and exhaustive.
+
+Because names are stored in the index, query expansion is not expensive enough to justify the extra operational overhead.
+
+### No Migration -- Write Both Versions
+
+We very briefly considered having the tsdb write every variant of a name (native UTF-8 and escaped) whenever the user configured it to do so.
+That way queries for both the native UTF-8 name and the escaped name would succeed.
+When the migration was complete, users could turn off double-writing and only write UTF-8.
+
+This approach would cause an explosion of on-disk usage.
+As disk is one of the most expensive resources, this approach was quickly discarded.