Proposal Amendment: UTF-8 migration #30
Conversation
TODO: need to clarify expansion method, otherwise first draft ready to go
I added @roidelapluie and @bwplotka as reviewers of the original UTF-8 design doc, and @gouthamve, as he was involved in discussing the approach.
Nice, this is quite a smart way to get better UX on those often already-ingested OTel yolo-underscore metrics. Great work!
The collisions are indeed the tricky case here. Hard to say whether there will be cases that surprise users; we don't have specific examples that make semantic sense. Given we cannot rule out surprises completely, I would vote for dropping that storage trick. It adds quite a lot of complexity, yet is still prone to surprising situations. Additionally, if we have a block with mixed client versions (prone to collisions) and it gets compacted into a 2-day or 2-week block (e.g. Thanos), that means the whole 2 weeks is prone to collisions, right? Or do we change the compaction code too?
I would say let's make unsupported input for this migration feature explicitly unsupported, not "sometimes" unsupported. I wonder if we can simply rely on the query API parameter (and feature flag) to allow some debuggability/quick opt-out, without the storage deduction, and be explicit in the docs etc.
Which brings up an interesting question: would this feature be part of the official PromQL spec (i.e. require every PromQL-exposing backend to implement the "UTF-8 compatibility broad" lookup in order to be compatible), or is it an optional feature?
Having explicit regex makes sense.
Quick suggestions:
- Move the storage trick to alternatives, as a possible future improvement if we learn about a specific collision case that makes sense.
- Propose query API param (?)
- Clarify if this is part of the official PromQL spec or just a migration friendly Prometheus feature
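To make the "broad lookup" idea being discussed concrete, here is a hypothetical Python sketch. The function names and the exact set of legacy spellings are assumptions for illustration, not part of the proposal:

```python
def underscore_escape(name: str) -> str:
    # Replace every character that is not legal in a legacy Prometheus
    # metric name with "_" (the historical "replace with underscore"
    # behaviour many clients apply today). Simplified: ignores the
    # rule that legacy names may not start with a digit.
    return "".join(
        c if (c.isascii() and (c.isalnum() or c in "_:")) else "_"
        for c in name
    )

def lookup_candidates(name: str) -> set[str]:
    # A "broad" lookup tries the raw UTF-8 name itself plus the legacy
    # spellings that older clients may have written for the same metric.
    return {name, underscore_escape(name)}
```

Under this sketch, a query for `http.server.duration` would also match series stored as `http_server_duration`.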
### Proposed Solution

To help alleviate this confusion we first propose to bump the version number in the tsdb meta.json file. On a per-block basis, the query code can check the version number and know if the data was written with an old version of the database code. This helps distinguish the first case.
Technically we don't need to bump as the existence of the new entry would tell us if new or old code was used, right? 🤔
The version number bump indicates whether the block was written by a UTF-8-capable Prometheus, but it does not indicate whether the clients sent UTF-8 metrics or pre-escaped metrics (or whether the block contains a combination of both).
Thinking more, I think the implementation could get complicated.
Right now, the TSDB interface has no concept of "ingestion-version", and I would be uncomfortable plumbing it through for just this use-case.
Instead, could we do the following:
- `-promql.utf8_migration.enabled=true`
- `-promql.utf8_migration.until=<date-time>` (optional)
The migration would then become:
- Set the first flag
- Move everything over to UTF-8
- Wait a bit
- Set the second flag
- Once retention is passed, unset the first flag
This might work tbh. The challenge is likely to be that step 2 will take a long time, as most folks don't have control over the clients.
We must consider edge cases in which the blocks database has persisted metrics or labels that may have been written by different client versions. There are multiple ways this can (and will) happen:

* A newer client persists names to an older database version. In this case, names would be escaped with the `U__` syntax. If the database is upgraded, newer blocks will be written in UTF-8.
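For illustration, the `U__` escaping could be sketched as follows. This is a simplified Python sketch of the scheme described in the UTF-8 proposal; the canonical implementation lives in `prometheus/common` and may differ in details (for example, handling of names starting with digits):

```python
def u_escape(name: str) -> str:
    # Characters that are valid in a legacy (pre-UTF-8) metric name.
    def legacy_valid(c: str) -> bool:
        return c.isascii() and (c.isalnum() or c in "_:")

    if all(legacy_valid(c) for c in name):
        return name  # already a legal legacy name, nothing to do
    out = ["U__"]
    for c in name:
        if c == "_":
            out.append("__")             # underscores are doubled
        elif legacy_valid(c):
            out.append(c)                # legal characters pass through
        else:
            out.append(f"_{ord(c):x}_")  # others become the hex code point
    return "".join(out)
```

So `my.metric` would be stored as `U__my_2e_metric` by a newer client writing to an older database.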
What do you mean by database upgrade? Do you mean a new Prometheus version?
Yes, that's what I meant here; I can fix the language.
fixed
Yea, I like that simple logic @gouthamve proposed.
Is the idea that 100% of queries would get replacement-expansion, rather than trying to figure out which blocks are which?
Or does the date-time determine for what periods the expansion will take place?
Both (: If the
Yup, what Bartek explained.
Addressed notes
Epic, thanks! Final suggestions 🤗 (I hope), otherwise LGTM
### Proposed Solution

For queries to return correct data we must differentiate the three cases above, and to do that we first propose to bump the version number in the tsdb meta.json file.
On a per-block basis, the query code can check the version number and know if the data was written with an old version of the Prometheus code.
I am not sure how this is useful. What will the query do if it sees an old block? When UTF-8 is queried, I think we still need to do this "migration broad search" for those, no? As we don't know what escape method clients used.
Or is it an optimization to immediately return no result on those blocks for a non-escaped UTF-8 lookup?
Yes I suppose the only real benefit is removing the UTF-8 query from the list of possibilities. The other escapings may all be possible. Do you think that maybe we don't need a version number bump at all and can just use the flag/date-based logic?
Yea, I am fine with both.
It feels like a small change to bump the version, and it might unlock some optimizations, so I am fine keeping this; I just wanted to clarify.
So yeah, I guess the only optimization is that we never have to run the original UTF-8 query on blocks with an older version. Not sure if that is worth the effort of versioning the blocks. But maybe, as @bwplotka said, the effort is actually quite low, and we might see additional uses of the version number later.
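The optimization being weighed here might look like the following hypothetical sketch (the helper name and the boolean block-metadata check are assumptions): on blocks written by pre-UTF-8 Prometheus, the raw UTF-8 spelling can never be stored, so it can be pruned from the broad lookup, while the escaped spellings remain possible in both cases.

```python
def candidate_names(name: str, block_written_by_utf8_capable: bool) -> list[str]:
    # Simplified underscore-replacement spelling; real code would
    # consider every escaping scheme clients may have used.
    underscored = "".join(
        c if (c.isascii() and (c.isalnum() or c in "_:")) else "_"
        for c in name
    )
    candidates = [underscored]
    # Only UTF-8-capable blocks (detected via the meta.json version
    # bump) can contain the raw UTF-8 name.
    if block_written_by_utf8_capable and name != underscored:
        candidates.append(name)
    return candidates
```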
This proposal expands upon the existing UTF-8 proposal, providing more detail into how we plan to handle the migration of metrics data, ensuring that it remains queryable during the transition period.
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Change approach, use flags instead of trying to auto-detect mixed blocks. Signed-off-by: Owen Williams <owen.williams@grafana.com>
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com> Signed-off-by: Owen Williams <owen-github@ywwg.com> Signed-off-by: Owen Williams <owen.williams@grafana.com>
Signed-off-by: Owen Williams <owen.williams@grafana.com>
had to fix some commit signing, hopefully I didn't break anything
LGTM, thanks! 💪🏽
I assume the answer is NO, it's an optional backend feature and not part of the PromQL spec (:
Signed-off-by: Owen Williams <owen.williams@grafana.com>
I agree, I think this is an optional feature and not a requirement.
Thank you very much. I think this is substantially good to go. My comments are mostly about wordsmithing and clarifications.
All of these situations can be summarized as follows:

1. **Old Data** -- Data written with old Prometheus code: all names are guaranteed not to be UTF-8.
This sounds like it includes the case where an old Prometheus has ingested from new producers (and names might include escaped names).
Or is this referring to old Prometheus and old producers, so even escaping in names can be ruled out?
Could you clarify?
After having read through all the below, I would say what's meant here is "if at all, we will have escaped names, so we would never query for UTF-8 names, but only try out the specified escaping schemas".
Still, I think the difference to the mixed data case should be made clearer.
In principle, there might be a scenario where the user knows for sure that they won't have any escaping at all, for example if they had a pure Prometheus stack so far (no OTel etc.), but they would like to use UTF-8 names, so they still have a period of mixed versions deployed (new and old producers, new and old ingesters). I guess it's fine to not implement a specific optimization for that use case (which would be that we don't need a broad search for the old blocks at all), but it would be good if that case is described and that we disregard it as an informed decision.
Signed-off-by: Owen Williams <owen.williams@grafana.com>
all notes addressed, please let me know if I missed anything
merge ping?
LGTM. I assume all other reviewers are also happy at this point. Please follow up if not.
This proposal expands upon the existing UTF-8 proposal, providing more detail into how we plan to handle the migration of metrics data, ensuring that it remains queryable during the transition period. This is critical to a smooth migration of OTel data.