Skip to content

TSDB: unicode dimensions #83681

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Feb 9, 2022
Merged

TSDB: unicode dimensions #83681

merged 4 commits into from
Feb 9, 2022

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Feb 8, 2022

This fixes dimensions with unicode characters in their name. It also
make the sort order of these dimensions something we can explain:
convert all dimensions to utf-8 and sort by that.

@nik9000 nik9000 requested review from imotov and csoulios February 8, 2022 20:33
@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Feb 8, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

This fixes dimensions with unicode characters in their name. It also
make the sort order of these dimensions something we can explain:
convert all dimensions to utf-8 and sort by that.
@@ -36,7 +36,7 @@
* for generating the _tsid field. The map will be used by {@link TimeSeriesIdFieldMapper}
* to build the _tsid field for the document.
*/
private SortedMap<String, BytesReference> dimensionBytes;
private SortedMap<BytesRef, BytesReference> dimensionBytes;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to bytes ref here to make it really obvious these are sorted by utf-8.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ulterior motive: I'll want them in utf-8 in more than one place.

"Dimension name must be less than [%s] bytes but [%s] was [%s].",
DIMENSION_NAME_LIMIT,
fieldName.utf8ToString(),
fieldName.length
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The formatter made a mess of this when I made the parameters a little longer so I switched it to String.format so human can read the message.

@@ -193,7 +197,7 @@ public static void encodeTsid(StreamOutput out, SortedMap<String, BytesReference
Map<String, Object> result = new LinkedHashMap<String, Object>(size);

for (int i = 0; i < size; i++) {
String name = in.readString();
String name = in.readBytesRef().utf8ToString();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the bug that made this not work - sometimes BytesRef and String don't write the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is a value in hiding the fact that it's BytesRef all the way down and can this hurt us beyond serialization thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you thinking this should be it's own class? Like one that just wraps BytesRef or something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if I am completely honest, I was thinking why do we have 2 different maps with String in one case and BytesRef in another case that essentially represent the same data. But I think I like your interpretation much better. It has quite a bit of complexity that could probably benefit from some encapsulation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, maybe we need a TimeSeriesId class to represent this sort of data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like me to do it in this or a followup?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's not mix refactoring with the actual change.

@nik9000 nik9000 added the test-release Trigger CI checks against release build label Feb 9, 2022
Copy link
Contributor

@csoulios csoulios left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@nik9000 nik9000 added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Feb 9, 2022
@@ -123,6 +125,23 @@ public void testStrings() throws IOException {
);
}

public void testUnicodeKeys() throws IOException {
String fire = new String(new int[] { 0x1F525 }, 0, 1);
String coffee = "\u2615";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

☕ is a good dimension

@nik9000 nik9000 merged commit 3d2c333 into elastic:master Feb 9, 2022
@nik9000 nik9000 deleted the tsdb_unicode_dims branch February 9, 2022 22:35
@weizijun
Copy link
Contributor

hi, @nik9000 , as the type of lucene field name is String, what's the use case for the key of tsid using unicode characters, Using the ByteRef to a Map key is a bit dangerous,as ByteRef is a reference, hashCode can be changed from its content, it may result in some strange behavior.
Maybe @jpountz can take a look at this PR?

@nik9000
Copy link
Member Author

nik9000 commented Feb 14, 2022

BytesRef is always a little more dangerous because you don't really know what folks are giving you. In this case we only need BytesRef for the duration of the method that receives it. Since we're not storing any references to the references I'd say it's save. But, even so, I opened #83799 to hide the BytesRefs from plain sight.

@nik9000
Copy link
Member Author

nik9000 commented Feb 14, 2022

I opened #83799 to hide the BytesRefs from plain sight.

And to force them to be used in a safe way.

@weizijun
Copy link
Contributor

as the type of lucene field name is String, what's the use case for the key of tsid using unicode characters?

hi, @nik9000 can you explain it? I would like to know the reason why String does not support storing field names.

@nik9000
Copy link
Member Author

nik9000 commented Feb 15, 2022

I would like to know the reason why String does not support storing field names.

It does. But it won't necessarily sort them in the same way as encoding the dimensions to utf-8 and then sorting them. I mean, there was a bug, but we don't need to invoke BytesRef in the Map to solve the bug. But I was mostly concerned about the sorting here. Mostly out of paranoia too. I think it'd be fairly hard to come up with field names that don't sort the same way. But we encode the data using utf-8 anyway so sorting by utf-8 was easy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >non-issue :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) test-release Trigger CI checks against release build v8.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants