-
Notifications
You must be signed in to change notification settings - Fork 25.3k
TSDB: unicode dimensions #83681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TSDB: unicode dimensions #83681
Conversation
Pinging @elastic/es-analytics-geo (Team:Analytics) |
This fixes dimensions with unicode characters in their name. It also make the sort order of these dimensions something we can explain: convert all dimensions to utf-8 and sort by that.
@@ -36,7 +36,7 @@ | |||
* for generating the _tsid field. The map will be used by {@link TimeSeriesIdFieldMapper} | |||
* to build the _tsid field for the document. | |||
*/ | |||
private SortedMap<String, BytesReference> dimensionBytes; | |||
private SortedMap<BytesRef, BytesReference> dimensionBytes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched to bytes ref here to make it really obvious these are sorted by utf-8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ulterior motive: I'll want them in utf-8 in more than one place.
"Dimension name must be less than [%s] bytes but [%s] was [%s].", | ||
DIMENSION_NAME_LIMIT, | ||
fieldName.utf8ToString(), | ||
fieldName.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The formatter made a mess of this when I made the parameters a little longer so I switched it to String.format so human can read the message.
@@ -193,7 +197,7 @@ public static void encodeTsid(StreamOutput out, SortedMap<String, BytesReference | |||
Map<String, Object> result = new LinkedHashMap<String, Object>(size); | |||
|
|||
for (int i = 0; i < size; i++) { | |||
String name = in.readString(); | |||
String name = in.readBytesRef().utf8ToString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the bug that made this not work - sometimes BytesRef and String don't write the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is a value in hiding the fact that it's BytesRef all the way down and can this hurt us beyond serialization thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you thinking this should be it's own class? Like one that just wraps BytesRef
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if I am completely honest, I was thinking why do we have 2 different maps with String in one case and BytesRef in another case that essentially represent the same data. But I think I like your interpretation much better. It has quite a bit of complexity that could probably benefit from some encapsulation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear, maybe we need a TimeSeriesId class to represent this sort of data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you like me to do it in this or a followup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, let's not mix refactoring with the actual change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
server/src/main/java/org/elasticsearch/index/mapper/TimeSeriesIdFieldMapper.java
Outdated
Show resolved
Hide resolved
@@ -123,6 +125,23 @@ public void testStrings() throws IOException { | |||
); | |||
} | |||
|
|||
public void testUnicodeKeys() throws IOException { | |||
String fire = new String(new int[] { 0x1F525 }, 0, 1); | |||
String coffee = "\u2615"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
☕ is a good dimension
hi, @nik9000 , as the type of lucene field name is |
|
And to force them to be used in a safe way. |
hi, @nik9000 can you explain it? I would like to know the reason why |
It does. But it won't necessarily sort them in the same way as encoding the dimensions to utf-8 and then sorting them. I mean, there was a bug, but we don't need to invoke BytesRef in the Map to solve the bug. But I was mostly concerned about the sorting here. Mostly out of paranoia too. I think it'd be fairly hard to come up with field names that don't sort the same way. But we encode the data using utf-8 anyway so sorting by utf-8 was easy. |
This fixes dimensions with unicode characters in their name. It also
make the sort order of these dimensions something we can explain:
convert all dimensions to utf-8 and sort by that.