TSDB: unicode dimensions #83681

nik9000 · 2022-02-08T20:33:01Z

This fixes dimensions with unicode characters in their name. It also
make the sort order of these dimensions something we can explain:
convert all dimensions to utf-8 and sort by that.

elasticmachine · 2022-02-08T20:33:05Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

This fixes dimensions with unicode characters in their name. It also make the sort order of these dimensions something we can explain: convert all dimensions to utf-8 and sort by that.

nik9000 · 2022-02-08T20:33:27Z

server/src/main/java/org/elasticsearch/index/mapper/LuceneDocument.java

@@ -36,7 +36,7 @@
     * for generating the _tsid field. The map will be used by {@link TimeSeriesIdFieldMapper}
     * to build the _tsid field for the document.
     */
-    private SortedMap<String, BytesReference> dimensionBytes;
+    private SortedMap<BytesRef, BytesReference> dimensionBytes;


I switched to bytes ref here to make it really obvious these are sorted by utf-8.

Ulterior motive: I'll want them in utf-8 in more than one place.

nik9000 · 2022-02-08T20:34:37Z

server/src/main/java/org/elasticsearch/index/mapper/TimeSeriesIdFieldMapper.java

+                        "Dimension name must be less than [%s] bytes but [%s] was [%s].",
+                        DIMENSION_NAME_LIMIT,
+                        fieldName.utf8ToString(),
+                        fieldName.length


The formatter made a mess of this when I made the parameters a little longer so I switched it to String.format so human can read the message.

nik9000 · 2022-02-08T20:38:01Z

server/src/main/java/org/elasticsearch/index/mapper/TimeSeriesIdFieldMapper.java

@@ -193,7 +197,7 @@ public static void encodeTsid(StreamOutput out, SortedMap<String, BytesReference
            Map<String, Object> result = new LinkedHashMap<String, Object>(size);

            for (int i = 0; i < size; i++) {
-                String name = in.readString();
+                String name = in.readBytesRef().utf8ToString();


Here's the bug that made this not work - sometimes BytesRef and String don't write the same.

I wonder if there is a value in hiding the fact that it's BytesRef all the way down and can this hurt us beyond serialization thing.

Are you thinking this should be it's own class? Like one that just wraps BytesRef or something?

Well, if I am completely honest, I was thinking why do we have 2 different maps with String in one case and BytesRef in another case that essentially represent the same data. But I think I like your interpretation much better. It has quite a bit of complexity that could probably benefit from some encapsulation.

Just to be clear, maybe we need a TimeSeriesId class to represent this sort of data.

Would you like me to do it in this or a followup?

Yeah, let's not mix refactoring with the actual change.

csoulios

LGTM!

server/src/main/java/org/elasticsearch/index/mapper/TimeSeriesIdFieldMapper.java

imotov · 2022-02-09T21:06:44Z

server/src/test/java/org/elasticsearch/index/mapper/TimeSeriesIdFieldMapperTests.java

@@ -123,6 +125,23 @@ public void testStrings() throws IOException {
        );
    }

+    public void testUnicodeKeys() throws IOException {
+        String fire = new String(new int[] { 0x1F525 }, 0, 1);
+        String coffee = "\u2615";


☕ is a good dimension

weizijun · 2022-02-14T09:16:44Z

hi, @nik9000 , as the type of lucene field name is String, what's the use case for the key of tsid using unicode characters, Using the ByteRef to a Map key is a bit dangerous，as ByteRef is a reference, hashCode can be changed from its content, it may result in some strange behavior.
Maybe @jpountz can take a look at this PR?

nik9000 · 2022-02-14T19:32:58Z

BytesRef is always a little more dangerous because you don't really know what folks are giving you. In this case we only need BytesRef for the duration of the method that receives it. Since we're not storing any references to the references I'd say it's save. But, even so, I opened #83799 to hide the BytesRefs from plain sight.

nik9000 · 2022-02-14T19:33:13Z

I opened #83799 to hide the BytesRefs from plain sight.

And to force them to be used in a safe way.

weizijun · 2022-02-15T08:27:16Z

as the type of lucene field name is String, what's the use case for the key of tsid using unicode characters?

hi, @nik9000 can you explain it? I would like to know the reason why String does not support storing field names.

nik9000 · 2022-02-15T15:58:04Z

I would like to know the reason why String does not support storing field names.

It does. But it won't necessarily sort them in the same way as encoding the dimensions to utf-8 and then sorting them. I mean, there was a bug, but we don't need to invoke BytesRef in the Map to solve the bug. But I was mostly concerned about the sorting here. Mostly out of paranoia too. I think it'd be fairly hard to come up with field names that don't sort the same way. But we encode the data using utf-8 anyway so sorting by utf-8 was easy.

nik9000 added >non-issue :StorageEngine/TSDB You know, for Metrics v8.2.0 labels Feb 8, 2022

nik9000 requested review from imotov and csoulios February 8, 2022 20:33

elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Feb 8, 2022

TSDB: unicode dimensions

938b212

This fixes dimensions with unicode characters in their name. It also make the sort order of these dimensions something we can explain: convert all dimensions to utf-8 and sort by that.

nik9000 commented Feb 8, 2022

View reviewed changes

Comment

3338dd8

nik9000 added the test-release Trigger CI checks against release build label Feb 9, 2022

csoulios approved these changes Feb 9, 2022

View reviewed changes

server/src/main/java/org/elasticsearch/index/mapper/TimeSeriesIdFieldMapper.java Outdated Show resolved Hide resolved

nik9000 added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Feb 9, 2022

nik9000 added 2 commits February 9, 2022 15:47

Merge branch 'master' into tsdb_unicode_dims

bd0aad0

Format

639de64

imotov reviewed Feb 9, 2022

View reviewed changes

nik9000 merged commit 3d2c333 into elastic:master Feb 9, 2022

nik9000 deleted the tsdb_unicode_dims branch February 9, 2022 22:35

TSDB: unicode dimensions #83681

TSDB: unicode dimensions #83681

Uh oh!

Conversation

nik9000 commented Feb 8, 2022

Uh oh!

elasticmachine commented Feb 8, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

csoulios left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

weizijun commented Feb 14, 2022

Uh oh!

nik9000 commented Feb 14, 2022

Uh oh!

nik9000 commented Feb 14, 2022

Uh oh!

weizijun commented Feb 15, 2022

Uh oh!

nik9000 commented Feb 15, 2022

Uh oh!

Uh oh!