Skip to content

About FieldConfig.IndexType values #10615

@gortiz

Description

@gortiz

Objective

The objective of this issue is to discuss about FieldConfig.IndexType, which:

We cannot remove this enum because we need to keep backward compatibility, but we should decide what to do with the four literals that are not documented and are not being used (or used in an inconsistent way).

Description

FieldConfig.indexTypes was intended to be the new way to create all indexes, but it was superseded by FieldConfig.indexes attribute introduced in #10183. The main problems with FieldConfig.indexTypes were that:

  • Most indexes can be configured in some way. To do so, FieldConfig.indexTypes required to introduce unsafe string-based configs in FieldConfig.properties.
  • IndexTypes is an enum that doesn't contain all valid indexes and includes some extra values that are not valid.

This issue is related to the last point. Reading the documentation, the valid values here should be: Text, FST, Timestamp and h3. But the enum right now contains the following literals:

  • TEXT
  • FST
  • JSON
  • H3
  • INVERTED
  • SORTED
  • RANGE
  • TIMESTAMP

The last 4 are not included in the documentation and they are not being consistently used in the code.

Inconsistent usage

To do this analysis I used tag release-0.12.1. The reason to do not use master is that #10183 (already merged) modifies a lot of code related to indexes, so I want to be sure that the inconsistent usage of these literals was already there before index-spi was merged:

Inverted

  • It was introduced in the first version of this document (2020-02-07).
  • It is not read in production code. The only two usages are in MutableSegmentImpl (here, and here). In both cases it is being used to report some error.
  • It is being used in test, but its usage doesn't seem actually important.

Sorted

  • It was introduced in the first version of this document (2020-02-07).
  • It is only being used in HadoopSegmentPreprocessingJob. This class does not exist in master and it is not being used in master. The fact that it is not an actual index type and it was only used here makes me think that it was introduced as an attempt to do something, but it wasn't finished.

Range

Timestamp

  • This was introduced in Timestamp type index #8343 (2022-4-11).
  • The only usage in production code is a check. This check fails if the FieldConfig points to a column that is not a timestamp but indexTypes contains the TIMESTAMP literal.
  • Timestamp indexes are not actual indexes. There is no timestamp index writer, reader or handler. They are syntactic sugar (or an alias). When pinot detects that this syntactic sugar has to be applied (more on that later) what actually happens is that Pinot generates one extra row for each timestamp granularity and index that column with a range index.
  • Contrary to what documentation says, FieldConfig.indexTypes doesn't have to contain the TIMESTAMP literal in order to activate this syntactic sugar. It is activated if and only if the FieldConfig.timestampConfig is not null. Whether or not TIMESTAMP is included in FieldConfig.indexTypes has zero impact on that decision.
  • The timestamp syntactic sugar is just a modification of the table config and the schema and it is applied in:

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions