Skip to content

Explicit null handling#11960

Merged
Jackie-Jiang merged 55 commits intoapache:masterfrom
gortiz:schema-null-handling
Nov 20, 2023
Merged

Explicit null handling#11960
Jackie-Jiang merged 55 commits intoapache:masterfrom
gortiz:schema-null-handling

Conversation

@gortiz
Copy link
Contributor

@gortiz gortiz commented Nov 7, 2023

This is another approach to implement column level nullability, as explained in #10381.

Introducing null handling modes

After discussing with @Jackie-Jiang, @walterddr and @npawar we decided to propose this third approach where each table specify how to handle nulls. There are two possible ways to handle null:

  • Table mode, which is what we were using so far
  • Column mode, which means that each column specifies its own nullability in the FieldSpec.

Multi-stage engine (aka V2) has also been modified to reject queries where SET enableNullHandling = true AND at least one of the tables involved in the query uses Table mode null handling.

TODO

  • Modifications in Schema
  • Modifications in TypeFactory
  • Modifications in NullValueIndexType
  • Modifications in the ingestion side
  • Modifications in test engine
  • Write resource tests

Proposed Syntax

Although the null handling mode semantics are clear, it is not so clear how to configure them. What is explained here is just a proposal open to discussion and can be changed during the PR review.

The proposal is to add a new JSON property in the schema*.

This new property can be configured as:

{
  "schemaName": "blablabla",
  "dimensionFieldSpecs": [
    {
      "dataType": "INT",
      "name": "nullableField",
      "notNull": false
    },
    {
      "dataType": "INT",
      "name": "notNullableField",
      "notNull": true
    },
    {
      "dataType": "INT",
      "name": "defaultNullableField"
    },
    ...
  ],
  "enableColumnBasedNullHandling": true/false
}

The default value for enableColumnBasedNullHandling is false, which keeps compatibility with previous code. When it is changed to true, Pinot will ignore TableConfig.IndexingConfig.nullHandlingEnabled and columns will be nullable if and only if FieldSpec.notNull is false, which is also the default value.

* It is yet to be decide whether it will be added to the TableConfig or the Schema. When this PR was open it was added to the Schema.

@gortiz gortiz marked this pull request as draft November 7, 2023 11:28
@gortiz
Copy link
Contributor Author

gortiz commented Nov 16, 2023

In https://github.com/apache/pinot/actions/runs/6874613568/job/18696611890?pr=11960 you can see the test that failed. I'm going to comment them again in order to have a green light.

@walterddr
Copy link
Contributor

In https://github.com/apache/pinot/actions/runs/6874613568/job/18696611890?pr=11960 you can see the test that failed. I'm going to comment them again in order to have a green light.

the output value was changed in #11234 CC @Jackie-Jiang to take a look.

Copy link
Contributor

@walterddr walterddr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me mostly. please kindly address the follow ups. mostly on changes that can be separate into different PR

private static final String FILE_FILTER_PROPERTY = "pinot.fileFilter";
private static final String IGNORE_FILTER_PROPERTY = "pinot.runIgnored";
private static final int DEFAULT_NUM_PARTITIONS = 4;
private static final boolean REPORT_IGNORES = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we separate this one from this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rollbacked most of these changes in 3dd0886

}

@DataProvider
@DataProvider(parallel = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wasn't sure this is fully verified. if possible let's keep this separate into another PR

gortiz and others added 4 commits November 17, 2023 18:54
…al/segment/index/loader/defaultcolumn/BaseDefaultColumnHandler.java

Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com>
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise

"options": {
"nullHandling": {
"mode": "column",
"default": true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^^
Also, how does it work right now? Do we have test that rely on column based null handling to be enabled?

{
"comments": "table aren't actually partitioned by val thus all segments can produce duplicate results, thus [[8]]",
"sql": "SELECT SEGMENT_PARTITIONED_DISTINCT_COUNT(val) FROM {tbl1}",
"ignored": true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we try removing the ignored flag and see if they still fail? If so, we might want to debug them because this can cause regression


@Test(dataProvider = "provideCases", dataProviderClass = NullValueIndexTypeTest.class)
public void isEnabledWhenNullable(boolean columnNullHandling, boolean fieldNullable, IndexConfig expected) {
_tableConfig.getIndexingConfig().setNullHandlingEnabled(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is needed b/c the logic now rely on table config. (backward incompat?)

@Jackie-Jiang Jackie-Jiang added feature release-notes Referenced by PRs that need attention when compiling the next release notes Configuration Config changes (addition/deletion/change in behavior) multi-stage Related to the multi-stage query engine documentation labels Nov 20, 2023
@Jackie-Jiang Jackie-Jiang merged commit 35748a0 into apache:master Nov 20, 2023
@gortiz gortiz deleted the schema-null-handling branch December 7, 2023 21:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Configuration Config changes (addition/deletion/change in behavior) documentation feature multi-stage Related to the multi-stage query engine null support release-notes Referenced by PRs that need attention when compiling the next release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants