
Add nullability to FieldSpec and use it in TypeFactory#11824

Closed
gortiz wants to merge 19 commits intoapache:masterfrom
gortiz:nullable-columns

Conversation

@gortiz
Contributor

@gortiz gortiz commented Oct 17, 2023

This PR intends to implement column-level nullability, as explained in #10381.

Current status:

  • Add nullability to Schema.FieldSpec.
  • Use Schema.FieldSpec.getNullable to inform Calcite about actual column nullability.
  • Use Schema.FieldSpec.getNullable to partially support column-level nullability in V1.
    • Modifying V1 to fully support column-level nullability looks like a long task.
    • What we are going to do first is:
      • Modify the leaves of the execution tree to ignore the actual null value vector for columns whose Schema.FieldSpec.getNullable is false. These columns will be treated as if they were created with a TableConfig where nullHandlingEnabled was false.
      • Keep the rest of the V1 engine stable, so new null vectors will still be calculated for transformations applied to non-nullable columns. But given that these columns will not actually produce null values, their transformations should also be non-null.
  • Modify V2 leaves to force enableNullHandling=true.
    • For now we are not going to apply this change in this PR. Users will still need to provide this option at query time.
  • Add tests that verify the changes.

Semantics in V1 are changed to:

| enableNullHandling | Segment has null vector | FieldSpec* | in master | in PR |
|---|---|---|---|---|
| false | N/A | N/A | not null | not null |
| true | false | N/A | not null | not null |
| true | true | true (default) | null | null |
| true | true | false | null | not null |

*: Value of FieldSpec.getNullable()
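
A minimal sketch of the semantics table above, assuming illustrative names (this class and method are not Pinot's actual API):

```java
public class V1NullSemantics {
    /**
     * Computes whether nulls are observed for a column under the semantics in the table above.
     *
     * @param enableNullHandling   value of the enableNullHandling query option
     * @param segmentHasNullVector whether the segment stores a null value vector
     * @param fieldSpecNullable    FieldSpec.getNullable(); null means "not declared"
     */
    static boolean isNullObserved(boolean enableNullHandling,
                                  boolean segmentHasNullVector,
                                  Boolean fieldSpecNullable) {
        if (!enableNullHandling || !segmentHasNullVector) {
            return false; // first two rows of the table: never null
        }
        // last two rows: nullable unless FieldSpec explicitly says false
        return fieldSpecNullable == null || fieldSpecNullable;
    }

    public static void main(String[] args) {
        // The only row that changes versus master is (true, true, false).
        System.out.println(isNullObserved(true, true, null));
        System.out.println(isNullObserved(true, true, Boolean.FALSE));
    }
}
```

The only row whose outcome changes relative to master is the last one, where FieldSpec explicitly declares the column non-nullable.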

@gortiz gortiz marked this pull request as draft October 17, 2023 16:16
@codecov-commenter

codecov-commenter commented Oct 17, 2023

Codecov Report

Merging #11824 (607c764) into master (f232a5f) will decrease coverage by 33.65%.
Report is 43 commits behind head on master.
The diff coverage is 20.00%.

@@              Coverage Diff              @@
##             master   #11824       +/-   ##
=============================================
- Coverage     61.25%   27.60%   -33.65%     
+ Complexity     1141      207      -934     
=============================================
  Files          2373     2385       +12     
  Lines        128263   129111      +848     
  Branches      19799    19954      +155     
=============================================
- Hits          78563    35646    -42917     
- Misses        44013    90785    +46772     
+ Partials       5687     2680     -3007     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (?)
java-11 <0.01% <0.00%> (?)
java-21 27.60% <20.00%> (-33.65%) ⬇️
skip-bytebuffers-false <0.01% <0.00%> (-27.57%) ⬇️
skip-bytebuffers-true 27.60% <20.00%> (-7.10%) ⬇️
temurin 27.60% <20.00%> (-33.65%) ⬇️
unittests 27.60% <20.00%> (-33.65%) ⬇️
unittests1 ?
unittests2 27.60% <20.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...t/core/query/selection/SelectionOperatorUtils.java 0.00% <ø> (-92.77%) ⬇️
...local/indexsegment/mutable/MutableSegmentImpl.java 71.84% <100.00%> (-0.20%) ⬇️
...indexsegment/immutable/ImmutableSegmentLoader.java 81.65% <0.00%> (-2.32%) ⬇️
...local/segment/index/loader/IndexLoadingConfig.java 78.93% <0.00%> (-0.58%) ⬇️
...al/segment/index/nullvalue/NullValueIndexType.java 75.00% <81.25%> (ø)
...main/java/org/apache/pinot/spi/data/FieldSpec.java 0.00% <0.00%> (-84.62%) ⬇️
...rc/main/java/org/apache/pinot/spi/data/Schema.java 0.00% <0.00%> (-73.77%) ⬇️
.../java/org/apache/pinot/query/type/TypeFactory.java 0.00% <0.00%> (-68.97%) ⬇️

... and 1200 files with indirect coverage changes


@gortiz gortiz marked this pull request as ready for review October 23, 2023 13:05
@gortiz gortiz marked this pull request as draft October 23, 2023 13:05
@gortiz
Contributor Author

gortiz commented Oct 23, 2023

Tests are now passing. We need two extra things:

  • Enable enableNullHandling=true in leaves
  • Add null specific tests

Contributor

The long-term goal is to deprecate IndexingConfig.isNullHandlingEnabled() and rely only on this field to decide whether a column is nullable. Currently IndexingConfig.isNullHandlingEnabled() is used only to decide whether to create the null value vector for the columns, and very likely we will make it true by default. Do you see a clear path to move away from the table level config?
There can be a big performance impact when a column is registered as nullable, so I wouldn't register it nullable unless it is explicitly configured nullable here.

Contributor

  • should we default it to false unless otherwise specified (and thus no need for a non-primitive boolean)?
  • as long as the query flag enableNullHandling overrides this setting we should be good for v1 compatibility --> as in v2 we will not use this option

Contributor Author

Do you see a clear path to move away from the table level config?
I think it is clear:

  1. Phase 1:
    • For read path: Use FieldSpec nullability if declared, otherwise fall back to IndexingConfig.isNullHandlingEnabled()
    • For write path: Use IndexingConfig.isNullHandlingEnabled()
  2. Phase 2:
    • For read path (same as Phase 1): Use FieldSpec nullability if declared, otherwise fall back to IndexingConfig.isNullHandlingEnabled()
    • For write path: Use FieldSpec nullability if declared, otherwise fall back to IndexingConfig.isNullHandlingEnabled()
  3. Phase 3:
    • Deprecate IndexingConfig.isNullHandlingEnabled() (although we still need to support it for compatibility reasons)
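
The phased fallback above can be sketched as follows (class and method names are illustrative, not actual Pinot code):

```java
public class PhasedNullability {
    enum Phase { PHASE_1, PHASE_2 }

    // Read path (both phases): FieldSpec nullability wins when declared,
    // otherwise fall back to the table-level IndexingConfig flag.
    static boolean readPathNullable(Boolean fieldSpecNullable, boolean tableLevelEnabled) {
        return fieldSpecNullable != null ? fieldSpecNullable : tableLevelEnabled;
    }

    // Write path: table-level flag only in phase 1; same fallback as the
    // read path starting with phase 2.
    static boolean writePathNullable(Phase phase, Boolean fieldSpecNullable,
                                     boolean tableLevelEnabled) {
        return phase == Phase.PHASE_1
            ? tableLevelEnabled
            : readPathNullable(fieldSpecNullable, tableLevelEnabled);
    }

    public static void main(String[] args) {
        // FieldSpec declared non-nullable: read path honors it,
        // the phase-1 write path does not.
        System.out.println(readPathNullable(Boolean.FALSE, true));
        System.out.println(writePathNullable(Phase.PHASE_1, Boolean.FALSE, true));
    }
}
```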

There can be a big performance impact when a column is registered as nullable, so I wouldn't register it nullable unless it is explicitly configured nullable here.

Sure, we need to document that so users know it. With this change we actually expect to reduce the number of nullable columns, given that what we are providing (especially in phase 1) is the ability to make columns non-nullable in tables that are nullable by default.

should we default it to false unless otherwise specified (and thus no need for non-primitive boolean)
I don't think that is a good idea. Without making this non-primitive we cannot know whether the value was explicitly defined as false or defaulted to false. We would also be breaking backward compatibility. With a primitive boolean defaulting to false, in order to make a column nullable users would need to:

  • In Phase 1:
    • Set IndexingConfig.isNullHandlingEnabled() to true. This is required at write time, given that the V1 write path ignores FieldSpec nullability.
    • Set FieldSpec nullability to true for each nullable column. This is a breaking change: all tables already created would need to be modified, otherwise they would suddenly be treated as not null.
  • In Phase 2:
    • Set FieldSpec nullability to true for each nullable column.

The current implementation is backward compatible (users don't need to set FieldSpec nullability). The semantics can be read in NullValueIndexTypeTest. Another advantage of using a non-primitive Boolean is that we don't lose information when field specs are serialized and deserialized, without any special handling.

For example, a FieldSpec where nullability is not defined will be serialized to:

{
  "name": "col1",
  "dataType": "INT"
}

while in your proposal it would be serialized by default to:

{
  "name": "col1",
  "dataType": "INT",
  "nullable": false
}

Alternatively we could skip serializing nullable when its value is false, but then we would lose the information that it was explicitly declared.
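
To make the tri-state point concrete, here is a toy sketch (the serialize helper is hypothetical and hand-rolled; no JSON library involved) showing how a Boolean can preserve "undeclared" while a primitive boolean cannot:

```java
public class TriStateNullable {
    // Mirrors the JSON examples above: omit the field entirely when undeclared,
    // emit it when explicitly declared (true or false).
    static String serialize(Boolean nullable) {
        return nullable == null
            ? "{\"name\":\"col1\",\"dataType\":\"INT\"}"
            : "{\"name\":\"col1\",\"dataType\":\"INT\",\"nullable\":" + nullable + "}";
    }

    public static void main(String[] args) {
        System.out.println(serialize(null));          // undeclared: field omitted
        System.out.println(serialize(Boolean.FALSE)); // explicit false: field kept
    }
}
```

With a primitive boolean, serialize(null) is impossible, so "undeclared" and "explicitly false" would produce the same output.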

as long as the query flag EnableNullHandling overrides this setting we should be good for v1 compatibility --> as in v2 we will no use this option

The enableNullHandling query flag does not override this value. They are applied at different parts of the code.

  • Column level nullability indicates:
    • In phase 1: Whether the null value vector should be considered for this column or not.
    • In phase 2: The same as phase 1, plus it will actually decide whether the null value vector is generated or not.
  • The enableNullHandling query flag indicates whether we use the null value vector during query execution or not. It basically changes the algorithm used to evaluate some operators. For example, a count(col1) without enableNullHandling can be resolved by reading segment metadata, while the same count with enableNullHandling also requires reading the null value vector to subtract the number of null values in the column.
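
The count(col1) example above can be illustrated with a hedged sketch (not actual operator code; names are made up):

```java
public class CountWithNulls {
    /**
     * Without null handling, COUNT(col) comes straight from segment metadata.
     * With null handling, the null value vector cardinality must be subtracted.
     */
    static long count(long totalDocsFromMetadata, long nullCountFromNullVector,
                      boolean nullHandlingEnabled) {
        return nullHandlingEnabled
            ? totalDocsFromMetadata - nullCountFromNullVector
            : totalDocsFromMetadata;
    }

    public static void main(String[] args) {
        System.out.println(count(100, 7, false));
        System.out.println(count(100, 7, true));
    }
}
```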

Contributor

sorry for following up late:

Phase 2: ... otherwise delegate on IndexingConfig.isNullHandlingEnabled()?
Phase 3:
Deprecate IndexingConfig.isNullHandlingEnabled() (although we still need to support it for compatibility reasons)

^ when it is deprecated in phase 3, what happens to the fallback logic from phase 2? Does it become false or true?

There are potential big performance impact

IMO: 4 scenarios currently

  1. IndexingConfig.isNullHandlingEnabled() = true and SET enableNullHandling=true;
  2. IndexingConfig.isNullHandlingEnabled() = false and SET enableNullHandling=true;
  3. IndexingConfig.isNullHandlingEnabled() = true and SET enableNullHandling=false;
  4. IndexingConfig.isNullHandlingEnabled() = false and SET enableNullHandling=false;

as long as the following are met we should be good:

  • performance on queries should be the same for both 3 & 4 --> we should guarantee the same if all columns are marked as non-nullable by the user
  • performance on queries should be the same for both 1 & 2, and will be worse than 3 & 4 b/c they access the null vector --> we should guarantee better performance when some columns are marked as non-nullable by the user

I don't think that is a good idea. Without making this non-primitive we cannot know whether the value was explicitly defined as false or it has defaulted to false. We would also be breaking backward compatibility. By using a primitive boolean with false by default, in order to make a column nullable users would need to:
In Phase 1:

  • Set IndexingConfig.isNullHandlingEnabled() to true. This is required at write time, given in V1 write path ignores FieldSpec nullability.
  • Set each, for each nullable column, FieldSpec nullability to true. This is a breaking change. All tables already created would need to be modified, otherwise they would be suddenly treated as not null.
    In Phase 2:
  • Set each, for each nullable column, FieldSpec nullability to true.

there's no backward compatibility issue here, b/c the default behavior of V1 is all non-nullable, so a primitive boolean defaulting to false achieves exactly that.
you don't need to set nullability = true for all columns, b/c

  • in v1 this can be controlled (and should still be controlled) by the SET enableNullHandling=true option, regardless of whether this field is boolean or Boolean
  • in v2 there's no such option, so it will honor the nullability setting --> which requires it to be set explicitly to true anyway, otherwise nothing works.

@gortiz
Contributor Author

gortiz commented Oct 24, 2023

@walterddr Tests are failing now because when null handling is enabled at query time, the query plan changes. For example, OfflineClusterIntegrationTest.testNonScanAggregationQueries fails in SELECT MIN(ArrTime) FROM mytable because we do not use NonScanBasedAggregationOperator anymore.

We can improve that, but I think we should do it in another PR. Therefore I think I should modify these tests so the offending queries do not assume they will be non-scan based. What do you think?


Contributor

in addition, let's add some tests to the NullHandling.json test file to validate that null values are handled as expected

Contributor

this only affects V2, right?

Contributor Author

Yes. As we discussed, we could also decide to only do this when the key is not already present. That way we would evaluate nulls by default, but they could be disabled by setting the enableNullHandling=false option at query time.

Contributor Author

I'm going to remove this change. Let's decide whether we want to enable null handling in the leaves in a follow-up iteration.

Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Let me know if my understanding is correct for the current change:

  • For write path (null value vector creation):
    • ON if only enabled at table level
    • Column level can override table level config if configured
  • For query path:
    • ON iff column level is explicitly enabled

IMO the above is the best way to keep backward compatibility

@Override
public IndexConfig getDefaultConfig() {
-  return IndexConfig.DISABLED;
+  return IndexConfig.ENABLED;
Contributor

When is this used?

Contributor Author

This is tricky. It is necessary to change this to keep backward compatibility in case we read the segment without providing a schema and/or table config, as is usual in tests (but not limited to them).

In that case, we end up delegating to the default index config. Before this PR, we returned DISABLED, but it wasn't problematic because in createIndexReader we completely ignored the IndexConfig. We returned the vector if and only if there was a buffer stored for this column and index. Now we don't do that. The new logic is to return not null if and only if:

  • There is a buffer for the column and index
  • And the index is enabled.

In normal cases, where table config and schema are provided, that means delegating to the config returned by createDeserializer, which basically is:

fieldSpec.getNullable() == Boolean.TRUE || fieldSpec.getNullable() == null && nullHandlingEnabled

That is good enough. But in cases where the schema or table config is not provided (which for some reason we do when reading segments in read-only mode, like in tests), we fall back to the default values. That is why we need to default to ENABLED in order to return the null index in this case.
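
The read-path gate described above can be sketched like this (illustrative only, not the exact NullValueIndexType code; the second method name is made up):

```java
public class NullIndexReadGate {
    // Mirrors the createDeserializer expression quoted above.
    static boolean indexEnabled(Boolean fieldSpecNullable, boolean nullHandlingEnabled) {
        return fieldSpecNullable == Boolean.TRUE
            || (fieldSpecNullable == null && nullHandlingEnabled);
    }

    // New logic: the reader is returned only when a buffer exists
    // AND the index is enabled for the column.
    static boolean returnsReader(boolean bufferPresent, Boolean fieldSpecNullable,
                                 boolean nullHandlingEnabled) {
        return bufferPresent && indexEnabled(fieldSpecNullable, nullHandlingEnabled);
    }

    public static void main(String[] args) {
        // A column explicitly marked non-nullable hides its null vector
        // even when a buffer exists.
        System.out.println(returnsReader(true, Boolean.FALSE, true));
    }
}
```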

@gortiz
Contributor Author

gortiz commented Oct 26, 2023

Let me know if my understanding is correct for the current change:

  • For write path (null value vector creation):

    • ON if only enabled at table level
    • Column level can override table level config if configured
  • For query path:

    • ON iff column level is explicitly enabled

IMO the above is the best way to keep backward compatibility

Not exactly. "Column level can override table level config if configured" affects the query path, not the write path. At least in phase 1. In the future (phase 2) we want column level to affect the write path as well.

Using your words, the semantics in phase 1 are:

  • For write path (null value vector creation):
    • ON iff only enabled at table level
  • For query path:
    • ON if column level is explicitly enabled
    • Column level can override table level config if configured

Contributor Author

This (and the other 2 tests in this file) started to fail as soon as I added new tests in NullHandling.json. I'm not sure why, but it happens when adding more queries, so it looks like an issue in the test framework.

Contributor

What exception did you get for these queries?

@Jackie-Jiang
Contributor

Not exactly. "Column level can override table level config if configured" affects the query path, not the write path. At least in phase 1. In the future (phase 2) we want column level to affect the write path as well.

Using your words, the semantics in phase 1 are:

  • For write path (null value vector creation):

    • ON iff only enabled at table level
  • For query path:

    • ON if column level is explicitly enabled
    • Column level can override table level config if configured

I don't quite follow then.

  • In the write (creation) path, we create the null value vector only when it is enabled. When column level is explicitly disabled, the null vector will be disabled even though it is enabled at table level, right?
  • In the loading path, we follow the same rule as the write path
  • In the registering path (for multi-stage engine), we register the column as nullable only when it is explicitly enabled at column level

@gortiz
Contributor Author

gortiz commented Oct 30, 2023

In the write (creation) path, we create the null value vector only when it is enabled. When column level is explicitly disabled, null vector will be disabled even though it is enabled at table level right?

That is not correct. There is a special case for the null value vector in SegmentColumnarIndexCreator that creates the index depending on config.isNullHandlingEnabled without actually reading the index configuration.

That is part of the code I would like to remove in a second phase.

In the loading path, we follow the same rule as write path

I don't know exactly what you mean by loading path. If you mean when we create the index reader, yes. With the changes included in this PR we start honoring column level nullability, defined as a function of config.isNullHandlingEnabled() and fieldSpec.getNullable().

But the V1 query engine doesn't care about whether the null value vector is enabled or not; it just cares about the enableNullHandling query option.

In the registering path (for multi-stage engine), we register the column as nullable only when it is explicitly enabled at column level

This is correct. And it is considered nullable iff fieldSpec.getNullable() == true || fieldSpec.getNullable() == null && config.isNullHandlingEnabled().

@gortiz gortiz changed the title [Draft] Add nullability to FieldSpec and use it in TypeFactory Add nullability to FieldSpec and use it in TypeFactory Oct 31, 2023
@gortiz gortiz marked this pull request as ready for review October 31, 2023 13:49
@gortiz
Contributor Author

gortiz commented Oct 31, 2023

I think the PR is ready for review. Some tests have been added (although more can be added in the future or if requested by reviewers).

@walterddr
Contributor

Using your words, the semantics in phase 1 are:

  • For write path (null value vector creation):

    • ON iff only enabled at table level
  • For query path:

    • ON if column level is explicitly enabled
    • Column level can override table level config if configured

^ for the query path, we should do:

  • ON if column level is explicitly enabled
  • Column level can BE overridden by query-level config if configured

table level config should not be used here.

@gortiz
Contributor Author

gortiz commented Nov 2, 2023

Column level can BE overridden by query-level config if configured

I don't think we can do this without deep changes to the code. Column level config (defined in Schema and TableConfig) cannot be changed by query level config right now, as they are concepts that apply in different contexts.

Situation right now

Schema and TableConfig define whether the null vector index buffer present in a segment is read or not*. At the time the index is being read, there is no concept of query-level config. In fact there is no query context at all.

When the query arrives at the server, servers prepare a DataSource for each column and segment used in the query. These DataSources know about the Schema but not about the TableConfig. These DataSources are then wrapped in a ProjectionBlockValSet, which implements BlockValSet. These BlockValSets do not carry type information (base type, multivalue-ness, or nullability). Neither DataSource nor BlockValSet does anything fancy with nulls. They just store the null bitmap and the raw segment data. For example, a column created from input [1, null, 2] (assuming IndexingConfig.isNullHandlingEnabled() is true) will store:

  • A null bitmap that says docId 1 is null
  • A buffer/array that stores [1, -2147483648**, 2]

These blocks are what Operators use at runtime. Operators have their own expected base type (int, long, string, etc.) and know whether they are multivalued or not, but do not carry nullability information. However, they are associated with a query context, so they decide whether to deal with nulls or not depending on QueryContext.isNullHandlingEnabled (which is configured by a query option). When they deal with nulls, they get the raw value array and the null bitmap and merge them.
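
The merge step can be pictured with a toy sketch (a real null bitmap is a RoaringBitmap; a boolean[] stands in for it here, and the class is illustrative):

```java
public class NullMerge {
    static final int INT_NULL_PLACEHOLDER = Integer.MIN_VALUE; // -2147483648

    // Combine the raw value array with the null bitmap to recover logical nulls.
    static Integer[] merge(int[] rawValues, boolean[] nullBitmap) {
        Integer[] logical = new Integer[rawValues.length];
        for (int docId = 0; docId < rawValues.length; docId++) {
            logical[docId] = nullBitmap[docId] ? null : rawValues[docId];
        }
        return logical;
    }

    public static void main(String[] args) {
        // Stored form of the [1, null, 2] example above.
        Integer[] logical = merge(new int[]{1, INT_NULL_PLACEHOLDER, 2},
                                  new boolean[]{false, true, false});
        System.out.println(java.util.Arrays.toString(logical));
    }
}
```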

Therefore Schema and TableConfig define whether the null value index is read or not, and QueryContext.isNullHandlingEnabled defines whether the Operators are going to use that null value index or not.

There is no way, AFAIK, to make Operators aware of the nullability defined in Schema and TableConfig, and there is also no way to make IndexType aware of QueryContext.isNullHandlingEnabled. So one cannot override the other.

What this PR does is change how NullValueIndexType.ReaderFactory.createIndexReader() behaves, so that by setting a column not nullable in the Schema, the null value index won't be read. I've added a table in the PR description indicating the new semantics.

*: In fact, in master right now Schema and TableConfig are ignored for the null value vector. The main change in this PR is precisely to not read the null index buffer if the Schema or TableConfig disable the index.

**: -2147483648 is the default null value for ints

Alternative

There is another way to implement a similar behavior, but it requires a bit more code:

  1. Don't change NullValueIndexType at all. Bitmaps will be returned even when Schema and TableConfig say the column is not nullable. It is weird, but that is how V1 works right now.
  2. Define the following query modes to deal with nulls:
    1. NEVER_NULLABLE -> Equal to current SET nullHandlingEnabled=false.
    2. ALWAYS_NULLABLE -> Equal to current SET nullHandlingEnabled=true.
    3. SCHEMA_BASED -> Explained above.
  3. In order to change the null mode, we can either change the query option nullHandlingEnabled to support more than true/false or add another option.
  4. In org.apache.pinot.segment.local.segment.index.datasource.BaseDataSource, add an overload NullValueVectorReader getNullValueVector(NullMode mode) alongside the existing NullValueVectorReader getNullValueVector():
    1. In case mode is NEVER_NULLABLE, always return null.
    2. In case mode is ALWAYS_NULLABLE, always return whatever is stored in the segment (this is the current behavior).
    3. In case mode is SCHEMA_BASED, the result will depend on the schema. Given that DataSource knows the schema, it can change its behavior:
      1. If the column is defined as nullable: return whatever is stored in the segment
      2. Else: return null
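
The alternative sketched above could look roughly like this (illustrative only: NullMode is the hypothetical enum from the list, and Object stands in for NullValueVectorReader):

```java
public class NullModeSketch {
    enum NullMode { NEVER_NULLABLE, ALWAYS_NULLABLE, SCHEMA_BASED }

    /** Returning null means "treat the column as not null". */
    static Object getNullValueVector(NullMode mode, Object storedVector,
                                     Boolean schemaSaysNullable) {
        switch (mode) {
            case NEVER_NULLABLE:
                return null; // SET nullHandlingEnabled=false
            case ALWAYS_NULLABLE:
                return storedVector; // SET nullHandlingEnabled=true (current behavior)
            case SCHEMA_BASED:
            default:
                // DataSource knows the schema, so it can gate on it.
                return Boolean.TRUE.equals(schemaSaysNullable) ? storedVector : null;
        }
    }

    public static void main(String[] args) {
        Object stored = "null-vector";
        System.out.println(getNullValueVector(NullMode.SCHEMA_BASED, stored, false));
    }
}
```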

The same should be done in BlockValSet. In ProjectionBlockValSet we would delegate to DataSource.getNullValueVector, and in the others we would behave as ALWAYS_NULLABLE. By doing so, other BlockValSets like TransformBlockValSet would call _valueBlock.getNullBitmap(mode) and deal with the result as they please.

Finally, all operators would need to call block.getNullBitmap(NullMode) and decide which mode to use based on the QueryContext.

This solution should work and, contrary to this PR, it should not break backward compatibility in any way. But it would require deeper code modifications.

I'll create a new PR with that approach so we can compare them.

}

-public void registerTable(Schema schema, String tableName) {
+public void registerTable(Schema schema, String tableName, boolean enableNullHandling) {
Contributor

Is the intention to generate null value vector for the table? IMO we should always generate null value vector. We will persist the null value vector only when there are null values.

Contributor Author

IMO we should always generate null value vector. We will persist the null value vector only when there are null values.

To be clear: the changes in the test engine were copied from @walterddr's PR. Anyway, I think it may be interesting to test the behavior when the null vector was not created at ingestion time (IndexingConfig.isNullHandlingEnabled was false at ingestion time).

At the same time, I agree that this situation is very simple (if there is no null vector, we assume there are no nulls) and therefore I would say it would be better to assume in tests that IndexingConfig.isNullHandlingEnabled is true by default and allow it to be optionally disabled.

I disagree. It is useful to test how we behave in case the file doesn't have a null vector and we ask for it. I know it is trivial (we just behave as if there were no null values).


return (TableConfig tableConfig, Schema schema) -> {
    ...
    tableConfig.getIndexingConfig().isNullHandlingEnabled()
        ? IndexConfig.ENABLED
        : IndexConfig.DISABLED));
Contributor

That is not correct. There is a special case for the null value vector in SegmentColumnarIndexCreator that creates the index depending on config.isNullHandlingEnabled without actually reading the index configuration.

Does that mean we create the null value vector for all columns only when it is configured in the table config, and the column level nullable is ignored, but when reading the null value vector we do check the column level nullable flag?
This behavior seems inconsistent, and there is no way to create the null value vector for only part of the columns. Can we make the write path logic the same as the read path? i.e. check if it is enabled at column level, then decide whether to create the null value vector?

Contributor Author

@gortiz gortiz Nov 3, 2023

Yesterday @walterddr and I were discussing this PR. We think that #11935 is a better approach given that it doesn't need to modify how the index is loaded. The changes there are conceptually very trivial (replacing the stored boolean _enableNullHandling with NullMode _nullMode). The main issue is that the null handling code is not well organized and very repetitive, so I ended up modifying more than 1k lines.

But we think we are going to focus on that PR instead of this one. What do you think?

@gortiz gortiz marked this pull request as draft November 3, 2023 08:11
@gortiz gortiz marked this pull request as ready for review November 16, 2023 09:40
@gortiz
Contributor Author

gortiz commented Jan 5, 2024

The same issue was solved in #11960

@gortiz gortiz closed this Jan 5, 2024