Add nullability to FieldSpec and use it in TypeFactory #11824

gortiz wants to merge 19 commits into apache:master from
Conversation
Codecov Report
```diff
@@             Coverage Diff              @@
##             master   #11824       +/-   ##
=============================================
- Coverage     61.25%   27.60%    -33.65%
+ Complexity     1141      207       -934
=============================================
  Files          2373     2385        +12
  Lines        128263   129111       +848
  Branches      19799    19954       +155
=============================================
- Hits          78563    35646    -42917
- Misses        44013    90785    +46772
+ Partials       5687     2680     -3007
```
... and 1200 files with indirect coverage changes
Tests are now passing. We need two extra things:
The long term goal is to deprecate `IndexingConfig.isNullHandlingEnabled()` and rely only on this field to decide whether a column is nullable. Currently `IndexingConfig.isNullHandlingEnabled()` is used only to decide whether to create the null value vector for the columns, and very likely we will make it true by default. Do you see a clear path to move away from the table level config?

There is a potentially big performance impact when a column is registered as nullable, so I wouldn't register it as nullable unless it is explicitly configured nullable here.
- should we default it to false unless otherwise specified (and thus no need for a non-primitive boolean)?
- as long as the query flag `EnableNullHandling` overrides this setting we should be good for v1 compatibility --> as in v2 we will not use this option
> Do you see a clear path to move away from the table level config?

I think it is clear:

- Phase 1:
  - For read path: Use `FieldSpec` nullability if declared, otherwise delegate to `IndexingConfig.isNullHandlingEnabled()`
  - For write path: Use `IndexingConfig.isNullHandlingEnabled()`
- Phase 2:
  - For read path (same as Phase 1): Use `FieldSpec` nullability if declared, otherwise delegate to `IndexingConfig.isNullHandlingEnabled()`
  - For write path: Use `FieldSpec` nullability if declared, otherwise delegate to `IndexingConfig.isNullHandlingEnabled()`
- Phase 3:
  - Deprecate `IndexingConfig.isNullHandlingEnabled()` (although we still need to support it for compatibility reasons)
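The phased semantics above can be sketched in a few lines. This is a minimal illustration under assumed names (`readPathNullable`, `phase1WritePathNullable`, `phase2WritePathNullable` are hypothetical helpers, not Pinot's actual API):

```java
// Minimal sketch of the phased read/write semantics described above.
// Method and parameter names are illustrative, not Pinot's actual API.
public final class NullabilityPhases {

  /** Phase 1 and 2 read path: column-level nullability wins when declared,
   *  otherwise fall back to the table-level flag. */
  public static boolean readPathNullable(Boolean fieldSpecNullable, boolean tableNullHandlingEnabled) {
    return fieldSpecNullable != null ? fieldSpecNullable : tableNullHandlingEnabled;
  }

  /** Phase 1 write path: only the table-level flag decides (FieldSpec is ignored). */
  public static boolean phase1WritePathNullable(Boolean fieldSpecNullable, boolean tableNullHandlingEnabled) {
    return tableNullHandlingEnabled;
  }

  /** Phase 2 write path: same rule as the read path. */
  public static boolean phase2WritePathNullable(Boolean fieldSpecNullable, boolean tableNullHandlingEnabled) {
    return readPathNullable(fieldSpecNullable, tableNullHandlingEnabled);
  }

  public static void main(String[] args) {
    // An explicit FieldSpec value overrides the table config on the read path.
    System.out.println(readPathNullable(Boolean.FALSE, true));  // false
    // An undeclared FieldSpec value falls back to the table config.
    System.out.println(readPathNullable(null, true));           // true
  }
}
```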
> There are potential big performance impact when a column is registered as nullable, so I wouldn't register it nullable unless it is explicitly configured nullable here.

Sure, we need to document that and users should know it. With this change we actually expect to reduce the number of nullable columns, given that what we are providing (especially in phase 1) is the ability to make columns not nullable in tables that are nullable by default.
> should we default it to false unless otherwise specified (and thus no need for non-primitive boolean)

I don't think that is a good idea. Without making this non-primitive we cannot know whether the value was explicitly defined as false or has defaulted to false. We would also be breaking backward compatibility. By using a primitive boolean with false by default, in order to make a column nullable users would need to:

- In Phase 1:
  - Set `IndexingConfig.isNullHandlingEnabled()` to true. This is required at write time, given that the V1 write path ignores `FieldSpec` nullability.
  - Set, for each nullable column, `FieldSpec` nullability to true. This is a breaking change. All tables already created would need to be modified, otherwise they would suddenly be treated as not null.
- In Phase 2:
  - Set, for each nullable column, `FieldSpec` nullability to true.

The current implementation is backward compatible (users don't need to set `FieldSpec` nullability). The semantics can be read in NullValueIndexTypeTest. Another advantage of using a non-primitive boolean is that we don't lose information when field specs are serialized and deserialized, without having to do anything fancy.
For example, a `FieldSpec` where nullability is not defined will be serialized to:

```json
{
  "name": "col1",
  "dataType": "INT"
}
```

while in your proposal it would be serialized by default to:

```json
{
  "name": "col1",
  "dataType": "INT",
  "nullable": false
}
```

Alternatively we could not serialize `nullable` when its value is false, but then we would lose the information about when it was explicitly declared.
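The information-loss argument can be shown with a toy serializer. This is a hypothetical sketch (not Pinot's actual Jackson-based serialization): a tri-state `Boolean` lets us omit the key when it was never set, while still round-tripping an explicit `false`.

```java
// Toy illustration (not Pinot's serializer): a non-primitive Boolean
// distinguishes "never set" (field omitted) from an explicit false.
public final class NullableSerialization {

  public static String serialize(Boolean nullable) {
    if (nullable == null) {
      // Undeclared: the "nullable" key is simply omitted.
      return "{\"name\": \"col1\", \"dataType\": \"INT\"}";
    }
    // Explicitly declared: the value is preserved, even when it is false.
    return "{\"name\": \"col1\", \"dataType\": \"INT\", \"nullable\": " + nullable + "}";
  }

  public static void main(String[] args) {
    System.out.println(serialize(null));   // key omitted
    System.out.println(serialize(false));  // explicit declaration preserved
  }
}
```

With a primitive `boolean` both cases would serialize identically, which is exactly the lost information discussed above.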
> as long as the query flag EnableNullHandling overrides this setting we should be good for v1 compatibility --> as in v2 we will not use this option

The `enableNullHandling` query flag does not override this value. They are applied at different parts of the code.

- Column level nullability indicates:
  - In phase 1: whether the null value vector should be considered for this column or not.
  - In phase 2: the phase 1 semantics, plus it is also used to actually decide whether the null value vector is generated or not.

The `enableNullHandling` query flag indicates whether we use the null value vector during query execution or not. It basically means changing the algorithm used to evaluate some operators. For example, a `count(col1)` without `enableNullHandling` can be resolved by reading segment metadata, while the same count with `enableNullHandling` also requires reading the null value vector to subtract the number of null values in the column.
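The `count(col1)` example above can be sketched as follows. This is a simplified illustration: a `BitSet` stands in for Pinot's per-column null value vector, and `totalDocs` stands in for the segment metadata shortcut; the names are not Pinot's actual API.

```java
import java.util.BitSet;

// Sketch of the two count evaluation strategies described above.
public final class CountExample {

  /** Without enableNullHandling: answered from segment metadata alone. */
  public static long countWithoutNullHandling(long totalDocs) {
    return totalDocs;
  }

  /** With enableNullHandling: subtract the nulls recorded in the null value vector. */
  public static long countWithNullHandling(long totalDocs, BitSet nullValueVector) {
    return totalDocs - nullValueVector.cardinality();
  }

  public static void main(String[] args) {
    BitSet nulls = new BitSet();
    nulls.set(3);
    nulls.set(7);  // docs 3 and 7 hold NULL in col1
    System.out.println(countWithoutNullHandling(10));      // 10
    System.out.println(countWithNullHandling(10, nulls));  // 8
  }
}
```

This is why the null-aware variant is inherently more expensive: it must touch the per-column index instead of metadata only.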
Sorry for following up late:

> Phase 2: ... otherwise delegate to `IndexingConfig.isNullHandlingEnabled()`?
> Phase 3: Deprecate `IndexingConfig.isNullHandlingEnabled()` (although we still need to support it for compatibility reasons)

^ when it is deprecated in phase 3, what's the logic originally in phase 2? is it false or true?
> There are potential big performance impact

IMO there are 4 scenarios currently:

1. `IndexingConfig.isNullHandlingEnabled() = true` and `SET enableNullHandling=true;`
2. `IndexingConfig.isNullHandlingEnabled() = false` and `SET enableNullHandling=true;`
3. `IndexingConfig.isNullHandlingEnabled() = true` and `SET enableNullHandling=false;`
4. `IndexingConfig.isNullHandlingEnabled() = false` and `SET enableNullHandling=false;`

as long as the following are met we should be good:

- performance on query should be the same on both 3 & 4 --> we should guarantee the same if all columns are marked as non-nullable by the user
- performance on query should be the same on both 1 & 2, and will be worse than 3 & 4 b/c it accesses the null vector --> we should guarantee better performance when some columns are marked as non-nullable by the user
> I don't think that is a good idea. Without making this non-primitive we cannot know whether the value was explicitly defined as false or has defaulted to false. We would also be breaking backward compatibility. By using a primitive boolean with false by default, in order to make a column nullable users would need to:
>
> In Phase 1:
> - Set `IndexingConfig.isNullHandlingEnabled()` to true. This is required at write time, given that the V1 write path ignores `FieldSpec` nullability.
> - Set, for each nullable column, `FieldSpec` nullability to true. This is a breaking change. All tables already created would need to be modified, otherwise they would suddenly be treated as not null.
>
> In Phase 2:
> - Set, for each nullable column, `FieldSpec` nullability to true.

there's no backward-compatibility issue here, b/c the default behavior of V1 is all non-nullable, so setting it to a primitive boolean with default false exactly achieves that.

you don't need to set nullability = true for all columns, b/c

- in v1 this can be controlled (and should still be controlled) by the `SET enableNullHandling=true` option regardless of whether this field is `boolean` or `Boolean`
- in v2 there's no such option, so it will honor the nullability setting --> which requires it to be set explicitly to true anyway, otherwise nothing works.
@walterddr Tests are failing now because when null handling is enabled at query time, the query plan changes. For example, OfflineClusterIntegrationTest.testNonScanAggregationQueries fails. We can improve that, but I think we should do that in another PR. Therefore I think I should modify these tests so the offending queries do not assume they will be non-scan based. What do you think?
pinot-query-planner/src/main/java/org/apache/pinot/query/type/TypeSystem.java
In addition, let's add some tests to the NullHandling.json test file to validate that the handling of null values is being considered.
This only affects V2, right?
Yes. As we discussed, we could also decide to only do this in case the key is not already present. By doing that we would evaluate nulls by default, but we could remove them by setting the `enableNullHandling` option to false at query time.
I'm going to remove this change. Let's decide whether we want to enable null handling in the leaves in a following iteration. In production this should have no impact, but it will keep the current behavior when `ImmutableSegment.load(File, ReadMode)` is called.
Jackie-Jiang left a comment:
Let me know if my understanding is correct for the current change:

- For write path (null value vector creation):
  - ON only if enabled at table level
  - Column level can override table level config if configured
- For query path:
  - ON iff column level is explicitly enabled

IMO the above is the best way to keep backward compatibility.
pinot-query-planner/src/test/java/org/apache/pinot/query/QueryEnvironmentTestBase.java
```diff
 @Override
 public IndexConfig getDefaultConfig() {
-  return IndexConfig.DISABLED;
+  return IndexConfig.ENABLED;
```
This is tricky. It is necessary to change that to keep backward compatibility in case we read the segment without providing schema and/or table config, as is usual in tests (but not limited to them).

In that case, we end up delegating to the default index config. Before this PR we returned DISABLED, but it wasn't problematic because in `createIndexReader` we completely ignored the `IndexConfig`: we returned the vector if and only if there was a buffer stored for this column and index. Now we don't do that. The new logic is to return not null if and only if:

- There is a buffer for the column and index
- And the index is enabled.

In normal cases, where table config and schema are provided, that means delegating to the config returned by `createDeserializer`, which basically is:

```java
fieldSpec.getNullable() == Boolean.TRUE
    || (fieldSpec.getNullable() == null && nullHandlingEnabled)
```

That is good enough. But in cases where schema or table config is not provided (which is something we do, for some reason, when reading segments in read-only mode, like when doing tests) we delegate to the default values. That is why we need to default to ENABLED in order to return the null index in this case.
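Put together, the reader-creation rule described above amounts to the sketch below. Names are illustrative (this is not the actual `createIndexReader` signature), but the resolution expression is the one quoted above.

```java
// Sketch of the new logic: a null value vector reader is returned iff a buffer
// exists for the column AND the resolved index config enables it.
// Names are illustrative, not Pinot's actual API.
public final class NullVectorReaderRule {

  /** Resolution quoted above: explicit FieldSpec value wins, else table flag. */
  public static boolean indexEnabled(Boolean fieldSpecNullable, boolean nullHandlingEnabled) {
    return fieldSpecNullable == Boolean.TRUE
        || (fieldSpecNullable == null && nullHandlingEnabled);
  }

  public static boolean returnsReader(boolean bufferPresent, Boolean fieldSpecNullable,
      boolean nullHandlingEnabled) {
    return bufferPresent && indexEnabled(fieldSpecNullable, nullHandlingEnabled);
  }

  public static void main(String[] args) {
    // Before this PR the buffer alone decided. Now an explicitly non-nullable
    // column hides its null value vector even if the buffer exists.
    System.out.println(returnsReader(true, Boolean.FALSE, true));  // false
    System.out.println(returnsReader(true, null, true));           // true
  }
}
```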
Not exactly. "Column level can override table level config if configured" affects the query path, not the write path. At least in phase 1. In the future (phase 2) we want column level to affect the write path as well. Using your words, the semantics in phase 1 are:
This (and the other 2 tests in this file) started to fail as soon as I added new tests in NullHandling.json. I'm not sure why, but it happens when adding more queries, so it looks like an issue in the test framework.
What exception did you get for these queries?
pinot-query-runtime/src/test/resources/queries/NullHandling.json
I don't quite follow then.
That is not correct. There is a special case for null value vector in `SegmentColumnarIndexCreator` that creates the index depending on `config.isNullHandlingEnabled` without actually reading the index configuration. That is part of the code I would like to remove in a second phase.

I don't know exactly what you mean by the loading part. If you mean when we create the index reader, yes. With the changes included in this PR we start honoring the column level nullability, defined in function of
But in the V1 query engine we don't care about whether the null value vector is enabled or not, we just care about
This is correct. And it is considered nullable iff
I think the PR is ready for review. Some tests have been added (although more can be added in the future or if they are requested by reviewers).
^ for query path: table level config should not be used here.
I don't think we can do this without actually applying deep changes to the code. Column level config (defined in Schema and TableConfig) cannot be changed by query level config right now, as they are concepts that apply to different contexts.

**Situation right now**

Schema and TableConfig define whether the null vector index buffer present in a segment is read or not*. At the time the index is being read, there is no concept of query-level config; in fact there is no query context at all. When the query arrives at the server, servers prepare a
These blocks are what
Therefore
There is no way, AFAIK, to make Operators aware of nullability defined in
What this PR does is to change how

*: In fact, in master right now Schema and TableConfig are ignored for the null value vector. The main change in this PR is precisely to change that: do not read the null index buffer if Schema or TableConfig disable the index.

**: -2147483648 is the default null value for ints.

**Alternative**

There is another way to implement a similar behavior, but it requires a bit more code:

The same should be done in
Lastly, all operators would need to call

This solution should work and, contrary to this PR, it should not break backward compatibility in any way. But it would require deeper code modifications. I'll create a new PR with that approach so we can compare them.
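The footnote above mentions that -2147483648 (i.e. `Integer.MIN_VALUE`) is the default null value for ints. As a small illustration: without null handling, a null cell surfaces as that placeholder instead of SQL NULL. The helper below is hypothetical, not Pinot's actual code.

```java
// Hypothetical illustration of the int default-null placeholder mentioned above.
public final class DefaultNullValue {

  // -2147483648 == Integer.MIN_VALUE
  public static final int INT_DEFAULT_NULL = Integer.MIN_VALUE;

  /** Without null handling, a stored null is read back as the placeholder value. */
  public static int readWithoutNullHandling(Integer stored) {
    return stored == null ? INT_DEFAULT_NULL : stored;
  }

  public static void main(String[] args) {
    System.out.println(readWithoutNullHandling(null));  // -2147483648
    System.out.println(readWithoutNullHandling(42));    // 42
  }
}
```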
```diff
 }

-public void registerTable(Schema schema, String tableName) {
+public void registerTable(Schema schema, String tableName, boolean enableNullHandling) {
```
Is the intention to generate the null value vector for the table? IMO we should always generate the null value vector. We will persist the null value vector only when there are null values.
> IMO we should always generate null value vector. We will persist the null value vector only when there are null values.

I have to be clear: the changes in the test engine were copied from @walterddr's PR. Anyway, I think it may be interesting to test the behavior when the null vector was not created at ingestion time (i.e. `IndexingConfig.isNullHandlingEnabled` was false at ingestion time).

At the same time, I agree that this situation is very simple (if there is no null vector, we assume there are no nulls) and therefore I would say it would be better to assume in tests that by default the table's `IndexingConfig.isNullHandlingEnabled` is true, and allow optionally disabling it.

I disagree. It is useful to test how we behave in case the file doesn't have a null vector and we ask for it. I know it is trivial (we just behave as if it was not null value
```java
tableConfig.getIndexingConfig().isNullHandlingEnabled()
    ? IndexConfig.ENABLED
    : IndexConfig.DISABLED));
return (TableConfig tableConfig, Schema schema) -> {
```
> That is not correct. There is a special case for null value vector in SegmentColumnarIndexCreator that creates the index depending on config.isNullHandlingEnabled without actually reading index configuration.

Does that mean we create the null value vector for all columns only when it is configured in the table config, and the column level nullable flag is ignored, while when reading the null value vector we do check the column level nullable flag?

This behavior seems inconsistent, and there is no way to only create the null value vector for part of the columns. Can we make the write path logic the same as the read path? i.e. check if it is enabled at column level, then decide whether to create the null value vector?
Yesterday @walterddr and I were discussing this PR. We think that #11935 is a better approach, given that it doesn't need to modify how the index is loaded. The changes there are very trivial (changing from storing `boolean _enableNullHandling` to `NullMode _nullMode`). The main issue is that the null handling code is not well organized and very repetitive, so I ended up modifying more than 1k lines.

But we think we are going to focus on that PR instead of this one. What do you think?
pinot-query-planner/src/main/java/org/apache/pinot/query/type/TypeFactory.java
The same issue was solved in #11960.
This is a PR that intends to implement column level nullability, as explained in #10381.

Current status:

- `nullHandlingEnabled` was `false`.
- Modify V2 leaves to force `enableNullHandling=true`.

Semantics in V1 are changed to:

*: Value of `FieldSpec.getNullable()`