Skip to content

Add minMaxInvalid flag to avoid unnecessary needPreprocess#9238

Merged
npawar merged 2 commits intoapache:masterfrom
npawar:min_max_issue
Aug 31, 2022
Merged

Add minMaxInvalid flag to avoid unnecessary needPreprocess#9238
npawar merged 2 commits intoapache:masterfrom
npawar:min_max_issue

Conversation

@npawar
Copy link
Contributor

@npawar npawar commented Aug 18, 2022

Segment build: If min/max value is invalid (based on some definitions we have set like presence of , or empty), we skip putting the min/max value in segment metadata.properties.
Segment load: If either min or max is not available in segment metadata.properties, the ColumnMetadata for it returns null for both getMin and getMax. If both these values are null, but ColumnMinMaxGeneratorMode != NONE, we will mark this as "needPreprocess" and process the segment during segment load (either when it is ingested, or during server restart). As part of the preprocess however, we will again skip persisting the min/max value, since it is still invalid, and will always be for a segment with the same crc.

As a result, in the above combination (invalid min or max value), we always incur preprocessing cost during restart/load, even though the actual preprocessing is a noop. This may not seem much in Pinot using LocalSegmentFSDirectory, but can become very expensive in other implementations, where initing the SegmentDirectory might involve more expensive means. So we want to avoid getting into the else of this code when it is unnecessary. From BaseTableDataManager:

try {
      // If the segment is still kept by the server, then we can
      // either load it directly if it's still consistent with latest table config and schema;
      // or reprocess it to reflect latest table config and schema before loading.
      if (!ImmutableSegmentLoader.needPreprocess(segmentDirectory, indexLoadingConfig, schema)) {
        LOGGER.info("Segment: {} of table: {} is consistent with latest table config and schema", segmentName,
            _tableNameWithType);
      } else {
        LOGGER.info("Segment: {} of table: {} needs reprocess to reflect latest table config and schema", segmentName,
            _tableNameWithType);
        segmentDirectory.copyTo(indexDir);
        // Close the stale SegmentDirectory object and recreate it with reprocessed segment.
        closeSegmentDirectoryQuietly(segmentDirectory);
        ImmutableSegmentLoader.preprocess(indexDir, indexLoadingConfig, schema);
        segmentDirectory =
            initSegmentDirectory(segmentName, String.valueOf(zkMetadata.getCrc()), indexLoadingConfig, schema);
      }

Adding a flag in metadata.properties to detect this invalid min/max value case, so we can use it to avoid returning needPreprocess=true when preprocess would be a noop.

@npawar npawar added the bugfix label Aug 18, 2022
@codecov-commenter
Copy link

codecov-commenter commented Aug 18, 2022

Codecov Report

Merging #9238 (072e76e) into master (6fddffe) will increase coverage by 38.49%.
The diff coverage is 69.73%.

❗ Current head 072e76e differs from pull request most recent head 37f4503. Consider uploading reports for the commit 37f4503 to get more accurate results

@@              Coverage Diff              @@
##             master    #9238       +/-   ##
=============================================
+ Coverage     28.60%   67.10%   +38.49%     
- Complexity       53     4825     +4772     
=============================================
  Files          1844     1385      -459     
  Lines         98570    72251    -26319     
  Branches      15004    11578     -3426     
=============================================
+ Hits          28200    48483    +20283     
+ Misses        67667    20239    -47428     
- Partials       2703     3529      +826     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 67.10% <69.73%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...g/apache/pinot/common/metrics/ControllerGauge.java 0.00% <0.00%> (-97.83%) ⬇️
...a/org/apache/pinot/common/metrics/ServerMeter.java 98.46% <ø> (-1.54%) ⬇️
...t/common/response/broker/BrokerResponseNative.java 94.08% <ø> (+2.95%) ⬆️
...org/apache/pinot/common/utils/http/HttpClient.java 12.08% <ø> (-63.19%) ⬇️
...ata/manager/realtime/RealtimeTableDataManager.java 11.49% <0.00%> (-56.87%) ⬇️
...e/pinot/core/query/request/ServerQueryRequest.java 45.94% <0.00%> (-45.73%) ⬇️
...che/pinot/core/query/scheduler/QueryScheduler.java 71.81% <ø> (-10.74%) ⬇️
...he/pinot/core/transport/ChannelHandlerFactory.java 50.00% <0.00%> (-50.00%) ⬇️
...inot/query/runtime/operator/AggregateOperator.java 81.39% <ø> (+81.39%) ⬆️
.../columnminmaxvalue/ColumnMinMaxValueGenerator.java 73.68% <0.00%> (+73.68%) ⬆️
... and 1674 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@npawar npawar marked this pull request as ready for review August 18, 2022 15:55
Copy link
Contributor

@klsince klsince left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Copy link
Member

@jackjlli jackjlli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for fixing this!

@npawar npawar merged commit b837350 into apache:master Aug 31, 2022
@npawar npawar deleted the min_max_issue branch August 31, 2022 05:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants