
Segment processing framework #5934

Merged: 12 commits into apache:master, Sep 15, 2020
Conversation

npawar
Contributor

@npawar npawar commented Aug 27, 2020

Description

#5753
A Segment Processing Framework to convert "m" input segments into "n" output segments.
The phases of the Segment Processor are:

  1. Map
  • Record transformation (using transform functions)
  • Record filtering (using filter functions)
  • Partitioning (column value based, transform function based, or based on the table config's partition config)
  2. Reduce
  • Rollup/concat records
  • Split into parts
  • Sort
  3. Segment generation
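The map and reduce phases above can be sketched conceptually in Python. This is an illustrative sketch of the data flow only, not the actual Java implementation; all function and parameter names here are hypothetical:

```python
from collections import defaultdict

def map_phase(records, transform, keep, partition_key):
    """Map stage: transform, filter, and partition input records."""
    partitions = defaultdict(list)
    for record in records:
        record = transform(record)           # record transformation
        if not keep(record):                 # record filtering
            continue
        partitions[partition_key(record)].append(record)
    return partitions

def reduce_phase(partitions, rollup, max_records_per_part):
    """Reduce stage: rollup records per partition, sort, split into parts.
    One segment would then be generated per part."""
    parts = []
    for _, records in sorted(partitions.items()):
        records = rollup(records)            # e.g. sum metrics for equal keys
        records.sort(key=lambda r: r["epochMillis"])
        for i in range(0, len(records), max_records_per_part):
            parts.append(records[i:i + max_records_per_part])
    return parts
```

Each part coming out of the reduce stage is independently fed to segment generation, which is what bounds segment size.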

A SegmentProcessorFrameworkCommand is provided to run this on demand:
bin/pinot-admin.sh SegmentProcessorFramework -segmentProcessorFrameworkSpec /<path>/spec.json
where spec.json is:

{
  "inputSegmentsDir": "/<base_dir>/segmentsDir",
  "outputSegmentsDir": "/<base_dir>/outputDir/",
  "schemaFile": "/<base_dir>/schema.json",
  "tableConfigFile": "/<base_dir>/table.json",
  "recordTransformerConfig": {
    "transformFunctionsMap": {
      "epochMillis": "round(epochMillis, 86400000)" // round to nearest day
    }
  },
  "recordFilterConfig": {
    "recordFilterType": "FILTER_FUNCTION",
    "filterFunction": "Groovy({epochMillis != \"1597795200000\"}, epochMillis)"
  },
  "partitioningConfig": {
    "partitionerType": "COLUMN_VALUE", // partition on epochMillis
    "columnName": "epochMillis"
  },
  "collectorConfig": {
    "collectorType": "ROLLUP", // rollup clicks by summing
    "aggregatorTypeMap": {
      "clicks": "SUM"
    }
  },
  "segmentConfig": {
    "maxNumRecordsPerSegment": 200000
  }
}
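For instance, the `round(epochMillis, 86400000)` transform in the spec buckets millisecond timestamps into days (86400000 ms = 1 day). Assuming floor semantics, i.e. rounding down to the bucket boundary, it behaves like this small sketch (the actual transform function's rounding behavior should be confirmed against Pinot's docs):

```python
def round_to_bucket(value_millis, bucket_millis=86_400_000):
    """Round an epoch-millis timestamp down to its bucket boundary.
    With the default bucket of 86400000 ms, this buckets values by day."""
    return (value_millis // bucket_millis) * bucket_millis
```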

Note:

  1. Currently this framework does not parallelize the map/reduce/segment-creation jobs: each input file is processed sequentially in the map stage, each part is processed sequentially in the reduce stage, and each segment is built one after another. We can change this in the future if the need arises to make this more advanced.
  2. The framework assumes there is enough memory to hold all records of a partition in memory during rollups in the reducer. As a safety measure, a limit of 5M records has been set on the reducer as the number of records to collect before forcing a flush. In the future we could consider off-heap processing if memory becomes a problem.
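The 5M-record safety valve amounts to a bounded in-memory collector that signals a flush once its size limit is hit. A hypothetical sketch of that idea (illustrative names only, not the actual Collector API):

```python
class BoundedRollupCollector:
    """Collect rolled-up records in memory; signal a flush once the
    number of distinct keys reaches max_records (e.g. 5_000_000)."""

    def __init__(self, max_records):
        self.max_records = max_records
        self.aggregates = {}

    def collect(self, key, clicks):
        """Aggregate clicks by key; return True when a flush is needed."""
        self.aggregates[key] = self.aggregates.get(key, 0) + clicks
        return len(self.aggregates) >= self.max_records

    def flush(self):
        """Emit the sorted aggregates and reset the in-memory state."""
        records = sorted(self.aggregates.items())
        self.aggregates.clear()
        return records
```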

This framework will typically be used by minion tasks that want to perform some processing on segments (e.g. a task that merges segments, or a task that aligns segments to time boundaries). The existing segment merge jobs can be changed to use this framework.

Pending
Enhancements (TODOs added in code):

  • Put null in GenericRecord if nullValueFields contains the field
  • Interface out the underlying file format (currently Avro)
  • Dedup
  • Use an off-heap based implementation for aggregation/sorting in the reduce stage
  • Two-step partitioner: 1) apply the custom partitioner, 2) apply the table config partitioner; combine both to get the final partition
  • Configs for segment name (like prefix)
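The proposed two-step partitioner would compose the two partition values into one final partition id. A hypothetical sketch of that composition (names and the key format are illustrative assumptions, not the planned implementation):

```python
def composite_partition(record, custom_partitioner, table_config_partitioner):
    """Apply the custom partitioner, then the table-config partitioner,
    and combine both values into a single final partition id."""
    p1 = custom_partitioner(record)
    p2 = table_config_partitioner(record)
    return f"{p1}_{p2}"
```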

@npawar npawar requested review from Jackie-Jiang and snleee August 27, 2020 22:31
@npawar npawar force-pushed the segment_processing_framework branch from 80e1c4d to eece981 Compare August 27, 2020 22:39
@codecov-commenter
Codecov Report

Merging #5934 into master will decrease coverage by 23.23%.
The diff coverage is 51.23%.


@@             Coverage Diff             @@
##           master    #5934       +/-   ##
===========================================
- Coverage   66.44%   43.20%   -23.24%     
===========================================
  Files        1075     1210      +135     
  Lines       54773    62540     +7767     
  Branches     8168     9529     +1361     
===========================================
- Hits        36396    27023     -9373     
- Misses      15700    33081    +17381     
+ Partials     2677     2436      -241     
Flag Coverage Δ
#integration 43.20% <51.23%> (?)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...ot/broker/broker/AllowAllAccessControlFactory.java 100.00% <ø> (ø)
.../helix/BrokerUserDefinedMessageHandlerFactory.java 52.83% <0.00%> (-13.84%) ⬇️
...ava/org/apache/pinot/client/AbstractResultSet.java 26.66% <0.00%> (-30.48%) ⬇️
.../main/java/org/apache/pinot/client/Connection.java 22.22% <0.00%> (-26.62%) ⬇️
.../org/apache/pinot/client/ResultTableResultSet.java 24.00% <0.00%> (-10.29%) ⬇️
.../org/apache/pinot/common/lineage/LineageEntry.java 0.00% <0.00%> (ø)
...apache/pinot/common/lineage/LineageEntryState.java 0.00% <0.00%> (ø)
...rg/apache/pinot/common/lineage/SegmentLineage.java 0.00% <0.00%> (ø)
...ache/pinot/common/lineage/SegmentLineageUtils.java 0.00% <0.00%> (ø)
...ot/common/messages/RoutingTableRebuildMessage.java 0.00% <0.00%> (ø)
... and 1132 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4fd70fe...eece981.

Contributor

@snleee snleee left a comment

some comments for high-level discussion

@npawar npawar force-pushed the segment_processing_framework branch from 2c642d6 to eece981 Compare September 1, 2020 17:58
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Good job splitting the work into multiple modules and making it very easy to extend. Mostly minor comments.
One high level comment: since we always use Avro as the intermediate format, should we directly work on GenericRecord instead of converting back and forth between GenericRow and GenericRecord?
Also, we might want to support more input formats other than Pinot segments. We can do it as the next step.

Contributor

@snleee snleee left a comment

Overall, I like that all the core components are interfaced out and easy to extend. I have put some comments. Some of them are questions or points that I would like to discuss.

@npawar
Contributor Author

npawar commented Sep 15, 2020

Overall, I like that all the core components are interfaced out and easy to extend. I have put some comments. Some of them are questions or points that I would like to discuss.

Addressed the comments. Added TODOs in the code and description for those that will be handled in future PRs.

Contributor

@snleee snleee left a comment

LGTM. Thank you for addressing all the comments!

@npawar npawar merged commit 41de9a6 into apache:master Sep 15, 2020
@npawar npawar deleted the segment_processing_framework branch September 15, 2020 16:49