Implemented BoundedColumnValue partition function#8224
Conversation
Codecov Report
@@ Coverage Diff @@
## master #8224 +/- ##
============================================
+ Coverage 63.98% 69.53% +5.54%
- Complexity 4239 4241 +2
============================================
Files 1584 1631 +47
Lines 83250 85276 +2026
Branches 12608 12844 +236
============================================
+ Hits 53268 59293 +6025
+ Misses 26144 21847 -4297
- Partials 3838 4136 +298
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
bec3162 to
ef0384e
Compare
ef0384e to
c840ebf
Compare
|
Hi @kmozaid thanks for taking the time to make this contribution. Can you explain what problem this solves? Is it because you already have a partitioning and you want to maintain locality within partitions? |
Hi @richardstartin , We have a table where data is being ingested from multiple sources. (these multiple sources pushes data to same kafka topic). Data is kept for 5 days in realtime table and then moved offline table by minion task. We want to keep data from these sources in separate segments for offline table. There is a column which identifies the source. Just a note regarding the image - The partitionId mentioned in image are source1, source2 and source 3 although they would be integer value based on the their position in configured |
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/creator/ColumnStatistics.java
Outdated
Show resolved
Hide resolved
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/ColumnPartitionConfig.java
Show resolved
Hide resolved
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/ColumnPartitionConfig.java
Outdated
Show resolved
Hide resolved
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/ColumnPartitionConfig.java
Show resolved
Hide resolved
...rc/main/java/org/apache/pinot/segment/spi/partition/BoundedColumnValuePartitionFunction.java
Show resolved
Hide resolved
...oker/src/main/java/org/apache/pinot/broker/routing/segmentpruner/PartitionSegmentPruner.java
Outdated
Show resolved
Hide resolved
| * } | ||
| * } | ||
| * } | ||
| * With this partition config on column "subject", partitionId would be 1 for Maths, 2 for English and so on. |
There was a problem hiding this comment.
Should we consider putting other values as the last partition in case all values are already configured, and no value goes into partition 0?
There was a problem hiding this comment.
I wanted to keep a fixed value of partitionId for unknown column values. When putting other/unknown values as last partition, the partition id would different every time partition config is changed to add/remove another column value.
When backfilling data for a particular source/tenant (in our example, a subject), we can easily ignore segments which has partitionId set to 0 in segment metadata because we know for sure that it could have data for many different subjects.
...rc/main/java/org/apache/pinot/segment/spi/partition/BoundedColumnValuePartitionFunction.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/apache/pinot/segment/spi/partition/BoundedColumnValuePartitionFunction.java
Outdated
Show resolved
Hide resolved
...i/src/main/java/org/apache/pinot/segment/spi/partition/metadata/ColumnPartitionMetadata.java
Outdated
Show resolved
Hide resolved
...egment-spi/src/main/java/org/apache/pinot/segment/spi/index/metadata/ColumnMetadataImpl.java
Outdated
Show resolved
Hide resolved
Jackie-Jiang
left a comment
There was a problem hiding this comment.
LGTM otherwise. Please reformat the changes with latest Pinot Style
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/ColumnPartitionConfig.java
Outdated
Show resolved
Hide resolved
...gment-spi/src/main/java/org/apache/pinot/segment/spi/partition/PartitionFunctionFactory.java
Outdated
Show resolved
Hide resolved
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/partition/PartitionFunction.java
Show resolved
Hide resolved
...rc/main/java/org/apache/pinot/segment/spi/partition/BoundedColumnValuePartitionFunction.java
Show resolved
Hide resolved
...rc/main/java/org/apache/pinot/segment/spi/partition/BoundedColumnValuePartitionFunction.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/apache/pinot/segment/spi/partition/BoundedColumnValuePartitionFunction.java
Outdated
Show resolved
Hide resolved
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/creator/StatsCollectorConfig.java
Outdated
Show resolved
Hide resolved
...egment-spi/src/main/java/org/apache/pinot/segment/spi/index/metadata/ColumnMetadataImpl.java
Outdated
Show resolved
Hide resolved
...i/src/main/java/org/apache/pinot/segment/spi/partition/metadata/ColumnPartitionMetadata.java
Outdated
Show resolved
Hide resolved
| */ | ||
| public ColumnPartitionMetadata(String functionName, int numPartitions, Set<Integer> partitions) { | ||
| public ColumnPartitionMetadata(String functionName, int numPartitions, Set<Integer> partitions, | ||
| Map<String, String> functionConfig) { |
There was a problem hiding this comment.
(code format)
| Map<String, String> functionConfig) { | |
| @Nullable Map<String, String> functionConfig) { |
90b2317 to
5bc1ee8
Compare
5bc1ee8 to
f2f131f
Compare
|
Thanks @Jackie-Jiang and @richardstartin for your prompt reviews, appreciate it! |

Description
This PR adds a new implementation of
PartitionFunctionwhich is used to partition segments. The new partition function namedBoundedColumnValuecan be used to partition segments on column values and still generating partitionId of integer type.Example Usage -
PartitionId is generated based on position in columnValues. PartitionId would 1 for Maths, 2 for English and so on.
PartitionId 0 is reserved for any other subject which are not present in given
columnValues. The different column values can be specified using pipe (|) separation. This additional partitionfunctionConfigis persisted in metadata.properties and segment metadata in zookeeper. Broker can also use this function to prune segments.Upgrade Notes
Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)
Does this PR fix a zero-downtime upgrade introduced earlier?
Does this PR otherwise need attention when creating release notes? Things to consider:
Release Notes
Added new partition function called
BoundedColumnValueto be able to partition segments based on column value.Documentation
Yes, will create another PR in pinot-docs repo.