column major segment build for columnar datasource#16727
column major segment build for columnar datasource#16727xiangfu0 merged 6 commits intoapache:masterfrom
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #16727 +/- ##
============================================
+ Coverage 63.23% 63.25% +0.01%
+ Complexity 1429 1428 -1
============================================
Files 3104 3110 +6
Lines 183349 183718 +369
Branches 28096 28147 +51
============================================
+ Hits 115942 116206 +264
- Misses 58447 58535 +88
- Partials 8960 8977 +17
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
2d4eb54 to
b9dd171
Compare
...he/pinot/segment/local/segment/creator/impl/stats/ColumnarSegmentPreIndexStatsContainer.java
Outdated
Show resolved
Hide resolved
...he/pinot/segment/local/segment/creator/impl/stats/ColumnarSegmentPreIndexStatsContainer.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java
Outdated
Show resolved
Hide resolved
|
@Jackie-Jiang, could you review or assign somebody to review this. |
tarun11Mavani
left a comment
There was a problem hiding this comment.
Added few comments.
| * @param columnReader ColumnReader for the column data | ||
| * @throws IOException if indexing fails | ||
| */ | ||
| public void indexColumn(String columnName, ColumnReader columnReader) throws IOException { |
There was a problem hiding this comment.
Can add this method signature to the interface SegmentCreator.
|
|
||
| ColumnReader columnReader; | ||
|
|
||
| if (hasColumn(columnName)) { |
There was a problem hiding this comment.
We are skipping VirtualColumns in getAllColumnReaders but this can still be called for virtual column from somewhere else. We should skip VirtualColumns here as well.
There was a problem hiding this comment.
makes sense, adding an error scenario if invoked for virtual columns
...c/main/java/org/apache/pinot/segment/local/segment/readers/PinotSegmentColumnReaderImpl.java
Outdated
Show resolved
Hide resolved
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/ColumnReader.java
Show resolved
Hide resolved
...ocal/src/main/java/org/apache/pinot/segment/local/recordtransformer/DataTypeTransformer.java
Outdated
Show resolved
Hide resolved
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/ColumnReader.java
Outdated
Show resolved
Hide resolved
| * <p>Implementations should handle data type conversions, default values for new columns, | ||
| * and efficient column-wise data access patterns. | ||
| */ | ||
| public interface ColumnReader extends Closeable, Serializable { |
There was a problem hiding this comment.
This is mostly useful when reading from a column major file format such as Parquet. Can you check if this interface can be used for Parquet reader?
There was a problem hiding this comment.
The interface looks good, but one will need to create a ParquetSegmentColumnReaderFactory similar to PinotSegmentColumnReaderFactory which should be able to correctly initialise a ParquetSegmentColumnReaderImpl for a given parquet file to be read.
On a high level Parquet files are analogous to a Pinot segment as collection of columns and there is a similar low level columnar reader: https://github.com/apache/parquet-java/blob/f50dd6cb4b526cf4b585993c1b69a838cd8151f3/parquet-column/src/main/java/org/apache/parquet/column/ColumnReader.java#L40
There was a problem hiding this comment.
I do plan to implement this in future, but not an immediate priority. I can create an issue for this as follow up.
| * | ||
| * @return true if the current value is null, false otherwise | ||
| */ | ||
| boolean isNull(); |
There was a problem hiding this comment.
Why do we need this when next() can return null?
There was a problem hiding this comment.
Can be handled in next(), let me update. Originally, idea was to add custom null check handling in isNull(), e.g. "null" string --> null
...c/main/java/org/apache/pinot/segment/local/segment/readers/PinotSegmentColumnReaderImpl.java
Outdated
Show resolved
Hide resolved
|
Offline discussion with @Jackie-Jiang :
|
|
@Jackie-Jiang, this is ready to review. |
bb0337e to
e68a987
Compare
Jackie-Jiang
left a comment
There was a problem hiding this comment.
The overall approach looks good!
@KKcorps @noob-se7en Can you help review this?
| * @param columnReaders Map of column name to ColumnReader instances | ||
| */ | ||
| public ColumnarSegmentCreationDataSource(ColumnReaderFactory columnReaderFactory, | ||
| Map<String, ColumnReader> columnReaders) { |
There was a problem hiding this comment.
(code format) Can you reformat all the changes with Pinot Style
noob-se7en
left a comment
There was a problem hiding this comment.
mostly looks good, Added some minor comments. Will take another look tomorrow.
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/ColumnReaderFactory.java
Outdated
Show resolved
Hide resolved
...ain/java/org/apache/pinot/segment/local/segment/readers/PinotSegmentColumnReaderFactory.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java
Outdated
Show resolved
Hide resolved
...he/pinot/segment/local/segment/creator/impl/stats/ColumnarSegmentPreIndexStatsContainer.java
Outdated
Show resolved
Hide resolved
...he/pinot/segment/local/segment/creator/impl/stats/ColumnarSegmentPreIndexStatsContainer.java
Outdated
Show resolved
Hide resolved
...he/pinot/segment/local/segment/creator/impl/stats/ColumnarSegmentPreIndexStatsContainer.java
Outdated
Show resolved
Hide resolved
| _columnStatsCollectorMap = new HashMap<>(); | ||
| _totalDocCount = -1; // indicates unset | ||
|
|
||
| initializeStatsCollectors(); |
There was a problem hiding this comment.
initializeStatsCollectors and collectColumnStats needn't be called in constructor right ?
initializeStatsCollectors should be called in init() method ?
and similarly collectColumnStats in build() method ?
There was a problem hiding this comment.
There isn't a requirement for lazy loading(initializeStatsCollectors) by invoking init separately.
Column stats are collected in build method, buildColumnar()
There was a problem hiding this comment.
Its cleaner and more easier to maintain to keep constructor light and have the methods do what they're supposed to.
There was a problem hiding this comment.
updating, I don't see any benefit in delaying the initialization by calling it immediately after object creation. These are not service start up objects but more to be used in the segment generation flow where there is not much benefit by split contructor logic into init method and calling it immediately.
...he/pinot/segment/local/segment/creator/impl/stats/ColumnarSegmentPreIndexStatsContainer.java
Show resolved
Hide resolved
| continue; | ||
| } | ||
|
|
||
| String columnName = fieldSpec.getName(); |
There was a problem hiding this comment.
we can probably abstract the common logic in SegmentPreIndexStatsCollectorImpl and this class to populate _columnStatsCollectorMap to avoid redundant code.
Also SegmentPreIndexStatsCollectorImpl has added support for NoDictColumnStatisticsCollector which needs to be added here too - https://github.com/apache/pinot/blob/master/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/SegmentPreIndexStatsCollectorImpl.java#L61
There was a problem hiding this comment.
SegmentPreIndexStatsCollectorImpl is heavily assumes row major iteration using method collectRow()
We are now using buildByColumn as default for mutable segment to immutable segment conversion and this is a new addition buildColumnar for immutable datasource to immutableSegment conversion in column major order so the above should get phased out after we are completely columnar with support for transformations.
There was a problem hiding this comment.
yes. but avoiding redundant code will be helpful to maintain the code until it exists and keep them consistent for common logic. eg: NoDictColumnStatisticsCollector was missing in this class.
There was a problem hiding this comment.
@rohityadav1993 we should do this change before merging the PR
There was a problem hiding this comment.
Aligned on reducing redundancy, extracting the common logic in a class called StatsCollectorUtils.
...he/pinot/segment/local/segment/creator/impl/stats/ColumnarSegmentPreIndexStatsContainer.java
Show resolved
Hide resolved
|
|
||
| @Override | ||
| public void logStats() { | ||
| LOGGER.info("Columnar segment stats collection completed for {} columns with {} documents", |
There was a problem hiding this comment.
we should log the full stats here. same comment as earlier to refactor common code of this class and SegmentPreIndexStatsCollectorImpl to get same benefits without duplicate code.
There was a problem hiding this comment.
column stats get persisted as segment metadata in segment file.
There was a problem hiding this comment.
segment metadata won't be available on minion hosts. and might be updated on server if the segment is refreshed multiple times. and will be unavailable if segment is deleted ,etc
for retrospective debugging in such cases, logging the stats will be helpful. its currently done in SegmentPreIndexStatsCollectorImpl
| * <li>Resource management for all created readers</li> | ||
| * </ul> | ||
| */ | ||
| public class PinotSegmentColumnReaderFactory implements ColumnReaderFactory { |
There was a problem hiding this comment.
I think this class / interface needs to be refactored a bit
- init() shouldn't take targetSchema as that can be passed in constructor itself.
- getAllColumnReaders() calls createColumnReader() and both are public which isn't ideal. Rather init() should call createColumnReader() and createColumnReader() should be private. and getAllColumnReaders() should just get existing objects without creating new objects
There was a problem hiding this comment.
init is taking schema as it is trying to standardise how factory should be initialised. One is free to use any constructor signature needed. e.g. if data source is segment, the constructor is PinotSegmentColumnReaderFactory(IndexSegment indexSegment) and in case it was parquet, it can be ParquetColumnReaderFactory(File parquetFile)
renaming getAllColumnReaders() to createAllColumnReaders()
There was a problem hiding this comment.
I don't think createAllColumnReaders() should be public.
Rather init() should call createColumnReader() and createColumnReader() should be private. and getAllColumnReaders() should just get existing objects without creating new objects
There was a problem hiding this comment.
Revisited the interface for ColumnReaderFactory. There will be no create* methods exposed. The getMethods will return cached Readers and in future if needed one can add create methods if every time a new object needs to be created from the Factory. Same documentation in the interface.
...java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java
Show resolved
Hide resolved
| @@ -209,7 +249,7 @@ public void init(SegmentGeneratorConfig config, SegmentCreationDataSource dataSo | |||
| } | |||
|
|
|||
| _ingestionSchemaValidator = | |||
| SchemaValidatorFactory.getSchemaValidator(_dataSchema, _recordReader.getClass().getName(), | |||
| SchemaValidatorFactory.getSchemaValidator(_dataSchema, readerClassName, | |||
There was a problem hiding this comment.
we're relying on getSchemaValidator to support null readerClassName which isn't ideal.
can we think of a more appropriate way to handle this ?
There was a problem hiding this comment.
I found no use of getIngestionSchemaValidator() outside test classes. There is no change being made in getSchemaValidator() which is creating a SchemaValidator
...java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java
Show resolved
Hide resolved
...st/java/org/apache/pinot/segment/local/segment/creator/impl/ColumnarSegmentBuildingTest.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/apache/pinot/segment/local/segment/readers/PinotSegmentColumnReaderImpl.java
Outdated
Show resolved
Hide resolved
...ava/org/apache/pinot/segment/local/segment/creator/impl/ColumnarSegmentBuildingTestBase.java
Show resolved
Hide resolved
...st/java/org/apache/pinot/segment/local/segment/creator/impl/ColumnarSegmentBuildingTest.java
Outdated
Show resolved
Hide resolved
...st/java/org/apache/pinot/segment/local/segment/creator/impl/ColumnarSegmentBuildingTest.java
Outdated
Show resolved
Hide resolved
...ava/org/apache/pinot/segment/local/segment/creator/impl/ColumnarSegmentBuildingTestBase.java
Show resolved
Hide resolved
...st/java/org/apache/pinot/segment/local/segment/creator/impl/ColumnarSegmentBuildingTest.java
Outdated
Show resolved
Hide resolved
...st/java/org/apache/pinot/segment/local/segment/creator/impl/ColumnarSegmentBuildingTest.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java
Show resolved
Hide resolved
7d88e80 to
03d189c
Compare
|
@noob-se7en, @KKcorps can you take a look again, all the review comments are addressed. |
...java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java
Show resolved
Hide resolved
|
@rohityadav1993 whats the plan for the open comments and failing tests and merge conflicts ? |
59b2136 to
bcf6997
Compare
|
Merge conflicts resolved and comments addressed. |
|
@Jackie-Jiang , @KKcorps please review and help merge this soon. |
|
@krishan1390 pls approve it. |
featureperformancerelease-notesFeature
Feature to build segment in columnar fashion for columnar input data sources like Pinot segment, Parquet files, etc.
PR split from previous previous abandoned PR: #16438 based on feedback.
Background
The requirement arises from issue: #16461:
In short, when minion refreshes a servers segment for schema or index config change. It rebuilds segment row wise. There are two bottlenecks observed:
pinot/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java
Line 573 in 868e2c9
pinot/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java
Line 266 in 868e2c9
Record wise generation also reprocesses transformations. This is not needed unless there is a change in transformation logic both in gather stats step and segment regeneration step. The bottlenecks can be seen in the CPU profile in the attached issue #16461
Design and Implementation
Changes
pinot/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/DataTypeTransformer.java
Line 70 in 3211115
Tests
Tests for comparing data correctness for columnar build with scnearios: no change, column addition, data type change, index update.
Performance benchmark
Script: c7e128a#diff-ee55ab7e1bad4347b27fd9721b5cf253c25aa80347147242238a88e7601abfcf
Usage