Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minion taskExecutor for RealtimeToOfflineSegments task #6050

Merged
merged 4 commits into from
Sep 30, 2020

Conversation

npawar
Copy link
Contributor

@npawar npawar commented Sep 23, 2020

#5753

The minion task executor which receives

  1. segments (from realtime tables)
  2. a time window
    And then creates Pinot segments using data from that time window.

Uses the SegmentProcessorFramework. Applies:

  1. Time column transformations as configured in PinotTaskConfig
  2. Partitioning as specified in table config
  3. Data sorting as specified in table config
  4. Aggregations across common dimension+time, as configured in the PinotTaskConfig

Next steps:
RealtimeToOfflineSegmentsTaskGenerator

@npawar npawar force-pushed the realtime_offline_minion branch from 96fa857 to 1010a8f Compare September 23, 2020 17:03
Copy link
Contributor

@mcvsubbu mcvsubbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some metrics? Number of tables/segments merged, time taken, etc. I think this needs some thought, so you can do it in a subsequent PR, but please add a TODO somewhere, or create an issue, or something to track it.

}

if (sortedColumns != null) {
for (String column : sortedColumns) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this check, given that we are pulling from tableconfig (at least in this task). If you are checking, then you should perhaps also check that this sorted column is not same as one of the aggregation columns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recent table config validation efforts showed that we hadn't been checking for validity of column names used in indexing config. I want to prevent failing in the SegmentProcessorFramework as much as possible, hence the check. If this check is not done, then we'd see failure in reduce step, after a lot of wasted computation.
I don't think we need to check for sorted column not being a metric, as nothing is really stopping a user from setting it that way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant if a column is present both as sortedColumn, but specified in aggregateConfigs above. Not sure if that case is handled correctly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a valid scenario imo

/**
* Construct a {@link RecordFilterConfig} by setting a filter function on the time column, for extracting data between window start/end
*/
private RecordFilterConfig getRecordFilterConfigForWindow(long windowStartMs, long windowEndMs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that we will drop the late data once the data from a window get moved to offline?

e.g.

  1. day1 gets moved to offline table
  2. day1 data arrived late (this row shows up in the result because of realtime table will index and serve this row)
  3. day2 gets moved to offline table (the above late data will be dropped by the filter)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this will happen. this has been called out in the design doc: https://docs.google.com/document/d/1-e_9aHQB4HXS38ONtofdxNvMsGmAoYfSnc2LP88MbIc/edit#heading=h.5lkm0pm6vp7o
We discussed that the only way to handle this is for user to set a sufficient "bufferTime".

Copy link
Contributor

@snleee snleee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@npawar npawar merged commit a0dcc66 into apache:master Sep 30, 2020
@npawar npawar deleted the realtime_offline_minion branch September 30, 2020 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants