Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

List of partitioners in SegmentProcessorFramework #6021

Merged
merged 1 commit into from
Sep 17, 2020

Conversation

npawar
Copy link
Contributor

@npawar npawar commented Sep 15, 2020

#5753
Changing SegmentPrpcessorFramework config to take List of Partitioners. This is to account for table config's partitioning, which might need to be applied regardless of other partitioning configured. For example, we want to partition by date so as to align segments by time, but we also want to apply Murmur on some id column each to further partition and record in segment metadata.

for (Partitioner partitioner : _partitioners) {
partitions[p++] = partitioner.getPartition(reusableRow);
}
String partition = StringUtil.join("_", partitions);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "_" here is very significant, right? It cannot be changed, and has to be used the same way across multiple components. Could you please declare it as. a final string in some Constants class as a partition separator or something?

And then re-use in tests

thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the framework relies on this to extract the partition info. But agree on introducing a constant for it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it is not significant at all. It can be changed, and is not used by any other components. It won't even matter beyond the scope of that joiner line. And hence I don't think it needs to be scoped out of this class, or even out of this method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are partitioning the data, and the brokers have to construct the same partition id in the same order of columns and with the same partition function right?

Also, what is the use case for partitioning on more than one column?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use case: say data in input segments is spread across 3 days. In the resulting segments, we want to create a segment for each day. Additionally, we want partitioning on some id column for query purposes.

Partitioning by time column is first step. This doesn't affect segment metadata or broker routing. This is simply used by the framework, and it's scope ends with the framework. It's merely helping create date aligned input files for Segment generation stage.
Partitioning by id column is second step. This is for queries. This will be whatever is in the table config. Only this partition will get set in the segment metadata. And even that will happen during segment creation.
See this comment and discussion:#5934 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am missing something, I will try to find it in the design document.

  1. It seems to me that there may be at most 2 partitioners, so is making that a Pair better? Or, there should be a comment and an assert some place that the size is two.
  2. The word partition confused me into thinking that we are somehow respecting partitioning of data using a partition key with underscores (with brokers constructing some partition keys). That is clearly not the case here. Maybe rename this as a splitKey instead of partition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Practically, for the use case I described, it will be 2. But it need not be (there could be more custom logic). Also the json config spec has List of partitions, so I just continued it as List.
All these things are not set in stone as of now. We will be continuosly re-evaluating, optimizing and editing this framework, as we begin using it (for minion, and merge). It is difficult to predict right now and I prefer to not introduce restrictions on number of partitioners.

@npawar npawar merged commit cef2740 into apache:master Sep 17, 2020
@npawar npawar deleted the seg_processor_enhancements branch September 17, 2020 21:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants