Conversation

rozza (Member) commented Jun 26, 2024

A `$sample`-based partitioner that provides support for all collection types. It supports partitioning across single or multiple fields, including nested fields.

The logic for the partitioner is as follows:

  • Calculates the number of documents per partition. Runs a `$collStats` aggregation to get the average document size.
  • Determines the total count of documents. Uses the `$collStats` count, or runs a `countDocuments` query if the user supplies their own `aggregation.pipeline` configuration.
  • Determines the number of partitions. Calculated as: `count / number of documents per partition`.
  • Determines the number of documents to `$sample`. Calculated as: `samples per partition * number of partitions`.
  • Creates the aggregation pipeline to generate the partitions.
    ```
    [{$match: <the $match stage of the user's aggregation pipeline - iff the first stage is a $match>},
     {$sample: <number of documents to $sample>},
     {$addFields: {<partition key projection field>: {<'i': '$fieldList[i]' ...>}}}, // Only added iff fieldList.size() > 1
     {$bucketAuto: {
            groupBy: <partition key projection field>,
            buckets: <number of partitions>
        }
     }
    ]
    ```
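
    For illustration, with `fieldList = "a,b"`, no user-supplied pipeline, 31 partitions, and a sample size of 310, the generated pipeline might look like the following (the field names, numbers, and exact index-key shape are hypothetical):

    ```
    [{$sample: {size: 310}},
     {$addFields: {__idx: {'0': '$a', '1': '$b'}}},
     {$bucketAuto: {groupBy: '$__idx', buckets: 31}}
    ]
    ```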
    

Configurations:

  • `fieldList`: The field list to be used for partitioning.
    Either a single field name or a comma-separated list of fields.
    Defaults to: "_id".
  • `chunkSize`: The average size (MB) for each partition.
    Note: Uses the average document size to determine the number of documents per partition, so
    partitions may not be even.
    Defaults to: 64.
  • `samplesPerPartition`: The number of samples to take per partition.
    Defaults to: 10.
  • `partitionKeyProjectionField`: The field name to use for a projected field that contains all the
    fields used to partition the collection.
    Defaults to: "__idx".
    Only recommended to be changed if the documents already contain a "__idx" field.
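
With the default `chunkSize` (64 MB) and `samplesPerPartition` (10), the sizing arithmetic can be sketched as follows. This is a minimal sketch for illustration: the class, helper names, and round-up behaviour are assumptions, not the connector's actual code.

```java
// Hypothetical sketch of the partition sizing arithmetic; the rounding
// choices are assumptions, not the connector's actual implementation.
public final class PartitionMath {

    // chunkSize (MB) divided by the average document size gives the
    // target number of documents per partition.
    static long docsPerPartition(long chunkSizeMB, long avgDocSizeBytes) {
        return Math.max(1, (chunkSizeMB * 1024 * 1024) / avgDocSizeBytes);
    }

    // count / documents-per-partition, rounded up so any remainder of
    // documents still gets its own partition (round-up is an assumption).
    static long numberOfPartitions(long count, long docsPerPartition) {
        return (count + docsPerPartition - 1) / docsPerPartition;
    }

    public static void main(String[] args) {
        long perPartition = docsPerPartition(64, 2048);   // 64 MiB / 2 KiB = 32768 docs
        long partitions = numberOfPartitions(1_000_000, perPartition); // 31 partitions
        long sampleSize = 10 * partitions;                // samplesPerPartition * partitions
        System.out.println(perPartition + " " + partitions + " " + sampleSize);
    }
}
```

So a 1M-document collection averaging 2 KiB per document would be read as roughly 31 partitions, derived from a single 310-document `$sample`.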

Partitions are calculated as logical ranges. When using sharded clusters, these will map closely to ranged chunks.
When used with hashed shard keys, these logical ranges require broadcast operations.

Similar to the `SamplePartitioner`, but uses the `$bucketAuto` aggregation stage to generate the partition bounds.
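
For reference, the partitioner would typically be selected via the connector's partitioner options. The option keys and fully-qualified class name below follow the connector's existing `partitioner.options.*` convention but are illustrative and not verified against the final implementation:

```
# Hypothetical configuration sketch - option keys and class name are
# illustrative and may not match the released connector exactly.
spark.mongodb.read.partitioner=com.mongodb.spark.sql.connector.read.partitioner.AutoBucketPartitioner
spark.mongodb.read.partitioner.options.fieldList=_id
spark.mongodb.read.partitioner.options.chunkSize=64
spark.mongodb.read.partitioner.options.samplesPerPartition=10
spark.mongodb.read.partitioner.options.partitionKeyProjectionField=__idx
```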

SPARK-356

@rozza rozza marked this pull request as ready for review June 27, 2024 12:26
@rozza rozza requested a review from stIncMale June 27, 2024 12:26
rozza and others added 11 commits July 9, 2024 09:50
…/AutoBucketPartitioner.java

Co-authored-by: Valentin Kovalenko <valentin.male.kovalenko@gmail.com>
@rozza rozza requested a review from stIncMale July 9, 2024 11:16
@rozza rozza requested a review from stIncMale July 10, 2024 08:16
@rozza rozza merged commit 9105cf4 into mongodb:main Jul 11, 2024