Spark 442 #133
Conversation
The new implementation stops relying on the storageStats property, which is not recognized as a valid property when using the $collStats aggregation stage against a Data Federation endpoint. This made it impossible to use the SamplePartitioner, PaginateBySizePartitioner, and AutoBucketPartitioner with a Data Federation endpoint.

From what I could see, the storageStats property was only used to access avgObjSize, which can be computed from a collection's size and number of documents. When connected to a federated Mongo instance, stats are retrieved via the collStats command, whereas the $collStats aggregation stage is used for standard Mongo instances. This difference exists because the collStats command is faster, although it is deprecated starting from MongoDB 6.2. However, as far as I can tell, it doesn't seem to be deprecated for Data Federation.
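As an illustration of the idea (the function name and values below are hypothetical, not the connector's actual code), avgObjSize can be derived from the totals that $collStats already returns, so storageStats is not needed:

```javascript
// Hypothetical sketch: deriving avgObjSize from a collection's
// total size and document count instead of reading
// storageStats.avgObjSize directly.
function averageObjectSize(size, count) {
  // Guard against empty collections to avoid division by zero.
  return count > 0 ? size / count : 0;
}

// e.g. a 33190-byte collection holding 10 documents
console.log(averageObjectSize(33190, 10)); // 3319
```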
@guillotjulien please check out this alternative to #132

    collStats
      .getDocument("partition", new BsonDocument())
      .getNumber("size", new BsonInt32(0))))
    .first())
One issue with that approach is that when you have multiple partitions (S3 or Mongo clusters) backing your federated collection, you'll only use the info from the first partition, whereas the collStats command gives you the total size across all partitions.
Example:
[
{
ns: 'my_db.whatever',
partition: {
format: 'PARQUET',
attributes: { hash: '26146' },
size: 33190,
source: 's3://whatever-bucket/_hash=26146/part-00006-f18185ed-bbfe-46d0-9772-d5fe80b2a1e8.c000.snappy.parquet?delimiter=%2F&region=eu-west-1'
},
count: 1
},
{
ns: 'my_db.whatever',
partition: {
format: 'PARQUET',
attributes: { hash: '26131' },
size: 31093,
source: 's3://whatever-bucket/_hash=26131/part-00005-28509ae5-f592-453b-a2e2-6e64bca01f27.c000.snappy.parquet?delimiter=%2F&region=eu-west-1'
},
count: 1
},
...
]
I think this can be mitigated by summing the partition size and count across all partitions, then returning a BsonDocument built from those totals.
Thanks @guillotjulien, according to the Atlas docs:

The following example shows $collStats syntax for retrieving the total number of documents in the partitions.

    use s3Db
    db.abc.aggregate([ { $collStats: { "count": {} } } ])

Have you found that not to be the case?
@rozza it does indeed, but it gives you one document per partition as shown in the example above. So if you want the total, you'd need to compute the sum of counts of all partitions.
e.g.
[
{ $collStats: { "count" : {} } },
{ $group: { _id: null, totalCount: { $sum: "$count" }, totalSize: { $sum: "$partition.size" } } }
]
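For illustration only, the $group stage in that pipeline is roughly equivalent to the following client-side fold over the per-partition $collStats documents (values taken from the sample output earlier in the thread; the variable names are hypothetical):

```javascript
// Hypothetical client-side equivalent of the $group stage above,
// summing size and count across per-partition $collStats documents.
const partitions = [
  { partition: { size: 33190 }, count: 1 },
  { partition: { size: 31093 }, count: 1 },
];

const totals = partitions.reduce(
  (acc, doc) => ({
    totalCount: acc.totalCount + doc.count,
    totalSize: acc.totalSize + doc.partition.size,
  }),
  { totalCount: 0, totalSize: 0 }
);

console.log(totals); // { totalCount: 2, totalSize: 64283 }
```

Letting the server do this via $group avoids pulling one document per partition over the wire.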
Thanks @guillotjulien I've added that to the latest commit.
Hi @rozza, I checked locally, and everything is working as expected. So LGTM, thanks!
@guillotjulien just to let you know 10.5.0 has been released and is on Maven Central.
Thanks for the fast release @rozza!
Simplified approach to #132
Ensures the whole collection count is used when calculating the average size of documents.