Skip to content

Conversation

rozza
Copy link
Member

@rozza rozza commented Apr 24, 2025

Simplified approach to #132

Ensures the whole collection count is used when calculating the average size of documents.

guillotjulien and others added 2 commits April 23, 2025 13:24
…a Federation

The new implementation stops relying on the storageStats property not being recognized as a valid property when using the $collStats aggregation operation when using a Data Federation endpoint.
This end up making it impossible to use the SamplePartitioner, PaginateBySizePartitioner and AutoBucketPartitioner when using a Data Federation endpoint.

From what I could see, the storageStats property was only used to access avgObjSize, which can be computed from the size and number of documents of a collection.

When connected to a federated Mongo instance, stats are retrieved via the collStats command, whereas the $collStats aggregation operator is used for standard Mongo instances.
This difference is due to the collStats command being faster, but deprecated starting from Mongo 6.2. However it doesn't seem to be deprecated for Data Federation as far as I can tell.
@rozza
Copy link
Member Author

rozza commented Apr 24, 2025

@guillotjulien please check out this alternative to #132

collStats
.getDocument("partition", new BsonDocument())
.getNumber("size", new BsonInt32(0))))
.first())
Copy link
Contributor

@guillotjulien guillotjulien Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue with that approach is that when you have multiple partitions (S3 or Mongo clusters) backing up your federated collection, you'll only use the info from the first partition. Whereas the collStats command gives you the total size of all partitions.

Example:

[
  {
    ns: 'my_db.whatever',
    partition: {
      format: 'PARQUET',
      attributes: { hash: '26146' },
      size: 33190,
      source: 's3://whatever-bucket/_hash=26146/part-00006-f18185ed-bbfe-46d0-9772-d5fe80b2a1e8.c000.snappy.parquet?delimiter=%2F&region=eu-west-1'
    },
    count: 1
  },
  {
    ns: 'my_db.whatever',
    partition: {
      format: 'PARQUET',
      attributes: { hash: '26131' },
      size: 31093,
      source: 's3://whatever-bucket/_hash=26131/part-00005-28509ae5-f592-453b-a2e2-6e64bca01f27.c000.snappy.parquet?delimiter=%2F&region=eu-west-1'
    },
    count: 1
  },
  ...
]

I think this can be mitigated by summing the partition size and count of all partitions, then returning a BsonDocument from that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @guillotjulien - according to the atlas docs

The following example shows $collStats syntax for retrieving the total number of documents in the partitions.

use s3Db
db.abc.aggregate([ {$collStats: {"count" : {} }} ])

Have you found that is not the case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rozza it does indeed, but it gives you one document per partition as shown in the example above. So if you want the total, you'd need to compute the sum of counts of all partitions.

e.g.

[
  { $collStats: { "count" : {} } },
  { $group: { _id: null, totalCount: { $sum: "$count" }, totalSize: { $sum: "$partition.size" } }  }
]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @guillotjulien I've added that to the latest commit.

@guillotjulien
Copy link
Contributor

Hi @rozza, I checked locally, and everything is working as expected. So LGTM thanks!

@rozza rozza merged commit f9fbb5e into mongodb:main Apr 30, 2025
24 checks passed
@rozza rozza deleted the SPARK-442 branch April 30, 2025 08:17
@rozza
Copy link
Member Author

rozza commented May 8, 2025

@guillotjulien just to let you know 10.5.0 has been released and is on Maven central.

@guillotjulien
Copy link
Contributor

Thanks for the fast release @rozza!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants