Spark 442 #133
Conversation
The new implementation stops relying on the storageStats property, which is not recognized as a valid property when using the $collStats aggregation stage against a Data Federation endpoint. This made it impossible to use the SamplePartitioner, PaginateBySizePartitioner, and AutoBucketPartitioner with a Data Federation endpoint.

From what I could see, the storageStats property was only used to access avgObjSize, which can be computed from a collection's size and number of documents. When connected to a federated Mongo instance, stats are retrieved via the collStats command, whereas the $collStats aggregation stage is used for standard Mongo instances. This difference exists because the collStats command is faster, although it is deprecated starting from MongoDB 6.2. However, as far as I can tell, it doesn't seem to be deprecated for Data Federation.
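As an illustration of the idea (the function name and values below are hypothetical, not the connector's actual code), avgObjSize can be derived from the totals that $collStats already returns, so storageStats is not needed:

```javascript
// Hypothetical sketch: deriving avgObjSize from a collection's
// total size and document count instead of reading
// storageStats.avgObjSize directly.
function averageObjectSize(size, count) {
  // Guard against empty collections to avoid division by zero.
  return count > 0 ? size / count : 0;
}

// e.g. a 33190-byte collection holding 10 documents
console.log(averageObjectSize(33190, 10)); // 3319
```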
@guillotjulien please check out this alternative to #132

    collStats
      .getDocument("partition", new BsonDocument())
      .getNumber("size", new BsonInt32(0))))
    .first())
One issue with that approach is that when you have multiple partitions (S3 or Mongo clusters) backing your federated collection, you'll only use the info from the first partition, whereas the collStats command gives you the total size across all partitions.
Example:
[
{
ns: 'my_db.whatever',
partition: {
format: 'PARQUET',
attributes: { hash: '26146' },
size: 33190,
source: 's3://whatever-bucket/_hash=26146/part-00006-f18185ed-bbfe-46d0-9772-d5fe80b2a1e8.c000.snappy.parquet?delimiter=%2F&region=eu-west-1'
},
count: 1
},
{
ns: 'my_db.whatever',
partition: {
format: 'PARQUET',
attributes: { hash: '26131' },
size: 31093,
source: 's3://whatever-bucket/_hash=26131/part-00005-28509ae5-f592-453b-a2e2-6e64bca01f27.c000.snappy.parquet?delimiter=%2F&region=eu-west-1'
},
count: 1
},
...
]
I think this can be mitigated by summing the partition size and count across all partitions, then returning a BsonDocument built from those totals.
Thanks @guillotjulien, according to the Atlas docs:

The following example shows $collStats syntax for retrieving the total number of documents in the partitions.

    use s3Db
    db.abc.aggregate([ { $collStats: { "count": {} } } ])

Have you found that not to be the case?
@rozza it does indeed, but it gives you one document per partition as shown in the example above. So if you want the total, you'd need to compute the sum of counts of all partitions.
e.g.
[
{ $collStats: { "count" : {} } },
{ $group: { _id: null, totalCount: { $sum: "$count" }, totalSize: { $sum: "$partition.size" } } }
]
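For illustration only, the $group stage in that pipeline is roughly equivalent to the following client-side fold over the per-partition $collStats documents (values taken from the sample output earlier in the thread; the variable names are hypothetical):

```javascript
// Hypothetical client-side equivalent of the $group stage above,
// summing size and count across per-partition $collStats documents.
const partitions = [
  { partition: { size: 33190 }, count: 1 },
  { partition: { size: 31093 }, count: 1 },
];

const totals = partitions.reduce(
  (acc, doc) => ({
    totalCount: acc.totalCount + doc.count,
    totalSize: acc.totalSize + doc.partition.size,
  }),
  { totalCount: 0, totalSize: 0 }
);

console.log(totals); // { totalCount: 2, totalSize: 64283 }
```

Letting the server do this via $group avoids pulling one document per partition over the wire.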
Thanks @guillotjulien I've added that to the latest commit.
Hi @rozza, I checked locally, and everything is working as expected. So LGTM, thanks!
@guillotjulien just to let you know 10.5.0 has been released and is on Maven Central.
Thanks for the fast release @rozza!
Simplified approach to #132
Ensures the whole collection count is used when calculating the average size of documents.