Allow configuring DLO ranking params (#228)
## Summary

Support 2 additional optional arguments that configure DLO strategies ranking.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

## Testing Done

<!--- Check any relevant boxes with "x" -->

- [x] Manually Tested on local docker setup. Please include commands ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in production. Please explain.

Create 2 test tables and generate strategies:

```
scala> spark.sql("show tblproperties openhouse.db.test1 ('write.data-layout.strategies')").show(2000, false)
+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|key |value |
+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|write.data-layout.strategies|[{"score":9.99998858771037,"entropy":2.77080950301652896E17,"cost":0.5000005706151327,"gain":5.0,"config":{"targetByteSize":526385152,"minByteSizeRatio":0.75,"maxByteSizeRatio":10.0,"minInputFiles":5,"maxConcurrentFileGroupRewrites":5,"partialProgressEnabled":true,"partialProgressMaxCommits":1,"maxFileGroupSizeBytes":107374182400}}]|
+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

scala> spark.sql("show tblproperties openhouse.db.test2 ('write.data-layout.strategies')").show(2000, false)
+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|key |value |
+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|write.data-layout.strategies|[{"score":9.99998858771037,"entropy":2.77080950301652896E17,"cost":0.5000005706151327,"gain":5.0,"config":{"targetByteSize":526385152,"minByteSizeRatio":0.75,"maxByteSizeRatio":10.0,"minInputFiles":5,"maxConcurrentFileGroupRewrites":5,"partialProgressEnabled":true,"partialProgressMaxCommits":1,"maxFileGroupSizeBytes":107374182400}}]|
+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

Run DLO execution and check that only 1 strategy is picked:

```
24/10/9 7:49:24@oh-hadoop-spark:~$ docker compose --profile with_jobs_scheduler run openhouse-jobs-scheduler - --type DATA_LAYOUT_STRATEGY_EXECUTION --cluster local --tablesURL http://openhouse-tables:8080/ --jobsURL http://openhouse-jobs:8080/ --tableMinAgeThresholdHours 0 --maxCostBudgetGbHrs 100 --maxStrategiesCount 1
2024-10-10 02:50:26 INFO JobsScheduler:111 - Starting scheduler
2024-10-10 02:50:26 INFO WebClientFactory:121 - Using connection pool strategy
2024-10-10 02:50:26 INFO WebClientFactory:218 - Creating custom connection provider
2024-10-10 02:50:27 INFO WebClientFactory:196 - Client session id: 97f5e44e-16ee-4b82-a84d-382b24b13415
2024-10-10 02:50:27 INFO WebClientFactory:209 - Client name: null
2024-10-10 02:50:27 INFO JobsScheduler:144 - Fetching task list based on the job type: DATA_LAYOUT_STRATEGY_EXECUTION
2024-10-10 02:50:28 INFO OperationTasksBuilder:67 - Fetched metadata for 2 data layout strategies
2024-10-10 02:50:28 INFO OperationTasksBuilder:82 - Max compute cost budget: 100.0, max strategies count: 1
2024-10-10 02:50:28 INFO OperationTasksBuilder:89 - Selected 1 strategies
2024-10-10 02:50:28 INFO OperationTasksBuilder:102 - Total estimated compute cost: 0.5000005706151327, total estimated reduced file count: 5.0
2024-10-10 02:50:28 INFO OperationTasksBuilder:121 - Found metadata TableDataLayoutMetadata(super=TableMetadata(super=Metadata(creator=openhouse), dbName=db, tableName=test1, creationTimeMs=1728518785971, isPrimary=true, isTimePartitioned=true, isClustered=false, jobExecutionProperties={}, retentionConfig=null), dataLayoutStrategy=DataLayoutStrategy(score=9.99998858771037, entropy=2.770809503016529E17, cost=0.5000005706151327, gain=5.0, config=DataCompactionConfig(targetByteSize=526385152, minByteSizeRatio=0.75, maxByteSizeRatio=10.0, minInputFiles=5, maxConcurrentFileGroupRewrites=5, partialProgressEnabled=true, partialProgressMaxCommits=1, maxFileGroupSizeBytes=107374182400)))
2024-10-10 02:50:28 INFO JobsScheduler:155 - Submitting and running 1 jobs based on the job type: DATA_LAYOUT_STRATEGY_EXECUTION
2024-10-10 02:50:28 INFO OperationTask:67 - Launching job for TableDataLayoutMetadata(super=TableMetadata(super=Metadata(creator=openhouse), dbName=db, tableName=test1, creationTimeMs=1728518785971, isPrimary=true, isTimePartitioned=true, isClustered=false, jobExecutionProperties={}, retentionConfig=null), dataLayoutStrategy=DataLayoutStrategy(score=9.99998858771037, entropy=2.770809503016529E17, cost=0.5000005706151327, gain=5.0, config=DataCompactionConfig(targetByteSize=526385152, minByteSizeRatio=0.75, maxByteSizeRatio=10.0, minInputFiles=5, maxConcurrentFileGroupRewrites=5, partialProgressEnabled=true, partialProgressMaxCommits=1, maxFileGroupSizeBytes=107374182400)))
2024-10-10 02:50:29 INFO OperationTask:93 - Launched a job with id DATA_LAYOUT_STRATEGY_EXECUTION_db_test1_65994c2f-6276-4dba-a05b-6dd6f78009b6 for TableDataLayoutMetadata(super=TableMetadata(super=Metadata(creator=openhouse), dbName=db, tableName=test1, creationTimeMs=1728518785971, isPrimary=true, isTimePartitioned=true, isClustered=false, jobExecutionProperties={}, retentionConfig=null), dataLayoutStrategy=DataLayoutStrategy(score=9.99998858771037, entropy=2.770809503016529E17, cost=0.5000005706151327, gain=5.0, config=DataCompactionConfig(targetByteSize=526385152, minByteSizeRatio=0.75, maxByteSizeRatio=10.0, minInputFiles=5, maxConcurrentFileGroupRewrites=5, partialProgressEnabled=true, partialProgressMaxCommits=1, maxFileGroupSizeBytes=107374182400)))
...
2024-10-10 02:55:29 INFO OperationTask:139 - Finished job for entity TableDataLayoutMetadata(super=TableMetadata(super=Metadata(creator=openhouse), dbName=db, tableName=test1, creationTimeMs=1728518785971, isPrimary=true, isTimePartitioned=true, isClustered=false, jobExecutionProperties={}, retentionConfig=null), dataLayoutStrategy=DataLayoutStrategy(score=9.99998858771037, entropy=2.770809503016529E17, cost=0.5000005706151327, gain=5.0, config=DataCompactionConfig(targetByteSize=526385152, minByteSizeRatio=0.75, maxByteSizeRatio=10.0, minInputFiles=5, maxConcurrentFileGroupRewrites=5, partialProgressEnabled=true, partialProgressMaxCommits=1, maxFileGroupSizeBytes=107374182400))): JobId DATA_LAYOUT_STRATEGY_EXECUTION_db_test1_65994c2f-6276-4dba-a05b-6dd6f78009b6, executionId 4, runTime 20517, queuedTime 11980, state SUCCEEDED
2024-10-10 02:55:29 INFO JobsScheduler:198 - Finishing scheduler for job type DATA_LAYOUT_STRATEGY_EXECUTION, tasks stats: 1 created, 1 succeeded, 0 cancelled (timeout), 0 failed, 0 skipped (no state)
```

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.
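The selection behavior visible in the logs above (2 strategies fetched, 1 selected under `--maxCostBudgetGbHrs 100` and `--maxStrategiesCount 1`) can be sketched as a greedy pick over score-ranked strategies. This is an illustrative sketch only; the class and method names below are hypothetical and do not reflect the actual `OperationTasksBuilder` implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: rank candidate strategies by score (descending),
// then greedily select while staying within the compute cost budget and
// the maximum strategy count.
public class StrategySelector {
  record Strategy(String table, double score, double costGbHrs) {}

  static List<Strategy> select(List<Strategy> candidates,
                               double maxCostBudgetGbHrs,
                               int maxStrategiesCount) {
    List<Strategy> ranked = new ArrayList<>(candidates);
    ranked.sort(Comparator.comparingDouble(Strategy::score).reversed());

    List<Strategy> selected = new ArrayList<>();
    double spent = 0.0;
    for (Strategy s : ranked) {
      if (selected.size() >= maxStrategiesCount) {
        break; // count cap reached
      }
      if (spent + s.costGbHrs() > maxCostBudgetGbHrs) {
        continue; // skip strategies that would exceed the budget
      }
      selected.add(s);
      spent += s.costGbHrs();
    }
    return selected;
  }

  public static void main(String[] args) {
    // Two candidates, mirroring the two test tables; cap at 1 strategy.
    List<Strategy> candidates = List.of(
        new Strategy("db.test1", 9.99998858771037, 0.5000005706151327),
        new Strategy("db.test2", 9.99998858771037, 0.5000005706151327));
    List<Strategy> picked = select(candidates, 100.0, 1);
    System.out.println("Selected " + picked.size() + " strategies");
  }
}
```

With a budget of 100 GB-hrs and a count cap of 1, only the top-ranked strategy is chosen, matching the `Selected 1 strategies` log line.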