Flatten DLO strategy object into separate compaction args #230

Merged 1 commit into linkedin:main on Oct 11, 2024

Conversation

@teamurko (Collaborator) commented Oct 11, 2024

Summary

Extract the compaction config from the strategy object and flatten its attributes into individual arguments, instead of passing a single JSON-serialized strategy object. Although open-source Livy passes the JSON argument to SparkSubmit correctly, our internal Livy implementation does not.
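As a rough sketch of the idea (the class and method names here are stand-ins, not the actual OpenHouse code; the field names and values mirror the DataCompactionConfig printed in the scheduler logs below), flattening turns the config object into discrete "--key value" pairs:

```java
import java.util.ArrayList;
import java.util.List;

public class FlattenCompactionArgs {
  // Stand-in for the real DataCompactionConfig; defaults taken from the log output below.
  static class DataCompactionConfig {
    long targetByteSize = 526385152L;
    double minByteSizeRatio = 0.75;
    double maxByteSizeRatio = 10.0;
    int minInputFiles = 5;
    int maxConcurrentFileGroupRewrites = 5;
    boolean partialProgressEnabled = true;
    int partialProgressMaxCommits = 1;
    long maxFileGroupSizeBytes = 107374182400L;
  }

  // Emit one "--key value" pair per config attribute instead of a single
  // JSON-serialized strategy blob.
  static List<String> toArgs(DataCompactionConfig c) {
    List<String> args = new ArrayList<>();
    args.add("--targetByteSize");
    args.add(Long.toString(c.targetByteSize));
    args.add("--minByteSizeRatio");
    args.add(Double.toString(c.minByteSizeRatio));
    args.add("--maxByteSizeRatio");
    args.add(Double.toString(c.maxByteSizeRatio));
    args.add("--minInputFiles");
    args.add(Integer.toString(c.minInputFiles));
    args.add("--maxConcurrentFileGroupRewrites");
    args.add(Integer.toString(c.maxConcurrentFileGroupRewrites));
    args.add("--partialProgressEnabled");
    args.add(Boolean.toString(c.partialProgressEnabled));
    args.add("--partialProgressMaxCommits");
    args.add(Integer.toString(c.partialProgressMaxCommits));
    args.add("--maxFileGroupSizeBytes");
    args.add(Long.toString(c.maxFileGroupSizeBytes));
    return args;
  }

  public static void main(String[] a) {
    System.out.println(String.join(" ", toArgs(new DataCompactionConfig())));
  }
}
```

Each pair survives an argument-forwarding layer that mangles a large JSON string, which is the motivation for the change.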

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Created 2 tables, generated strategies, and ran the DLO execution scheduler:

docker compose --profile with_jobs_scheduler run openhouse-jobs-scheduler - --type DATA_LAYOUT_STRATEGY_EXECUTION --cluster local --tablesURL http://openhouse-tables:8080/ --jobsURL http://openhouse-jobs:8080/ --tableMinAgeThresholdHours 0 --maxCostBudgetGbHrs 100 --maxStrategiesCount 1
...
2024-10-11 15:17:53 INFO  JobsScheduler:111 - Starting scheduler
2024-10-11 15:17:54 INFO  JobsScheduler:144 - Fetching task list based on the job type: DATA_LAYOUT_STRATEGY_EXECUTION
2024-10-11 15:17:57 INFO  OperationTasksBuilder:67 - Fetched metadata for 2 data layout strategies
2024-10-11 15:17:57 INFO  OperationTasksBuilder:82 - Max compute cost budget: 100.0, max strategies count: 1
2024-10-11 15:17:57 INFO  OperationTasksBuilder:89 - Selected 1 strategies
2024-10-11 15:17:57 INFO  OperationTasksBuilder:102 - Total estimated compute cost: 0.5000005706151327, total estimated reduced file count: 5.0
2024-10-11 15:17:57 INFO  OperationTasksBuilder:121 - Found metadata TableDataLayoutMetadata(super=TableMetadata(super=Metadata(creator=openhouse), dbName=db, tableName=test1, creationTimeMs=1728518785971, isPrimary=true, isTimePartitioned=true, isClustered=false, jobExecutionProperties={}, retentionConfig=null), dataLayoutStrategy=DataLayoutStrategy(score=9.99998858771037, entropy=2.770809503016529E17, cost=0.5000005706151327, gain=5.0, config=DataCompactionConfig(targetByteSize=526385152, minByteSizeRatio=0.75, maxByteSizeRatio=10.0, minInputFiles=5, maxConcurrentFileGroupRewrites=5, partialProgressEnabled=true, partialProgressMaxCommits=1, maxFileGroupSizeBytes=107374182400)))
2024-10-11 15:17:57 INFO  JobsScheduler:155 - Submitting and running 1 jobs based on the job type: DATA_LAYOUT_STRATEGY_EXECUTION
2024-10-11 15:17:57 INFO  OperationTask:67 - Launching job for TableDataLayoutMetadata(super=TableMetadata(super=Metadata(creator=openhouse), dbName=db, tableName=test1, creationTimeMs=1728518785971, isPrimary=true, isTimePartitioned=true, isClustered=false, jobExecutionProperties={}, retentionConfig=null), dataLayoutStrategy=DataLayoutStrategy(score=9.99998858771037, entropy=2.770809503016529E17, cost=0.5000005706151327, gain=5.0, config=DataCompactionConfig(targetByteSize=526385152, minByteSizeRatio=0.75, maxByteSizeRatio=10.0, minInputFiles=5, maxConcurrentFileGroupRewrites=5, partialProgressEnabled=true, partialProgressMaxCommits=1, maxFileGroupSizeBytes=107374182400)))
2024-10-11 15:18:02 INFO  OperationTask:93 - Launched a job with id DATA_LAYOUT_STRATEGY_EXECUTION_db_test1_7ce41985-c93b-4ede-a247-77825b7897a7 for TableDataLayoutMetadata(super=TableMetadata(super=Metadata(creator=openhouse), dbName=db, tableName=test1, creationTimeMs=1728518785971, isPrimary=true, isTimePartitioned=true, isClustered=false, jobExecutionProperties={}, retentionConfig=null), dataLayoutStrategy=DataLayoutStrategy(score=9.99998858771037, entropy=2.770809503016529E17, cost=0.5000005706151327, gain=5.0, config=DataCompactionConfig(targetByteSize=526385152, minByteSizeRatio=0.75, maxByteSizeRatio=10.0, minInputFiles=5, maxConcurrentFileGroupRewrites=5, partialProgressEnabled=true, partialProgressMaxCommits=1, maxFileGroupSizeBytes=107374182400)))
...

Check the Spark app logs:

...
2024-10-11 08:18:18 2024-10-11 15:18:18,606 INFO utils.LineBufferedStream: 2024-10-11 15:18:18,604 INFO spark.BaseSparkApp: Session created
2024-10-11 08:18:18 2024-10-11 15:18:18,606 INFO utils.LineBufferedStream: 2024-10-11 15:18:18,606 INFO spark.BaseSparkApp: onStarted
2024-10-11 08:18:19 2024-10-11 15:18:19,952 INFO utils.LineBufferedStream: 2024-10-11 15:18:19,951 INFO spark.DataCompactionSparkApp: Rewrite data files app start for table db.test1, config DataCompactionConfig(targetByteSize=526385152, minByteSizeRatio=0.75, maxByteSizeRatio=10.0, minInputFiles=5, maxConcurrentFileGroupRewrites=5, partialProgressEnabled=true, partialProgressMaxCommits=1, maxFileGroupSizeBytes=107374182400)
...
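The Spark app rebuilds the same DataCompactionConfig from the flat arguments on the receiving side. A minimal sketch of that direction (the parser and class names are hypothetical, not the actual DataCompactionSparkApp code):

```java
import java.util.HashMap;
import java.util.Map;

public class ParseCompactionArgs {
  // Collect "--key value" pairs into a map; a real app would then convert
  // each entry to the typed DataCompactionConfig field it corresponds to.
  static Map<String, String> parse(String[] args) {
    Map<String, String> out = new HashMap<>();
    for (int i = 0; i + 1 < args.length; i += 2) {
      if (args[i].startsWith("--")) {
        out.put(args[i].substring(2), args[i + 1]);
      }
    }
    return out;
  }

  public static void main(String[] a) {
    String[] args = {"--targetByteSize", "526385152", "--minInputFiles", "5"};
    Map<String, String> m = parse(args);
    System.out.println(m.get("targetByteSize") + " " + m.get("minInputFiles"));
  }
}
```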

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

@teamurko teamurko merged commit 02dc5a6 into linkedin:main Oct 11, 2024
1 check passed