Flatten DLO strategy object into separate compaction args #230

Merged 1 commit into linkedin:main on Oct 11, 2024

Conversation

@teamurko (Collaborator) commented Oct 11, 2024

Summary

Extract the compaction config from the strategy object and flatten its attributes into individual arguments, instead of passing a single JSON-serialized strategy object. Although open-source Livy passes the JSON argument to SparkSubmit correctly, our internal Livy implementation does not.
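As a rough sketch of the idea (the class and method names here are stand-ins, not the actual OpenHouse code; the field names and values mirror the DataCompactionConfig printed in the scheduler logs below), flattening turns the config object into discrete "--key value" pairs:

```java
import java.util.ArrayList;
import java.util.List;

public class FlattenCompactionArgs {
  // Stand-in for the real DataCompactionConfig; defaults taken from the log output below.
  static class DataCompactionConfig {
    long targetByteSize = 526385152L;
    double minByteSizeRatio = 0.75;
    double maxByteSizeRatio = 10.0;
    int minInputFiles = 5;
    int maxConcurrentFileGroupRewrites = 5;
    boolean partialProgressEnabled = true;
    int partialProgressMaxCommits = 1;
    long maxFileGroupSizeBytes = 107374182400L;
  }

  // Emit one "--key value" pair per config attribute instead of a single
  // JSON-serialized strategy blob.
  static List<String> toArgs(DataCompactionConfig c) {
    List<String> args = new ArrayList<>();
    args.add("--targetByteSize");
    args.add(Long.toString(c.targetByteSize));
    args.add("--minByteSizeRatio");
    args.add(Double.toString(c.minByteSizeRatio));
    args.add("--maxByteSizeRatio");
    args.add(Double.toString(c.maxByteSizeRatio));
    args.add("--minInputFiles");
    args.add(Integer.toString(c.minInputFiles));
    args.add("--maxConcurrentFileGroupRewrites");
    args.add(Integer.toString(c.maxConcurrentFileGroupRewrites));
    args.add("--partialProgressEnabled");
    args.add(Boolean.toString(c.partialProgressEnabled));
    args.add("--partialProgressMaxCommits");
    args.add(Integer.toString(c.partialProgressMaxCommits));
    args.add("--maxFileGroupSizeBytes");
    args.add(Long.toString(c.maxFileGroupSizeBytes));
    return args;
  }

  public static void main(String[] a) {
    System.out.println(String.join(" ", toArgs(new DataCompactionConfig())));
  }
}
```

Each pair survives an argument-forwarding layer that mangles a large JSON string, which is the motivation for the change.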

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Created 2 tables, generated strategies, and ran the DLO execution scheduler:

docker compose --profile with_jobs_scheduler run openhouse-jobs-scheduler - --type DATA_LAYOUT_STRATEGY_EXECUTION --cluster local --tablesURL http://openhouse-tables:8080/ --jobsURL http://openhouse-jobs:8080/ --tableMinAgeThresholdHours 0 --maxCostBudgetGbHrs 100 --maxStrategiesCount 1
...
2024-10-11 15:17:53 INFO  JobsScheduler:111 - Starting scheduler
2024-10-11 15:17:54 INFO  JobsScheduler:144 - Fetching task list based on the job type: DATA_LAYOUT_STRATEGY_EXECUTION
2024-10-11 15:17:57 INFO  OperationTasksBuilder:67 - Fetched metadata for 2 data layout strategies
2024-10-11 15:17:57 INFO  OperationTasksBuilder:82 - Max compute cost budget: 100.0, max strategies count: 1
2024-10-11 15:17:57 INFO  OperationTasksBuilder:89 - Selected 1 strategies
2024-10-11 15:17:57 INFO  OperationTasksBuilder:102 - Total estimated compute cost: 0.5000005706151327, total estimated reduced file count: 5.0
2024-10-11 15:17:57 INFO  OperationTasksBuilder:121 - Found metadata TableDataLayoutMetadata(super=TableMetadata(super=Metadata(creator=openhouse), dbName=db, tableName=test1, creationTimeMs=1728518785971, isPrimary=true, isTimePartitioned=true, isClustered=false, jobExecutionProperties={}, retentionConfig=null), dataLayoutStrategy=DataLayoutStrategy(score=9.99998858771037, entropy=2.770809503016529E17, cost=0.5000005706151327, gain=5.0, config=DataCompactionConfig(targetByteSize=526385152, minByteSizeRatio=0.75, maxByteSizeRatio=10.0, minInputFiles=5, maxConcurrentFileGroupRewrites=5, partialProgressEnabled=true, partialProgressMaxCommits=1, maxFileGroupSizeBytes=107374182400)))
2024-10-11 15:17:57 INFO  JobsScheduler:155 - Submitting and running 1 jobs based on the job type: DATA_LAYOUT_STRATEGY_EXECUTION
2024-10-11 15:17:57 INFO  OperationTask:67 - Launching job for TableDataLayoutMetadata(super=TableMetadata(super=Metadata(creator=openhouse), dbName=db, tableName=test1, creationTimeMs=1728518785971, isPrimary=true, isTimePartitioned=true, isClustered=false, jobExecutionProperties={}, retentionConfig=null), dataLayoutStrategy=DataLayoutStrategy(score=9.99998858771037, entropy=2.770809503016529E17, cost=0.5000005706151327, gain=5.0, config=DataCompactionConfig(targetByteSize=526385152, minByteSizeRatio=0.75, maxByteSizeRatio=10.0, minInputFiles=5, maxConcurrentFileGroupRewrites=5, partialProgressEnabled=true, partialProgressMaxCommits=1, maxFileGroupSizeBytes=107374182400)))
2024-10-11 15:18:02 INFO  OperationTask:93 - Launched a job with id DATA_LAYOUT_STRATEGY_EXECUTION_db_test1_7ce41985-c93b-4ede-a247-77825b7897a7 for TableDataLayoutMetadata(super=TableMetadata(super=Metadata(creator=openhouse), dbName=db, tableName=test1, creationTimeMs=1728518785971, isPrimary=true, isTimePartitioned=true, isClustered=false, jobExecutionProperties={}, retentionConfig=null), dataLayoutStrategy=DataLayoutStrategy(score=9.99998858771037, entropy=2.770809503016529E17, cost=0.5000005706151327, gain=5.0, config=DataCompactionConfig(targetByteSize=526385152, minByteSizeRatio=0.75, maxByteSizeRatio=10.0, minInputFiles=5, maxConcurrentFileGroupRewrites=5, partialProgressEnabled=true, partialProgressMaxCommits=1, maxFileGroupSizeBytes=107374182400)))
...

Check the Spark app logs:

...
2024-10-11 08:18:18 2024-10-11 15:18:18,606 INFO utils.LineBufferedStream: 2024-10-11 15:18:18,604 INFO spark.BaseSparkApp: Session created
2024-10-11 08:18:18 2024-10-11 15:18:18,606 INFO utils.LineBufferedStream: 2024-10-11 15:18:18,606 INFO spark.BaseSparkApp: onStarted
2024-10-11 08:18:19 2024-10-11 15:18:19,952 INFO utils.LineBufferedStream: 2024-10-11 15:18:19,951 INFO spark.DataCompactionSparkApp: Rewrite data files app start for table db.test1, config DataCompactionConfig(targetByteSize=526385152, minByteSizeRatio=0.75, maxByteSizeRatio=10.0, minInputFiles=5, maxConcurrentFileGroupRewrites=5, partialProgressEnabled=true, partialProgressMaxCommits=1, maxFileGroupSizeBytes=107374182400)
...
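The Spark app rebuilds the same DataCompactionConfig from the flat arguments on the receiving side. A minimal sketch of that direction (the parser and class names are hypothetical, not the actual DataCompactionSparkApp code):

```java
import java.util.HashMap;
import java.util.Map;

public class ParseCompactionArgs {
  // Collect "--key value" pairs into a map; a real app would then convert
  // each entry to the typed DataCompactionConfig field it corresponds to.
  static Map<String, String> parse(String[] args) {
    Map<String, String> out = new HashMap<>();
    for (int i = 0; i + 1 < args.length; i += 2) {
      if (args[i].startsWith("--")) {
        out.put(args[i].substring(2), args[i + 1]);
      }
    }
    return out;
  }

  public static void main(String[] a) {
    String[] args = {"--targetByteSize", "526385152", "--minInputFiles", "5"};
    Map<String, String> m = parse(args);
    System.out.println(m.get("targetByteSize") + " " + m.get("minInputFiles"));
  }
}
```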

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

@teamurko teamurko merged commit 02dc5a6 into linkedin:main Oct 11, 2024
1 check passed