-
Notifications
You must be signed in to change notification settings - Fork 81
fix(clp-package): Start garbage collector as part of the "all" and "controller" targets; Validate retention periods aren't configured when using the Presto query engine. #1205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughRenames garbage-collector constant to GARBAGE_COLLECTOR_COMPONENT_NAME, adds STORAGE_MANAGEMENT_COMPONENTS and includes it in ALL_COMPONENTS and TARGET_TO_COMPONENTS, adds validate_retention_config (blocks retention when query_engine is PRESTO), updates start/stop scripts and s3_utils to use the new name, and refactors garbage-collector task startup and monitoring. Changes
Sequence Diagram(s)sequenceDiagram
participant CLI
participant ConfigLoader
participant Validator
participant Resolver
participant Launcher
participant GC as GarbageCollector
CLI->>ConfigLoader: read config & target
ConfigLoader->>Validator: validate_dataset_name(clp_config)
alt target ∈ {ALL, CONTROLLER, GARBAGE_COLLECTOR}
ConfigLoader->>Validator: validate_retention_config(clp_config)
Validator-->>ConfigLoader: ok / raises ValueError (if PRESTO + retention)
end
ConfigLoader->>Resolver: resolve components_to_start(target, flags)
Resolver-->>Launcher: components_to_start
alt GARBAGE_COLLECTOR_COMPONENT_NAME in components_to_start
Launcher->>GC: start tasks (based on retention_periods)
GC->>Launcher: report task status / errors
else
Launcher-->>GC: do not start
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
@davemarco Can you please review this? |
do we want this with presto? my instinct is yes so we should add to presto components as well |
Thanks for raising this point. Updated the code to add the component to presto |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
711-718
: Type annotation bug: reducer field annotated as an instance, not a typeThe field is declared as
reducer: Reducer() = Reducer()
. The annotation must be the classReducer
, not an instance. This will break type checking and may confuse pydantic.Apply:
- reducer: Reducer() = Reducer() + reducer: Reducer = Reducer()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these settings in your CodeRabbit configuration.
📒 Files selected for processing (1)
components/clp-py-utils/clp_py_utils/clp_config.py
(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (2)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
51-54
: Reintroducing garbage collector to component sets — LGTMAdding STORAGE_MANAGEMENT_COMPONENTS and including it in ALL_COMPONENTS correctly restores garbage_collector as a valid component across targets and lookups.
875-880
: Include garbage_collector in PRESTO runnable set — LGTM; verified end-to-end pathsConfirmed: get_runnable_components returns COMPRESSION_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS for PRESTO, and start/stop flows will pick up the garbage_collector.
Files/locations to note:
- components/clp-py-utils/clp_py_utils/clp_config.py
- GARBAGE_COLLECTOR_NAME = "garbage_collector" (≈ line 29)
- STORAGE_MANAGEMENT_COMPONENTS = {GARBAGE_COLLECTOR_NAME} (≈ line 51)
- def get_runnable_components(self) -> Set[str]: (lines ~875–879) — returns COMPRESSION_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS when package.query_engine == QueryEngine.PRESTO
- components/clp-package-utils/clp_package_utils/scripts/start_clp.py
- clp_config.get_runnable_components() used at line ~1202; components_to_start ∩ runnable_components at ~1204
- start_garbage_collector(...) defined around line ~1064
- explicit start check for garbage_collector at ~1333
- components/clp-package-utils/clp_package_utils/scripts/stop_clp.py
- references to GARBAGE_COLLECTOR_NAME and stop handling (lines ~88, ~135–136)
No further fixes required.
def get_runnable_components(self) -> Set[str]: | ||
if QueryEngine.PRESTO == self.package.query_engine: | ||
return COMPRESSION_COMPONENTS | UI_COMPONENTS | ||
return COMPRESSION_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS | ||
else: | ||
return ALL_COMPONENTS | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Future-proof PRESTO runnable set by excluding query-only components
To avoid needing to manually union future component groups (e.g., storage-related tools) into the PRESTO branch, derive the set from ALL_COMPONENTS minus query-only components.
Apply:
STORAGE_MANAGEMENT_COMPONENTS = {GARBAGE_COLLECTOR_NAME}
ALL_COMPONENTS = (
COMPRESSION_COMPONENTS | QUERY_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS
)
+
+# Components that are not runnable when using the PRESTO query engine
+PRESTO_INCOMPATIBLE_COMPONENTS = {
+ QUERY_SCHEDULER_COMPONENT_NAME,
+ QUERY_WORKER_COMPONENT_NAME,
+ REDUCER_COMPONENT_NAME,
+}
+PRESTO_RUNNABLE_COMPONENTS = ALL_COMPONENTS - PRESTO_INCOMPATIBLE_COMPONENTS
@@
def get_runnable_components(self) -> Set[str]:
if QueryEngine.PRESTO == self.package.query_engine:
- return COMPRESSION_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS
+ return PRESTO_RUNNABLE_COMPONENTS
else:
return ALL_COMPONENTS
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def get_runnable_components(self) -> Set[str]: | |
if QueryEngine.PRESTO == self.package.query_engine: | |
return COMPRESSION_COMPONENTS | UI_COMPONENTS | |
return COMPRESSION_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS | |
else: | |
return ALL_COMPONENTS | |
STORAGE_MANAGEMENT_COMPONENTS = {GARBAGE_COLLECTOR_NAME} | |
ALL_COMPONENTS = ( | |
COMPRESSION_COMPONENTS | QUERY_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS | |
) | |
# Components that are not runnable when using the PRESTO query engine | |
PRESTO_INCOMPATIBLE_COMPONENTS = { | |
QUERY_SCHEDULER_COMPONENT_NAME, | |
QUERY_WORKER_COMPONENT_NAME, | |
REDUCER_COMPONENT_NAME, | |
} | |
PRESTO_RUNNABLE_COMPONENTS = ALL_COMPONENTS - PRESTO_INCOMPATIBLE_COMPONENTS | |
def get_runnable_components(self) -> Set[str]: | |
if QueryEngine.PRESTO == self.package.query_engine: | |
return PRESTO_RUNNABLE_COMPONENTS | |
else: | |
return ALL_COMPONENTS |
🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/clp_config.py around lines 875 to 880,
the PRESTO branch currently returns a hardcoded union of component groups which
will require manual updates as new non-query components are added; change it to
derive the runnable set by returning ALL_COMPONENTS minus the query-only
components (e.g., ALL_COMPONENTS - QUERY_ONLY_COMPONENTS) so PRESTO
automatically includes new non-query groups; ensure the result type remains
Set[str] and that QUERY_ONLY_COMPONENTS is available in this module or imported.
Also just thought should probably be in controller components as well. |
Lastly, does this garbage collector remove search results in mongoDB? The only difference for native vs. presto is that the searchJobID (collection name) are a hash instead of a counter. Will it still work? |
Oh GG, then i guess it won't work. we will need a regex to match the right collections |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didnt test garbage collector, but the startup scripts now look good. I didnt retest start script after latest changes, but lmk if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
components/clp-package-utils/clp_package_utils/general.py
(2 hunks)components/clp-package-utils/clp_package_utils/scripts/start_clp.py
(3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
components/clp-package-utils/clp_package_utils/general.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
QueryEngine
(88-91)CLPConfig
(703-895)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
validate_retention_config
(586-591)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (4)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)
72-72
: Importing validate_retention_config is correct and scoped to where it’s needed.The new validation belongs in startup and importing it here keeps concerns separated.
1341-1342
: Starting GC based on components_to_start is the right gate.This aligns with the centralized components resolution and avoids coupling GC startup to specific targets.
components/clp-package-utils/clp_package_utils/general.py (2)
19-19
: Re-exporting QueryEngine here is fine for public API ergonomics.This enables downstream imports from clp_package_utils.general without reaching into clp_py_utils.
586-592
: Retention check for PRESTO — keep archive restriction; allow results-cache retentionShort summary: validate_retention_config currently rejects any retention when package.query_engine == QueryEngine.PRESTO because is_retention_period_configured checks both archive and results-cache retention. This broad ban is intentional: archive GC uses CLP's QueryJobs table (see archive_garbage_collector) and cannot track external Presto queries, so allowing archive retention with PRESTO risks deleting archives Presto still needs. The original suggestion to only ban results-cache retention is incorrect.
Files to note:
- components/clp-package-utils/clp_package_utils/general.py — validate_retention_config (L586–L592) and is_retention_period_configured (L594–L601)
- components/job-orchestration/job_orchestration/garbage_collector/archive_garbage_collector.py — _get_archive_safe_expiry_epoch / _collect_and_sweep_expired_archives (GC logic depends on QueryJobs)
- components/clp-py-utils/clp_py_utils/clp_config.py — Package / QueryEngine compatibility and runnable components (PRESTO excludes CLP query components)
Recommended change (scope the error to archive retention; allow results-cache retention and optionally add a user-facing warning):
-def validate_retention_config(clp_config: CLPConfig) -> None: - clp_query_engine = clp_config.package.query_engine - if is_retention_period_configured(clp_config) and clp_query_engine == QueryEngine.PRESTO: - raise ValueError( - f"Retention control is not supported with query_engine `{clp_query_engine}`" - ) +def validate_retention_config(clp_config: CLPConfig) -> None: + clp_query_engine = clp_config.package.query_engine + # PRESTO is incompatible with archive retention because archive GC only considers CLP's + # QueryJobs (it cannot see external Presto queries). Disallow archive retention with PRESTO. + if ( + clp_query_engine == QueryEngine.PRESTO + and clp_config.archive_output.retention_period is not None + ): + raise ValueError( + f"Archive retention is not supported with query_engine `{clp_query_engine}`" + ) + # Results-cache retention (Mongo) can be allowed with PRESTO; consider adding a warning + # to make this behaviour explicit to users if desired.Would you like me to also add a short warning (not an error) when results-cache.retention_period is set with PRESTO to make the behaviour explicit to users?
Likely an incorrect or invalid review comment.
ALL_TARGET_NAME, | ||
CONTROLLER_TARGET_NAME, | ||
GARBAGE_COLLECTOR_NAME, | ||
): | ||
validate_retention_config(clp_config) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Retention validation correctly blocks PRESTO + retention at startup; consider narrowing scope to results-cache retention.
This will fail fast when PRESTO is configured with a retention period, matching the stated validation. If archive retention (archives/streams) is safe to run with PRESTO, you may want to restrict the check to results-cache retention only to avoid over-blocking legitimate configs.
If you want to narrow the check, update the validator in general.py (see my suggestion there) and keep this call site unchanged.
To double-check the target/component wiring on this branch (so GC gets included under CONTROLLER and ALL), you can run:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Confirm STORAGE_MANAGEMENT_COMPONENTS and GC membership"
rg -n -C3 '\bSTORAGE_MANAGEMENT_COMPONENTS\b|GARBAGE_COLLECTOR_NAME' --type=py
echo
echo "Confirm TARGET_TO_COMPONENTS includes STORAGE_MANAGEMENT_COMPONENTS for CONTROLLER and ALL"
rg -n -C3 --type=py 'TARGET_TO_COMPONENTS'
rg -n -C3 --type=py 'CONTROLLER_TARGET_NAME.*STORAGE_MANAGEMENT_COMPONENTS'
rg -n -C3 --type=py 'ALL_TARGET_NAME.*STORAGE_MANAGEMENT_COMPONENTS'
Length of output: 27689
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Search for validate_retention_config definition and references"
rg -n --type=py 'def validate_retention_config|validate_retention_config\(' -C3 || true
echo
echo "Show definition (if found)"
DEF_FILE=$(rg -n --type=py 'def validate_retention_config' -S -o | awk -F: '{print $1; exit}' || true)
if [ -n "$DEF_FILE" ]; then
echo "Definition file: $DEF_FILE"
sed -n '1,240p' "$DEF_FILE"
else
echo "No direct definition found; show all matches instead:"
rg -n --type=py 'validate_retention_config' -C5 || true
fi
Length of output: 8972
Narrow PRESTO + retention validation to results-cache retention (fix required)
validate_retention_config currently rejects startup whenever any retention is configured AND package.query_engine == QueryEngine.PRESTO — this will block PRESTO even if only archive retention is set. Restrict the check to results-cache retention to avoid over-blocking.
-
Files to change
- components/clp-package-utils/clp_package_utils/general.py — function validate_retention_config (around line ~586). Current condition:
- if is_retention_period_configured(clp_config) and clp_query_engine == QueryEngine.PRESTO: raise ValueError(...)
- Call site (leave unchanged):
- components/clp-package-utils/clp_package_utils/scripts/start_clp.py — call at lines ~1256-1261: validate_retention_config(clp_config)
- components/clp-package-utils/clp_package_utils/general.py — function validate_retention_config (around line ~586). Current condition:
-
Suggested fix (replace the broad test with a results-cache specific check):
- Replace
- if is_retention_period_configured(clp_config) and clp_query_engine == QueryEngine.PRESTO:
- With
- if clp_query_engine == QueryEngine.PRESTO and clp_config.results_cache.retention_period is not None:
- Keep the existing error message or update it to mention results-cache explicitly.
- Replace
I verified:
- validate_retention_config is defined in components/clp-package-utils/clp_package_utils/general.py and uses is_retention_period_configured + QueryEngine.PRESTO.
- start_clp.py calls validate_retention_config for ALL / CONTROLLER / GARBAGE_COLLECTOR targets (so the check is exercised when GC is included).
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around line ~586,
the validate_retention_config function currently blocks startup for any
configured retention when clp_query_engine == QueryEngine.PRESTO; narrow this to
only check results-cache retention by changing the condition to test
clp_query_engine == QueryEngine.PRESTO AND
clp_config.results_cache.retention_period is not None, and keep (or update to
mention results-cache) the existing ValueError message so only results-cache
retention prevents PRESTO from starting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the PR title, how about:
fix(clp-package): Start garbage collector as part of the "all" and "controller" targets; Validate retention periods aren't configured when using the Presto query engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🔭 Outside diff range comments (2)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
98-109
: Update docstring to include garbage collectorgenerate_container_auth_options now supports the garbage collector, but the docstring still mentions only compression/log_viewer/query. Update for accuracy.
- :param component_type: Type of calling container (compression, log_viewer, or query). + :param component_type: Type of calling container (compression, webui/log_viewer, query, or garbage_collector).components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py (1)
65-76
: Optional: Exit gracefully when no GC tasks are configuredWhile start_clp skips launching GC if retention isn’t configured, guarding direct execution helps. If no tasks are created, consider logging an info and exiting 0.
Example:
# After creating tasks if not gc_tasks: logger.info("No garbage collector tasks configured; exiting.") return 0
♻️ Duplicate comments (2)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
51-54
: Critical: PRESTO runnable set currently excludes storage management (GC) — GC won’t start under PRESTOBy adding STORAGE_MANAGEMENT_COMPONENTS to ALL_COMPONENTS without updating get_runnable_components for PRESTO, the PRESTO branch still returns COMPRESSION_COMPONENTS | UI_COMPONENTS, which drops GC at runtime (components_to_start ∩ runnable_components). This contradicts the PR discussion to include GC for PRESTO.
Recommend deriving PRESTO_RUNNABLE_COMPONENTS from ALL_COMPONENTS by excluding query-only components (future-proof and avoids manual unions), as previously suggested.
Apply near this block:
STORAGE_MANAGEMENT_COMPONENTS = {GARBAGE_COLLECTOR_COMPONENT_NAME} ALL_COMPONENTS = ( COMPRESSION_COMPONENTS | QUERY_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS ) + +# Components that are not runnable when using the PRESTO query engine +PRESTO_INCOMPATIBLE_COMPONENTS = { + QUERY_SCHEDULER_COMPONENT_NAME, + QUERY_WORKER_COMPONENT_NAME, + REDUCER_COMPONENT_NAME, +} +PRESTO_RUNNABLE_COMPONENTS = ALL_COMPONENTS - PRESTO_INCOMPATIBLE_COMPONENTSAnd update get_runnable_components (shown here as a code snippet since it’s outside the changed hunk):
def get_runnable_components(self) -> Set[str]: if QueryEngine.PRESTO == self.package.query_engine: return PRESTO_RUNNABLE_COMPONENTS return ALL_COMPONENTSTo confirm the current PRESTO set is excluding storage management today:
#!/bin/bash # Show current implementation of get_runnable_components and PRESTO branch rg -n -C2 --type=py $'def get_runnable_components\\b|QueryEngine\\.PRESTO' components/clp-py-utils/clp_py_utils/clp_config.pycomponents/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
1256-1260
: Narrow PRESTO + retention validation to results-cache retention only (avoid over-blocking)Current validate_retention_config blocks PRESTO whenever any retention is configured. If archive retention is safe with PRESTO, restrict the check to results_cache.retention_period to avoid rejecting legitimate PRESTO + archive-retention configs.
Proposed change (in components/clp-package-utils/clp_package_utils/general.py):
def validate_retention_config(clp_config: CLPConfig) -> None: clp_query_engine = clp_config.package.query_engine if clp_query_engine == QueryEngine.PRESTO and clp_config.results_cache.retention_period is not None: raise ValueError( f"Results-cache retention is not supported with query_engine `{clp_query_engine}`" )Please confirm intended scope.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py
(7 hunks)components/clp-package-utils/clp_package_utils/scripts/stop_clp.py
(3 hunks)components/clp-py-utils/clp_py_utils/clp_config.py
(3 hunks)components/clp-py-utils/clp_py_utils/s3_utils.py
(2 hunks)components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py
(2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py (3)
components/clp-py-utils/clp_py_utils/clp_logging.py (1)
get_logger
(17-25)components/clp-package-utils/clp_package_utils/scripts/stop_clp.py (1)
main
(58-211)components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
main
(1156-1351)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
components/clp-package-utils/clp_package_utils/general.py (2)
validate_retention_config
(586-591)validate_output_storage_config
(503-508)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (13)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
29-29
: Consistent constant rename to GARBAGE_COLLECTOR_COMPONENT_NAMEThe rename is clear and consistent with other component-name constants. Good improvement in clarity.
66-68
: Controller now includes storage-management components (GC) — LGTMAdding STORAGE_MANAGEMENT_COMPONENTS to TARGET_TO_COMPONENTS[CONTROLLER_TARGET_NAME] aligns controller orchestration with GC ownership.
components/clp-py-utils/clp_py_utils/s3_utils.py (2)
17-17
: Updated import to GARBAGE_COLLECTOR_COMPONENT_NAME — LGTMImport aligns with the renamed constant in clp_config.
122-129
: Including GC in the component check is correctGC needs access to both archive and stream storages; this inclusion ensures correct AWS auth options and mounts are computed.
components/clp-package-utils/clp_package_utils/scripts/stop_clp.py (2)
14-15
: Import rename to GARBAGE_COLLECTOR_COMPONENT_NAME — LGTM
88-89
: Add GC as a valid subcommand — LGTMThis makes GC manageable via stop_clp.
components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py (3)
27-27
: Use of component-name constant for logger naming — LGTMConsistent with repo-wide rename and helps correlate logs to the component.
31-33
: CLI description uses new component name — LGTM
45-51
: Robust config parsing and error handling — LGTMCatching ValidationError separately improves UX with actionable errors. Generic exception path is appropriate for unknown failures.
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (4)
25-26
: Import rename to GARBAGE_COLLECTOR_COMPONENT_NAME — LGTM
1183-1184
: GC added as a valid subcommand — LGTM
1249-1254
: Including GC in output storage validation — LGTMGC needs valid archive/stream output configs; this validation is appropriate.
1212-1222
: Do not remove GARBAGE_COLLECTOR_COMPONENT_NAME — the archive GC uses the SQL metadata DBArchive garbage collection uses the SQL metadata DB (queries/deletes archive metadata via SQL_Adapter / clp_metadata_db_utils), so DB credentials are required when starting the garbage_collector component.
Relevant evidence:
- components/job-orchestration/job_orchestration/garbage_collector/archive_garbage_collector.py — imports clp_py_utils.clp_metadata_db_utils and SQL_Adapter; _collect_and_sweep_expired_archives / _delete_expired_archives execute SQL queries and open DB connections.
- components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py — uses pymongo (MongoDB) for search-result GC (no SQL).
- components/clp-package-utils/clp_package_utils/scripts/start_clp.py — validate_and_load_db_credentials_file(...) is invoked when the target tuple includes GARBAGE_COLLECTOR_COMPONENT_NAME (around lines ~1214–1222).
Conclusion: keep GARBAGE_COLLECTOR_COMPONENT_NAME in the credentials-validation tuple. If you want to permit starting only the search-result GC without DB creds, implement a conditional check that skips DB validation when archive retention is disabled and only search-result GC is scheduled.
Likely an incorrect or invalid review comment.
if GARBAGE_COLLECTOR_COMPONENT_NAME in components_to_start: | ||
start_garbage_collector(instance_id, clp_config, container_clp_config, mounts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
GC start is correctly gated by components_to_start, but PRESTO currently filters GC out
Given the current implementation of get_runnable_components, PRESTO returns COMPRESSION_COMPONENTS | UI_COMPONENTS, excluding STORAGE_MANAGEMENT_COMPONENTS. That means GC won’t start for PRESTO even when targeted via CONTROLLER or ALL.
This is addressed by my earlier suggestion in clp_config.py to compute PRESTO_RUNNABLE_COMPONENTS = ALL_COMPONENTS - PRESTO_INCOMPATIBLE_COMPONENTS.
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "1) locate clp_config.py"
rg -n --hidden --no-ignore "clp_config.py" || true
echo
echo "2) search for get_runnable_components and PRESTO-related symbols"
rg -n -C3 --hidden --no-ignore "get_runnable_components" || true
rg -n -C3 --hidden --no-ignore "PRESTO_" || true
rg -n -C3 --hidden --no-ignore "STORAGE_MANAGEMENT_COMPONENTS" || true
rg -n -C3 --hidden --no-ignore "COMPRESSION_COMPONENTS" || true
rg -n -C3 --hidden --no-ignore "UI_COMPONENTS" || true
rg -n -C3 --hidden --no-ignore "GARBAGE_COLLECTOR_COMPONENT_NAME" || true
echo
echo "3) show files that define PRESTO runnable/component groups (if found)"
rg -n --hidden --no-ignore "PRESTO" -g '!**/node_modules/**' --files-with-matches | sort -u || true
Length of output: 42289
GC start is correctly gated by components_to_start, but PRESTO currently filters GC out
Verified: clp_config.get_runnable_components returns COMPRESSION_COMPONENTS | UI_COMPONENTS when QueryEngine.PRESTO, which excludes STORAGE_MANAGEMENT_COMPONENTS (garbage_collector). start_clp.py intersects target components with runnable_components before starting, so garbage_collector will not be started for PRESTO.
Please update to compute PRESTO runnable components from ALL_COMPONENTS (so GC can be included when appropriate):
-
Files to inspect/fix:
- components/clp-py-utils/clp_py_utils/clp_config.py
- STORAGE_MANAGEMENT_COMPONENTS / ALL_COMPONENTS (around lines ~51–54)
- get_runnable_components (around lines ~876–880)
- components/clp-package-utils/clp_package_utils/scripts/start_clp.py
- intersection of components_to_start with runnable_components (around lines ~1203–1205)
- GC start call (lines ~1341–1342) — confirmation that gating is correct here
- components/clp-py-utils/clp_py_utils/clp_config.py
-
Suggested change (example patch) to clp_config.py:
# define which components PRESTO cannot run (adjust the set as appropriate)
PRESTO_INCOMPATIBLE_COMPONENTS = {
# e.g. QUERY_COMPONENTS or other components known to be incompatible
}
PRESTO_RUNNABLE_COMPONENTS = ALL_COMPONENTS - PRESTO_INCOMPATIBLE_COMPONENTS
def get_runnable_components(self) -> Set[str]:
if QueryEngine.PRESTO == self.package.query_engine:
return PRESTO_RUNNABLE_COMPONENTS
else:
return ALL_COMPONENTS
Reason: keep the source-of-truth as ALL_COMPONENTS and subtract explicit PRESTO-incompatible pieces to avoid accidentally excluding STORAGE_MANAGEMENT_COMPONENTS (garbage_collector).
🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/clp_config.py around lines ~51–54 and
~876–880, change the PRESTO handling so runnable components are computed from
ALL_COMPONENTS minus an explicit PRESTO_INCOMPATIBLE_COMPONENTS set (i.e.,
define PRESTO_INCOMPATIBLE_COMPONENTS and set PRESTO_RUNNABLE_COMPONENTS =
ALL_COMPONENTS - PRESTO_INCOMPATIBLE_COMPONENTS), and update
get_runnable_components to return PRESTO_RUNNABLE_COMPONENTS when query_engine
is PRESTO (otherwise ALL_COMPONENTS); also verify
STORAGE_MANAGEMENT_COMPONENTS/ALL_COMPONENTS definitions include
garbage_collector. In
components/clp-package-utils/clp_package_utils/scripts/start_clp.py around lines
~1203–1205 and confirm lines 1341–1342, keep the intersection of
components_to_start with runnable_components (no change) so that after the
config fix garbage_collector can be included, and ensure the existing gating at
the GC start call remains intact.
if target in (ALL_TARGET_NAME, GARBAGE_COLLECTOR_COMPONENT_NAME): | ||
container_name = f"clp-{GARBAGE_COLLECTOR_COMPONENT_NAME}-{instance_id}" | ||
stop_running_container(container_name, already_exited_containers, force) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Clean up the GC container config file after stopping (consistency with other components)
Other components (e.g., reducer, schedulers) remove their generated config files from logs_dir after stopping. Do the same for GC to avoid stale files.
- if target in (ALL_TARGET_NAME, GARBAGE_COLLECTOR_COMPONENT_NAME):
- container_name = f"clp-{GARBAGE_COLLECTOR_COMPONENT_NAME}-{instance_id}"
- stop_running_container(container_name, already_exited_containers, force)
+ if target in (ALL_TARGET_NAME, GARBAGE_COLLECTOR_COMPONENT_NAME):
+ container_name = f"clp-{GARBAGE_COLLECTOR_COMPONENT_NAME}-{instance_id}"
+ stop_running_container(container_name, already_exited_containers, force)
+
+ container_config_file_path = logs_dir / f"{container_name}.yml"
+ if container_config_file_path.exists():
+ container_config_file_path.unlink()
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if target in (ALL_TARGET_NAME, GARBAGE_COLLECTOR_COMPONENT_NAME): | |
container_name = f"clp-{GARBAGE_COLLECTOR_COMPONENT_NAME}-{instance_id}" | |
stop_running_container(container_name, already_exited_containers, force) | |
if target in (ALL_TARGET_NAME, GARBAGE_COLLECTOR_COMPONENT_NAME): | |
container_name = f"clp-{GARBAGE_COLLECTOR_COMPONENT_NAME}-{instance_id}" | |
stop_running_container(container_name, already_exited_containers, force) | |
container_config_file_path = logs_dir / f"{container_name}.yml" | |
if container_config_file_path.exists(): | |
container_config_file_path.unlink() |
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/stop_clp.py around
lines 135 to 137, after stopping the GC container you need to also remove its
generated config file from logs_dir to match other components; locate the GC
config filename using the same naming convention used elsewhere (component name
+ instance id, e.g. GARBAGE_COLLECTOR_COMPONENT_NAME and instance_id) or derive
it from container_name, build the full path under logs_dir, check if the file
exists and unlink it, and wrap the remove in a try/except that logs a
debug/error on failure so stale GC config files are cleaned up safely.
Description
This PR fixes an issue introduced by #1035 where the garbage collector was not properly merged to use new components list feature.
We decided to not add garbage collector to presto because:
This PR also updated
GARBAGE_COLLECTOR_NAME
toGARBAGE_COLLECTOR_COMPONENT_NAME
to follow the naming conventionChecklist
breaking change.
Validation performed
Confirmed that all the following commands launch garbage collector:
Confirmed that when using presto as query_engine, specfying retention period results in an error when running start-clp.sh
Summary by CodeRabbit
New Features
Behaviour Changes