Skip to content

Conversation

haiqi96
Copy link
Contributor

@haiqi96 haiqi96 commented Aug 14, 2025

Description

This PR fixes an issue introduced by #1035 where the garbage collector was not properly merged to use new components list feature.

We decided to not add garbage collector to presto because:

  1. The search results collection doesn't follow the integer convention
  2. We don't have the race condition guard in presto.

This PR also updated GARBAGE_COLLECTOR_NAME to GARBAGE_COLLECTOR_COMPONENT_NAME to follow the naming convention

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Confirmed that all the following commands launch garbage collector:

  • sbin/start-clp.sh
  • sbin/start-clp.sh garbage_collector
    • sbin/start-clp.sh controller
      Confirmed that when using presto as query_engine, specfying retention period results in an error when running start-clp.sh

Summary by CodeRabbit

  • New Features

    • Added a Storage Management (Garbage Collector) component and included it in the global component set.
  • Behaviour Changes

    • Startup now starts the Garbage Collector when that component is selected; stop/start CLI paths align with the component selection.
    • Retention configuration is validated at startup and is blocked when using the PRESTO query engine.
    • Improved runtime monitoring and logging for garbage-collector tasks.

Copy link
Contributor

coderabbitai bot commented Aug 14, 2025

Walkthrough

Renames garbage-collector constant to GARBAGE_COLLECTOR_COMPONENT_NAME, adds STORAGE_MANAGEMENT_COMPONENTS and includes it in ALL_COMPONENTS and TARGET_TO_COMPONENTS, adds validate_retention_config (blocks retention when query_engine is PRESTO), updates start/stop scripts and s3_utils to use the new name, and refactors garbage-collector task startup and monitoring.

Changes

Cohort / File(s) Summary of Changes
Config constants & sets
components/clp-py-utils/clp_py_utils/clp_config.py
Renamed GARBAGE_COLLECTOR_NAMEGARBAGE_COLLECTOR_COMPONENT_NAME; added STORAGE_MANAGEMENT_COMPONENTS = {GARBAGE_COLLECTOR_COMPONENT_NAME}; updated ALL_COMPONENTS and TARGET_TO_COMPONENTS to include storage-management components.
Startup script & CLI
components/clp-package-utils/clp_package_utils/scripts/start_clp.py
Switched imports/usages to GARBAGE_COLLECTOR_COMPONENT_NAME; import and invoke validate_retention_config; gate GC startup by membership in resolved components_to_start and log retention state.
Shutdown script
components/clp-package-utils/clp_package_utils/scripts/stop_clp.py
Switched usages to GARBAGE_COLLECTOR_COMPONENT_NAME for parser targets, membership checks, and container naming (clp-{component}-{instance_id}).
General utilities & validation
components/clp-package-utils/clp_package_utils/general.py
Added validate_retention_config(clp_config: CLPConfig) which raises when retention is configured with QueryEngine.PRESTO; re-exported QueryEngine from clp_config.
S3 helpers
components/clp-py-utils/clp_py_utils/s3_utils.py
Updated import and membership checks to use GARBAGE_COLLECTOR_COMPONENT_NAME instead of the old name.
Job orchestration: garbage collector
components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py
Updated imports/logging to use GARBAGE_COLLECTOR_COMPONENT_NAME; added robust config parse error handling; build GC tasks from config retention_periods, skip when None; improved task-start logging and wait-loop monitoring with explicit error/termination handling.

Sequence Diagram(s)

sequenceDiagram
    participant CLI
    participant ConfigLoader
    participant Validator
    participant Resolver
    participant Launcher
    participant GC as GarbageCollector

    CLI->>ConfigLoader: read config & target
    ConfigLoader->>Validator: validate_dataset_name(clp_config)
    alt target ∈ {ALL, CONTROLLER, GARBAGE_COLLECTOR}
        ConfigLoader->>Validator: validate_retention_config(clp_config)
        Validator-->>ConfigLoader: ok / raises ValueError (if PRESTO + retention)
    end
    ConfigLoader->>Resolver: resolve components_to_start(target, flags)
    Resolver-->>Launcher: components_to_start
    alt GARBAGE_COLLECTOR_COMPONENT_NAME in components_to_start
        Launcher->>GC: start tasks (based on retention_periods)
        GC->>Launcher: report task status / errors
    else
        Launcher-->>GC: do not start
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • davemarco
  • LinZhihao-723

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@haiqi96 haiqi96 marked this pull request as ready for review August 14, 2025 18:29
@haiqi96 haiqi96 requested a review from a team as a code owner August 14, 2025 18:29
@haiqi96
Copy link
Contributor Author

haiqi96 commented Aug 14, 2025

@davemarco Can you please review this?

@davemarco
Copy link
Contributor

davemarco commented Aug 14, 2025

do we want this with presto? my instinct is yes so we should add to presto components as well

@haiqi96
Copy link
Contributor Author

haiqi96 commented Aug 14, 2025

do we want this with presto? my instinct is yes so we should add to presto components as well

Thanks for raising this point. Updated the code to add the component to presto

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)

711-718: Type annotation bug: reducer field annotated as an instance, not a type

The field is declared as reducer: Reducer() = Reducer(). The annotation must be the class Reducer, not an instance. This will break type checking and may confuse pydantic.

Apply:

-    reducer: Reducer() = Reducer()
+    reducer: Reducer = Reducer()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d5ffb33 and 0f556b5.

📒 Files selected for processing (1)
  • components/clp-py-utils/clp_py_utils/clp_config.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (2)
components/clp-py-utils/clp_py_utils/clp_config.py (2)

51-54: Reintroducing garbage collector to component sets — LGTM

Adding STORAGE_MANAGEMENT_COMPONENTS and including it in ALL_COMPONENTS correctly restores garbage_collector as a valid component across targets and lookups.


875-880: Include garbage_collector in PRESTO runnable set — LGTM; verified end-to-end paths

Confirmed: get_runnable_components returns COMPRESSION_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS for PRESTO, and start/stop flows will pick up the garbage_collector.

Files/locations to note:

  • components/clp-py-utils/clp_py_utils/clp_config.py
    • GARBAGE_COLLECTOR_NAME = "garbage_collector" (≈ line 29)
    • STORAGE_MANAGEMENT_COMPONENTS = {GARBAGE_COLLECTOR_NAME} (≈ line 51)
    • def get_runnable_components(self) -> Set[str]: (lines ~875–879) — returns COMPRESSION_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS when package.query_engine == QueryEngine.PRESTO
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
    • clp_config.get_runnable_components() used at line ~1202; components_to_start ∩ runnable_components at ~1204
    • start_garbage_collector(...) defined around line ~1064
    • explicit start check for garbage_collector at ~1333
  • components/clp-package-utils/clp_package_utils/scripts/stop_clp.py
    • references to GARBAGE_COLLECTOR_NAME and stop handling (lines ~88, ~135–136)

No further fixes required.

Comment on lines 875 to 880
def get_runnable_components(self) -> Set[str]:
if QueryEngine.PRESTO == self.package.query_engine:
return COMPRESSION_COMPONENTS | UI_COMPONENTS
return COMPRESSION_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS
else:
return ALL_COMPONENTS

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Future-proof PRESTO runnable set by excluding query-only components

To avoid needing to manually union future component groups (e.g., storage-related tools) into the PRESTO branch, derive the set from ALL_COMPONENTS minus query-only components.

Apply:

 STORAGE_MANAGEMENT_COMPONENTS = {GARBAGE_COLLECTOR_NAME}
 ALL_COMPONENTS = (
     COMPRESSION_COMPONENTS | QUERY_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS
 )
+
+# Components that are not runnable when using the PRESTO query engine
+PRESTO_INCOMPATIBLE_COMPONENTS = {
+    QUERY_SCHEDULER_COMPONENT_NAME,
+    QUERY_WORKER_COMPONENT_NAME,
+    REDUCER_COMPONENT_NAME,
+}
+PRESTO_RUNNABLE_COMPONENTS = ALL_COMPONENTS - PRESTO_INCOMPATIBLE_COMPONENTS
@@
     def get_runnable_components(self) -> Set[str]:
         if QueryEngine.PRESTO == self.package.query_engine:
-            return COMPRESSION_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS
+            return PRESTO_RUNNABLE_COMPONENTS
         else:
             return ALL_COMPONENTS
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def get_runnable_components(self) -> Set[str]:
if QueryEngine.PRESTO == self.package.query_engine:
return COMPRESSION_COMPONENTS | UI_COMPONENTS
return COMPRESSION_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS
else:
return ALL_COMPONENTS
STORAGE_MANAGEMENT_COMPONENTS = {GARBAGE_COLLECTOR_NAME}
ALL_COMPONENTS = (
COMPRESSION_COMPONENTS | QUERY_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS
)
# Components that are not runnable when using the PRESTO query engine
PRESTO_INCOMPATIBLE_COMPONENTS = {
QUERY_SCHEDULER_COMPONENT_NAME,
QUERY_WORKER_COMPONENT_NAME,
REDUCER_COMPONENT_NAME,
}
PRESTO_RUNNABLE_COMPONENTS = ALL_COMPONENTS - PRESTO_INCOMPATIBLE_COMPONENTS
def get_runnable_components(self) -> Set[str]:
if QueryEngine.PRESTO == self.package.query_engine:
return PRESTO_RUNNABLE_COMPONENTS
else:
return ALL_COMPONENTS
🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/clp_config.py around lines 875 to 880,
the PRESTO branch currently returns a hardcoded union of component groups which
will require manual updates as new non-query components are added; change it to
derive the runnable set by returning ALL_COMPONENTS minus the query-only
components (e.g., ALL_COMPONENTS - QUERY_ONLY_COMPONENTS) so PRESTO
automatically includes new non-query groups; ensure the result type remains
Set[str] and that QUERY_ONLY_COMPONENTS is available in this module or imported.

@davemarco
Copy link
Contributor

Also just thought should probably be in controller components as well.

@davemarco
Copy link
Contributor

davemarco commented Aug 14, 2025

Lastly, does this garbage collector remove search results in mongoDB? The only difference for native vs. presto is that the searchJobID (collection name) are a hash instead of a counter. Will it still work?

@haiqi96
Copy link
Contributor Author

haiqi96 commented Aug 14, 2025

Lastly, does this garbage collector remove search results in mongoDB? The only difference for native vs. presto is that the searchJobID (collection name) are a hash instead of a counter. Will it still work?

Oh GG, then i guess it won't work. we will need a regex to match the right collections

@davemarco davemarco self-requested a review August 15, 2025 00:33
davemarco
davemarco previously approved these changes Aug 15, 2025
Copy link
Contributor

@davemarco davemarco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didnt test garbage collector, but the startup scripts now look good. I didnt retest start script after latest changes, but lmk if needed

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 7197b09 and 21e254d.

📒 Files selected for processing (2)
  • components/clp-package-utils/clp_package_utils/general.py (2 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
components/clp-package-utils/clp_package_utils/general.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
  • QueryEngine (88-91)
  • CLPConfig (703-895)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
  • validate_retention_config (586-591)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (4)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)

72-72: Importing validate_retention_config is correct and scoped to where it’s needed.

The new validation belongs in startup and importing it here keeps concerns separated.


1341-1342: Starting GC based on components_to_start is the right gate.

This aligns with the centralized components resolution and avoids coupling GC startup to specific targets.

components/clp-package-utils/clp_package_utils/general.py (2)

19-19: Re-exporting QueryEngine here is fine for public API ergonomics.

This enables downstream imports from clp_package_utils.general without reaching into clp_py_utils.


586-592: Retention check for PRESTO — keep archive restriction; allow results-cache retention

Short summary: validate_retention_config currently rejects any retention when package.query_engine == QueryEngine.PRESTO because is_retention_period_configured checks both archive and results-cache retention. This broad ban is intentional: archive GC uses CLP's QueryJobs table (see archive_garbage_collector) and cannot track external Presto queries, so allowing archive retention with PRESTO risks deleting archives Presto still needs. The original suggestion to only ban results-cache retention is incorrect.

Files to note:

  • components/clp-package-utils/clp_package_utils/general.py — validate_retention_config (L586–L592) and is_retention_period_configured (L594–L601)
  • components/job-orchestration/job_orchestration/garbage_collector/archive_garbage_collector.py — _get_archive_safe_expiry_epoch / _collect_and_sweep_expired_archives (GC logic depends on QueryJobs)
  • components/clp-py-utils/clp_py_utils/clp_config.py — Package / QueryEngine compatibility and runnable components (PRESTO excludes CLP query components)

Recommended change (scope the error to archive retention; allow results-cache retention and optionally add a user-facing warning):

-def validate_retention_config(clp_config: CLPConfig) -> None:
-    clp_query_engine = clp_config.package.query_engine
-    if is_retention_period_configured(clp_config) and clp_query_engine == QueryEngine.PRESTO:
-        raise ValueError(
-            f"Retention control is not supported with query_engine `{clp_query_engine}`"
-        )
+def validate_retention_config(clp_config: CLPConfig) -> None:
+    clp_query_engine = clp_config.package.query_engine
+    # PRESTO is incompatible with archive retention because archive GC only considers CLP's
+    # QueryJobs (it cannot see external Presto queries). Disallow archive retention with PRESTO.
+    if (
+        clp_query_engine == QueryEngine.PRESTO
+        and clp_config.archive_output.retention_period is not None
+    ):
+        raise ValueError(
+            f"Archive retention is not supported with query_engine `{clp_query_engine}`"
+        )
+    # Results-cache retention (Mongo) can be allowed with PRESTO; consider adding a warning
+    # to make this behaviour explicit to users if desired.

Would you like me to also add a short warning (not an error) when results-cache.retention_period is set with PRESTO to make the behaviour explicit to users?

Likely an incorrect or invalid review comment.

Comment on lines 1256 to 1261
ALL_TARGET_NAME,
CONTROLLER_TARGET_NAME,
GARBAGE_COLLECTOR_NAME,
):
validate_retention_config(clp_config)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Retention validation correctly blocks PRESTO + retention at startup; consider narrowing scope to results-cache retention.

This will fail fast when PRESTO is configured with a retention period, matching the stated validation. If archive retention (archives/streams) is safe to run with PRESTO, you may want to restrict the check to results-cache retention only to avoid over-blocking legitimate configs.

If you want to narrow the check, update the validator in general.py (see my suggestion there) and keep this call site unchanged.

To double-check the target/component wiring on this branch (so GC gets included under CONTROLLER and ALL), you can run:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Confirm STORAGE_MANAGEMENT_COMPONENTS and GC membership"
rg -n -C3 '\bSTORAGE_MANAGEMENT_COMPONENTS\b|GARBAGE_COLLECTOR_NAME' --type=py

echo
echo "Confirm TARGET_TO_COMPONENTS includes STORAGE_MANAGEMENT_COMPONENTS for CONTROLLER and ALL"
rg -n -C3 --type=py 'TARGET_TO_COMPONENTS'
rg -n -C3 --type=py 'CONTROLLER_TARGET_NAME.*STORAGE_MANAGEMENT_COMPONENTS'
rg -n -C3 --type=py 'ALL_TARGET_NAME.*STORAGE_MANAGEMENT_COMPONENTS'

Length of output: 27689


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Search for validate_retention_config definition and references"
rg -n --type=py 'def validate_retention_config|validate_retention_config\(' -C3 || true

echo
echo "Show definition (if found)"
DEF_FILE=$(rg -n --type=py 'def validate_retention_config' -S -o | awk -F: '{print $1; exit}' || true)
if [ -n "$DEF_FILE" ]; then
  echo "Definition file: $DEF_FILE"
  sed -n '1,240p' "$DEF_FILE"
else
  echo "No direct definition found; show all matches instead:"
  rg -n --type=py 'validate_retention_config' -C5 || true
fi

Length of output: 8972


Narrow PRESTO + retention validation to results-cache retention (fix required)

validate_retention_config currently rejects startup whenever any retention is configured AND package.query_engine == QueryEngine.PRESTO — this will block PRESTO even if only archive retention is set. Restrict the check to results-cache retention to avoid over-blocking.

  • Files to change

    • components/clp-package-utils/clp_package_utils/general.py — function validate_retention_config (around line ~586). Current condition:
      • if is_retention_period_configured(clp_config) and clp_query_engine == QueryEngine.PRESTO: raise ValueError(...)
    • Call site (leave unchanged):
      • components/clp-package-utils/clp_package_utils/scripts/start_clp.py — call at lines ~1256-1261: validate_retention_config(clp_config)
  • Suggested fix (replace the broad test with a results-cache specific check):

    • Replace
      • if is_retention_period_configured(clp_config) and clp_query_engine == QueryEngine.PRESTO:
    • With
      • if clp_query_engine == QueryEngine.PRESTO and clp_config.results_cache.retention_period is not None:
    • Keep the existing error message or update it to mention results-cache explicitly.

I verified:

  • validate_retention_config is defined in components/clp-package-utils/clp_package_utils/general.py and uses is_retention_period_configured + QueryEngine.PRESTO.
  • start_clp.py calls validate_retention_config for ALL / CONTROLLER / GARBAGE_COLLECTOR targets (so the check is exercised when GC is included).
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around line ~586,
the validate_retention_config function currently blocks startup for any
configured retention when clp_query_engine == QueryEngine.PRESTO; narrow this to
only check results-cache retention by changing the condition to test
clp_query_engine == QueryEngine.PRESTO AND
clp_config.results_cache.retention_period is not None, and keep (or update to
mention results-cache) the existing ValueError message so only results-cache
retention prevents PRESTO from starting.

kirkrodrigues
kirkrodrigues previously approved these changes Aug 15, 2025
Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

fix(clp-package): Start garbage collector as part of the "all" and "controller" targets; Validate retention periods aren't configured when using the Presto query engine.

@haiqi96 haiqi96 dismissed stale reviews from kirkrodrigues and davemarco via 3a3f0b6 August 15, 2025 01:57
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🔭 Outside diff range comments (2)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)

98-109: Update docstring to include garbage collector

generate_container_auth_options now supports the garbage collector, but the docstring still mentions only compression/log_viewer/query. Update for accuracy.

-    :param component_type: Type of calling container (compression, log_viewer, or query).
+    :param component_type: Type of calling container (compression, webui/log_viewer, query, or garbage_collector).
components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py (1)

65-76: Optional: Exit gracefully when no GC tasks are configured

While start_clp skips launching GC if retention isn’t configured, guarding direct execution helps. If no tasks are created, consider logging an info and exiting 0.

Example:

# After creating tasks
if not gc_tasks:
    logger.info("No garbage collector tasks configured; exiting.")
    return 0
♻️ Duplicate comments (2)
components/clp-py-utils/clp_py_utils/clp_config.py (1)

51-54: Critical: PRESTO runnable set currently excludes storage management (GC) — GC won’t start under PRESTO

By adding STORAGE_MANAGEMENT_COMPONENTS to ALL_COMPONENTS without updating get_runnable_components for PRESTO, the PRESTO branch still returns COMPRESSION_COMPONENTS | UI_COMPONENTS, which drops GC at runtime (components_to_start ∩ runnable_components). This contradicts the PR discussion to include GC for PRESTO.

Recommend deriving PRESTO_RUNNABLE_COMPONENTS from ALL_COMPONENTS by excluding query-only components (future-proof and avoids manual unions), as previously suggested.

Apply near this block:

 STORAGE_MANAGEMENT_COMPONENTS = {GARBAGE_COLLECTOR_COMPONENT_NAME}
 ALL_COMPONENTS = (
     COMPRESSION_COMPONENTS | QUERY_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS
 )
+
+# Components that are not runnable when using the PRESTO query engine
+PRESTO_INCOMPATIBLE_COMPONENTS = {
+    QUERY_SCHEDULER_COMPONENT_NAME,
+    QUERY_WORKER_COMPONENT_NAME,
+    REDUCER_COMPONENT_NAME,
+}
+PRESTO_RUNNABLE_COMPONENTS = ALL_COMPONENTS - PRESTO_INCOMPATIBLE_COMPONENTS

And update get_runnable_components (shown here as a code snippet since it’s outside the changed hunk):

def get_runnable_components(self) -> Set[str]:
    if QueryEngine.PRESTO == self.package.query_engine:
        return PRESTO_RUNNABLE_COMPONENTS
    return ALL_COMPONENTS

To confirm the current PRESTO set is excluding storage management today:

#!/bin/bash
# Show current implementation of get_runnable_components and PRESTO branch
rg -n -C2 --type=py $'def get_runnable_components\\b|QueryEngine\\.PRESTO' components/clp-py-utils/clp_py_utils/clp_config.py
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)

1256-1260: Narrow PRESTO + retention validation to results-cache retention only (avoid over-blocking)

Current validate_retention_config blocks PRESTO whenever any retention is configured. If archive retention is safe with PRESTO, restrict the check to results_cache.retention_period to avoid rejecting legitimate PRESTO + archive-retention configs.

Proposed change (in components/clp-package-utils/clp_package_utils/general.py):

def validate_retention_config(clp_config: CLPConfig) -> None:
    clp_query_engine = clp_config.package.query_engine
    if clp_query_engine == QueryEngine.PRESTO and clp_config.results_cache.retention_period is not None:
        raise ValueError(
            f"Results-cache retention is not supported with query_engine `{clp_query_engine}`"
        )

Please confirm intended scope.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 21e254d and 3a3f0b6.

📒 Files selected for processing (5)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (7 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/stop_clp.py (3 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (3 hunks)
  • components/clp-py-utils/clp_py_utils/s3_utils.py (2 hunks)
  • components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py (3)
components/clp-py-utils/clp_py_utils/clp_logging.py (1)
  • get_logger (17-25)
components/clp-package-utils/clp_package_utils/scripts/stop_clp.py (1)
  • main (58-211)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
  • main (1156-1351)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
components/clp-package-utils/clp_package_utils/general.py (2)
  • validate_retention_config (586-591)
  • validate_output_storage_config (503-508)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (13)
components/clp-py-utils/clp_py_utils/clp_config.py (2)

29-29: Consistent constant rename to GARBAGE_COLLECTOR_COMPONENT_NAME

The rename is clear and consistent with other component-name constants. Good improvement in clarity.


66-68: Controller now includes storage-management components (GC) — LGTM

Adding STORAGE_MANAGEMENT_COMPONENTS to TARGET_TO_COMPONENTS[CONTROLLER_TARGET_NAME] aligns controller orchestration with GC ownership.

components/clp-py-utils/clp_py_utils/s3_utils.py (2)

17-17: Updated import to GARBAGE_COLLECTOR_COMPONENT_NAME — LGTM

Import aligns with the renamed constant in clp_config.


122-129: Including GC in the component check is correct

GC needs access to both archive and stream storages; this inclusion ensures correct AWS auth options and mounts are computed.

components/clp-package-utils/clp_package_utils/scripts/stop_clp.py (2)

14-15: Import rename to GARBAGE_COLLECTOR_COMPONENT_NAME — LGTM


88-89: Add GC as a valid subcommand — LGTM

This makes GC manageable via stop_clp.

components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py (3)

27-27: Use of component-name constant for logger naming — LGTM

Consistent with repo-wide rename and helps correlate logs to the component.


31-33: CLI description uses new component name — LGTM


45-51: Robust config parsing and error handling — LGTM

Catching ValidationError separately improves UX with actionable errors. Generic exception path is appropriate for unknown failures.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (4)

25-26: Import rename to GARBAGE_COLLECTOR_COMPONENT_NAME — LGTM


1183-1184: GC added as a valid subcommand — LGTM


1249-1254: Including GC in output storage validation — LGTM

GC needs valid archive/stream output configs; this validation is appropriate.


1212-1222: Do not remove GARBAGE_COLLECTOR_COMPONENT_NAME — the archive GC uses the SQL metadata DB

Archive garbage collection uses the SQL metadata DB (queries/deletes archive metadata via SQL_Adapter / clp_metadata_db_utils), so DB credentials are required when starting the garbage_collector component.

Relevant evidence:

  • components/job-orchestration/job_orchestration/garbage_collector/archive_garbage_collector.py — imports clp_py_utils.clp_metadata_db_utils and SQL_Adapter; _collect_and_sweep_expired_archives / _delete_expired_archives execute SQL queries and open DB connections.
  • components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py — uses pymongo (MongoDB) for search-result GC (no SQL).
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py — validate_and_load_db_credentials_file(...) is invoked when the target tuple includes GARBAGE_COLLECTOR_COMPONENT_NAME (around lines ~1214–1222).

Conclusion: keep GARBAGE_COLLECTOR_COMPONENT_NAME in the credentials-validation tuple. If you want to permit starting only the search-result GC without DB creds, implement a conditional check that skips DB validation when archive retention is disabled and only search-result GC is scheduled.

Likely an incorrect or invalid review comment.

Comment on lines +1341 to 1342
if GARBAGE_COLLECTOR_COMPONENT_NAME in components_to_start:
start_garbage_collector(instance_id, clp_config, container_clp_config, mounts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

GC start is correctly gated by components_to_start, but PRESTO currently filters GC out

Given the current implementation of get_runnable_components, PRESTO returns COMPRESSION_COMPONENTS | UI_COMPONENTS, excluding STORAGE_MANAGEMENT_COMPONENTS. That means GC won’t start for PRESTO even when targeted via CONTROLLER or ALL.

This is addressed by my earlier suggestion in clp_config.py to compute PRESTO_RUNNABLE_COMPONENTS = ALL_COMPONENTS - PRESTO_INCOMPATIBLE_COMPONENTS.


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "1) locate clp_config.py"
rg -n --hidden --no-ignore "clp_config.py" || true

echo
echo "2) search for get_runnable_components and PRESTO-related symbols"
rg -n -C3 --hidden --no-ignore "get_runnable_components" || true
rg -n -C3 --hidden --no-ignore "PRESTO_" || true
rg -n -C3 --hidden --no-ignore "STORAGE_MANAGEMENT_COMPONENTS" || true
rg -n -C3 --hidden --no-ignore "COMPRESSION_COMPONENTS" || true
rg -n -C3 --hidden --no-ignore "UI_COMPONENTS" || true
rg -n -C3 --hidden --no-ignore "GARBAGE_COLLECTOR_COMPONENT_NAME" || true

echo
echo "3) show files that define PRESTO runnable/component groups (if found)"
rg -n --hidden --no-ignore "PRESTO" -g '!**/node_modules/**' --files-with-matches | sort -u || true

Length of output: 42289


GC start is correctly gated by components_to_start, but PRESTO currently filters GC out

Verified: clp_config.get_runnable_components returns COMPRESSION_COMPONENTS | UI_COMPONENTS when QueryEngine.PRESTO, which excludes STORAGE_MANAGEMENT_COMPONENTS (garbage_collector). start_clp.py intersects target components with runnable_components before starting, so garbage_collector will not be started for PRESTO.

Please update to compute PRESTO runnable components from ALL_COMPONENTS (so GC can be included when appropriate):

  • Files to inspect/fix:

    • components/clp-py-utils/clp_py_utils/clp_config.py
      • STORAGE_MANAGEMENT_COMPONENTS / ALL_COMPONENTS (around lines ~51–54)
      • get_runnable_components (around lines ~876–880)
    • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
      • intersection of components_to_start with runnable_components (around lines ~1203–1205)
      • GC start call (lines ~1341–1342) — confirmation that gating is correct here
  • Suggested change (example patch) to clp_config.py:

# define which components PRESTO cannot run (adjust the set as appropriate)
PRESTO_INCOMPATIBLE_COMPONENTS = {
    # e.g. QUERY_COMPONENTS or other components known to be incompatible
}

PRESTO_RUNNABLE_COMPONENTS = ALL_COMPONENTS - PRESTO_INCOMPATIBLE_COMPONENTS

def get_runnable_components(self) -> Set[str]:
    if QueryEngine.PRESTO == self.package.query_engine:
        return PRESTO_RUNNABLE_COMPONENTS
    else:
        return ALL_COMPONENTS

Reason: keep the source-of-truth as ALL_COMPONENTS and subtract explicit PRESTO-incompatible pieces to avoid accidentally excluding STORAGE_MANAGEMENT_COMPONENTS (garbage_collector).

🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/clp_config.py around lines ~51–54 and
~876–880, change the PRESTO handling so runnable components are computed from
ALL_COMPONENTS minus an explicit PRESTO_INCOMPATIBLE_COMPONENTS set (i.e.,
define PRESTO_INCOMPATIBLE_COMPONENTS and set PRESTO_RUNNABLE_COMPONENTS =
ALL_COMPONENTS - PRESTO_INCOMPATIBLE_COMPONENTS), and update
get_runnable_components to return PRESTO_RUNNABLE_COMPONENTS when query_engine
is PRESTO (otherwise ALL_COMPONENTS); also verify
STORAGE_MANAGEMENT_COMPONENTS/ALL_COMPONENTS definitions include
garbage_collector. In
components/clp-package-utils/clp_package_utils/scripts/start_clp.py around lines
~1203–1205 and confirm lines 1341–1342, keep the intersection of
components_to_start with runnable_components (no change) so that after the
config fix garbage_collector can be included, and ensure the existing gating at
the GC start call remains intact.

Comment on lines +135 to 137
if target in (ALL_TARGET_NAME, GARBAGE_COLLECTOR_COMPONENT_NAME):
container_name = f"clp-{GARBAGE_COLLECTOR_COMPONENT_NAME}-{instance_id}"
stop_running_container(container_name, already_exited_containers, force)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Clean up the GC container config file after stopping (consistency with other components)

Other components (e.g., reducer, schedulers) remove their generated config files from logs_dir after stopping. Do the same for GC to avoid stale files.

-        if target in (ALL_TARGET_NAME, GARBAGE_COLLECTOR_COMPONENT_NAME):
-            container_name = f"clp-{GARBAGE_COLLECTOR_COMPONENT_NAME}-{instance_id}"
-            stop_running_container(container_name, already_exited_containers, force)
+        if target in (ALL_TARGET_NAME, GARBAGE_COLLECTOR_COMPONENT_NAME):
+            container_name = f"clp-{GARBAGE_COLLECTOR_COMPONENT_NAME}-{instance_id}"
+            stop_running_container(container_name, already_exited_containers, force)
+
+            container_config_file_path = logs_dir / f"{container_name}.yml"
+            if container_config_file_path.exists():
+                container_config_file_path.unlink()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if target in (ALL_TARGET_NAME, GARBAGE_COLLECTOR_COMPONENT_NAME):
container_name = f"clp-{GARBAGE_COLLECTOR_COMPONENT_NAME}-{instance_id}"
stop_running_container(container_name, already_exited_containers, force)
if target in (ALL_TARGET_NAME, GARBAGE_COLLECTOR_COMPONENT_NAME):
container_name = f"clp-{GARBAGE_COLLECTOR_COMPONENT_NAME}-{instance_id}"
stop_running_container(container_name, already_exited_containers, force)
container_config_file_path = logs_dir / f"{container_name}.yml"
if container_config_file_path.exists():
container_config_file_path.unlink()
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/stop_clp.py around
lines 135 to 137, after stopping the GC container you need to also remove its
generated config file from logs_dir to match other components; locate the GC
config filename using the same naming convention used elsewhere (component name
+ instance id, e.g. GARBAGE_COLLECTOR_COMPONENT_NAME and instance_id) or derive
it from container_name, build the full path under logs_dir, check if the file
exists and unlink it, and wrap the remove in a try/except that logs a
debug/error on failure so stale GC config files are cleaned up safely.

@haiqi96 haiqi96 changed the title fix(package): Add garbage collector to valid components list. fix(clp-package): Start garbage collector as part of the "all" and "controller" targets; Validate retention periods aren't configured when using the Presto query engine. Aug 15, 2025
@haiqi96 haiqi96 merged commit 0892a1c into y-scope:main Aug 15, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants