
Conversation

quinntaylormitchell (Collaborator) commented Jul 25, 2025

Description

This PR fixes issue #308 by changing the reporting system that provides the user with information about the status of their compression jobs.

Any compression job that encounters at least one invalid path is marked as failed. The user is notified that at least one invalid path was encountered and is directed to a log file that lists all of the invalid paths.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Compressed three arrangements of log files:

  1. one real, one fake
  2. one real, two fake
  3. two fake

The correct log file was produced in all cases, and the user was notified correctly in all cases.

Summary by CodeRabbit

  • New Features
    • User-facing log generated for invalid file-system paths, with a link in the job status message.
    • Clearer failure messages for S3 issues and unsupported input types.
  • Bug Fixes
    • Stricter validation of input paths (including nested paths) prevents partial scheduling when any path is invalid.
    • Improved error handling ensures jobs fail gracefully with actionable messages.
    • Jobs with no generated tasks are no longer automatically failed, reducing false negatives and enabling follow-up via invalid-path logs.

coderabbitai bot (Contributor) commented Jul 25, 2025

Walkthrough

The scheduler’s filesystem input processing now validates all paths, collects invalid-path messages, and returns them. A new helper writes these messages to a per-job user log. The main search-and-schedule flow handles FS invalid paths and broadens S3 error handling, updates job status accordingly, and tweaks task buffer flushing and sequencing.

Changes

  • FS input validation return + gating
    components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
    _process_fs_input_paths now returns a List[str] of invalid-path messages. Introduces paths_ok gating to stop buffering after the first invalid path; accumulates and logs all invalid internal paths; updated docstring.
  • User-facing failed-path log helper
    components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
    Adds write_failed_path_log(invalid_path_messages, logs_directory, job_id) -> Optional[Path] to write the collected invalid-path messages to <logs_directory>/user/failed_paths_<job_id>.txt.
  • Scheduler error handling and flow
    components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
    FS: collects invalid messages, writes the user log if present, marks the job FAILED with a status message, commits, and continues. S3: broader try/except; on error, marks the job FAILED with "S3 Failure: {err}". Unsupported input types explicitly mark the job FAILED. Adjusted sequencing: flush the buffer, then obtain tasks/partitions. Commented out the block that previously failed jobs when no tasks were produced.
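For illustration, here is a minimal sketch of a helper matching the summary above. The signature and the <logs_directory>/user/failed_paths_<job_id>.txt destination come from this walkthrough; the directory creation, error handling, and logger name are assumptions rather than the PR's actual implementation:

import logging
from pathlib import Path
from typing import List, Optional

logger = logging.getLogger(__name__)  # stand-in for the scheduler's module logger


def write_failed_path_log(
    invalid_path_messages: List[str], logs_directory: Path, job_id: int
) -> Optional[Path]:
    """Writes the collected invalid-path messages to a per-job, user-facing log file.

    :return: Path of the written log file, or None if the file could not be written.
    """
    log_path = logs_directory / "user" / f"failed_paths_{job_id}.txt"
    try:
        log_path.parent.mkdir(parents=True, exist_ok=True)
        log_path.write_text("\n".join(invalid_path_messages) + "\n", encoding="utf-8")
    except OSError:
        logger.exception("Failed to write the failed-path log for job %s.", job_id)
        return None
    return log_path

The scheduler would then reference the returned path in the job's FAILED status message, falling back to the scheduler log when None is returned.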

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant S as Scheduler
  participant FS as _process_fs_input_paths
  participant L as write_failed_path_log
  participant J as JobStore

  rect rgba(230,245,255,0.6)
  note over S: FS input handling (updated)
  S->>FS: Validate and buffer FS paths
  FS-->>S: invalid_path_messages (List[str])
  alt invalid_path_messages not empty
    S->>L: Write failed paths log (messages, logs_dir, job_id)
    L-->>S: log_path or None
    S->>J: Update job -> FAILED with status_msg (log_path or scheduler log)
    S->>J: Commit
    S-->>S: Continue to next job
  else no invalid messages
    S->>S: Flush buffer, collect tasks/partitions
    note over S: Proceed with scheduling flow
  end
  end
sequenceDiagram
  autonumber
  participant S as Scheduler
  participant S3 as _process_s3_input
  participant J as JobStore

  rect rgba(255,240,230,0.6)
  note over S: S3 input handling (broadened)
  S->>S3: Process S3 input
  alt success
    S-->>S: Continue normal scheduling
  else exception
    S3-->>S: throw Error
    S->>J: Update job -> FAILED with "S3 Failure: {err}"
    S->>J: Commit
    S-->>S: Continue to next job
  end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Ruff (0.12.2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

ruff failed
Cause: Failed to load configuration /ruff.toml
Cause: Failed to parse /ruff.toml
Cause: TOML parse error at line 26, column 3
   |
26 |   "RSE100", # Use of assert detected
   |   ^^^^^^^^
Unknown rule selector: RSE100




Pre-merge checks

✅ Passed checks (3 passed)
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Description Check: Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: Passed. The title clearly summarizes the primary change: compression jobs encountering invalid input paths are failed and the failure is reported to the user, and it references the related issue number. It matches the changes in compression_scheduler that collect invalid-path messages, write a per-job failure log, and update the job status to FAILED with a status message. The title is specific, concise, and suitable for a teammate scanning commit history.

quinntaylormitchell (Collaborator, Author) commented:

@coderabbitai review

coderabbitai bot (Contributor) commented Jul 25, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

quinntaylormitchell marked this pull request as ready for review July 25, 2025 19:17
quinntaylormitchell requested a review from a team as a code owner July 25, 2025 19:17
haiqi96 (Contributor) commented Jul 28, 2025

Quick question: what would appear on the WebUI if a job partially fails?

Edit: I guess this is irrelevant now, since a job will either pass or fail.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20aef3e and cfd13a9.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (4 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: haiqi96
PR: y-scope/clp#0
File: :0-0
Timestamp: 2025-07-29T14:04:13.769Z
Learning: User haiqi96 requested creating a GitHub issue to document a bug fix from PR #1136, which addressed MySQL compatibility issues with invalid SQL CAST operations in the WebUI component.
Learnt from: gibber9809
PR: y-scope/clp#504
File: components/core/src/clp_s/search/kql/CMakeLists.txt:29-29
Timestamp: 2024-10-22T15:36:04.655Z
Learning: When reviewing pull requests, focus on the changes within the PR and avoid commenting on issues outside the scope of the PR.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
📚 Learning: in the clp-package compression flow, path validation and error handling is performed at the schedule...
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: in `fs_compression_task.py`, when refactoring the archive processing loop in python, the `src_archiv...
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (4)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (4)

100-100: Inconsistency between PR objectives and implementation.

The PR objectives state the goal is to "report all partial failures during otherwise-successful compression jobs" and mention adding an append parameter to update_compression_job_metadata() to concatenate error messages. However, the implementation now fails entire jobs when any path is invalid, contradicting the stated objective of allowing partial failures.

Likely an incorrect or invalid review comment.


127-127: Consistent error handling but may be overly aggressive.

Similar to line 112, this raises a ValueError for invalid internal paths within directories, causing the entire job to fail. While consistent with the top-level path handling, this approach prevents processing any valid paths when even one invalid path is encountered.

⛔ Skipped due to learnings
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/clp-s.cpp:256-258
Timestamp: 2024-10-03T14:47:22.293Z
Learning: Error handling for file path validation is performed in the `JsonParser` constructor.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/clp-s.cpp:256-258
Timestamp: 2024-10-08T15:52:50.753Z
Learning: Error handling for file path validation is performed in the `JsonParser` constructor.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:654-656
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `parse_kv_log_event` function, catch-all exception handlers are acceptable because exceptions are intentionally rethrown to be handled gracefully at a higher level.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:654-656
Timestamp: 2024-10-07T21:18:02.510Z
Learning: In the `parse_kv_log_event` function, catch-all exception handlers are acceptable because exceptions are intentionally rethrown to be handled gracefully at a higher level.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

210-223: Verify partial-failure behaviour against PR objectives and issue #308

The current logic aborts the entire compression job on any invalid input path, but the PR’s stated goal is to record individual path failures and allow the overall job to succeed. We need to confirm the intended semantics before merging.

Please:

  • Review components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (lines 210–223) to ensure invalid paths are logged and marked as failed, without terminating the whole job.
  • Consult the CLP issue #308 (internal tracker) or related PR documentation to clarify whether “partial success” behaviour was specified.
  • Update the error-handling logic so that:
    • Each bad path is reported in job metadata.
    • The job status reflects a partial-success state rather than outright failure, if that aligns with the intended fix.

112-112: [file content truncated for brevity]
Let's focus on the function that handles compression scheduling. We'll search for def schedule_compression or similar.

quinntaylormitchell changed the title from "fix: Report all partial failures during otherwise-successful compression jobs (fixes #308)" to "fix: Immediately fail any compression job that encounters an invalid input path (fixes #308)" on Jul 31, 2025
Comment on lines 110 to 112

         except ValueError as ex:
             logger.error(str(ex))
-            continue
+            raise ValueError(str(ex))
A reviewer (Contributor) commented:

We are duplicating the try-catch in both this method and its caller, which is unnecessary. In addition, if we are going to log an exception, we can use logger.exception, which will print a fancy stack trace for you.

@kirkrodrigues
I'm not an expert on error handling, so I want to get your opinion. I feel it's not good practice to catch an error and re-raise it. We could go either way:

  1. Remove the try-except block in _process_fs_input_paths so it will throw the exception, and let the caller handle it with try-except. What Quinn currently has in the caller code should work.
  2. Update _process_fs_input_paths to run try-except but return True/False to indicate success or failure.

I feel we should go with 2, but 1 isn't that bad either, imo.
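For reference, a self-contained sketch of option 2 follows; the path-existence check stands in for the real validate_path_and_get_info logic, and the argument shape is simplified, so treat this as an illustration of the success-flag pattern rather than the actual function:

import logging
from pathlib import Path
from typing import Iterable

logger = logging.getLogger(__name__)


def _process_fs_input_paths(paths_to_compress: Iterable[str]) -> bool:
    """Validates each input path, logging failures instead of re-raising, and
    returns False if any path was invalid."""
    all_paths_valid = True
    for raw_path in paths_to_compress:
        path = Path(raw_path)
        if not path.exists():  # stand-in for the real validation
            logger.error("Failed to process path: %s", path)
            all_paths_valid = False
            continue
        # ...buffer the validated path for compression here...
    return all_paths_valid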

haiqi96 (Contributor) commented Aug 1, 2025:

Never mind; now I realize we also need to report the path that causes the error, which gives a good reason to re-raise with a different error message.

Can you raise an error with a message saying something like "blabla file is invalid blabla" and store that as our status message? (It should be within the 512-char limit if we only report the first file that causes the error.)

@kirkrodrigues, do you see any better way to propagate the error message? Alternatively, the method could return a tuple of [bool, Optional[str]] (success_flag, failure_msg), but I don't feel that is any better.

quinntaylormitchell (Collaborator, Author) commented:
I've changed it so that the first invalid file path is now reported; @kirkrodrigues let us know if you see a better way to propagate the error message.

A reviewer (Contributor) commented:

Now that I think about it again, we don't really need to catch in _process_fs_input_paths since we are just rethrowing the error. We can simply let the caller at line 211 catch the exception (and I think you don't even need to change the caller's try-except).

I will ask Kirk to confirm.

…ess/compression_scheduler.py

Co-authored-by: haiqi96 <14502009+haiqi96@users.noreply.github.com>

quinntaylormitchell (Collaborator, Author) commented:

@haiqi96 and @kirkrodrigues, here are my thoughts on what I should change re. catching the error:

I could omit logger.error(str(ex)) from _process_fs_input_paths, and then it would be logged only once (at the caller around line 211)

OR

I could omit logger.exception(error_msg) (at the caller around line 211), and then it would still be logged only once (in _process_fs_input_paths).

Not sure about the whole "double-raising" thing; I don't know how else to pass the error message from validate_path_and_get_info all the way to the caller around line 211.

kirkrodrigues (Member) commented:

Sorry, just getting to this now. I feel failing on the first invalid path is a bit harsh since a user might be compressing a directory with files that change or get deleted (e.g., files in /var/log during log rotation).

Ideally, we should report a partial error with the job and what paths failed, so that the user could retry just those paths if they like (and maybe one day we can even support retrying the files at the click of a button). However, before then, I think a reasonable compromise is to log the failing paths and inform the user to check the logs for which paths failed. What do y'all think?

haiqi96 (Contributor) commented Aug 5, 2025:

Sorry, just getting to this now. I feel failing on the first invalid path is a bit harsh since a user might be compressing a directory with files that change or get deleted (e.g., files in /var/log during log rotation).

Ideally, we should report a partial error with the job and what paths failed, so that the user could retry just those paths if they like (and maybe one day we can even support retrying the files at the click of a button). However, before then, I think a reasonable compromise is to log the failing paths and inform the user to check the logs for which paths failed. What do y'all think?


I am not against it. In that case, the _process_fs_input_paths method needs to somehow let the caller know which paths have failed. I have the following proposal:

  1. _process_fs_input_paths() shall return a List[str] of failing paths; an empty list means no failure. Alternatively, it can return a Tuple of (success, failing_paths): [bool, List[str]].
  2. The caller checks whether _process_fs_input_paths succeeded. If there are any failing paths:
    • It will locally generate a log file, where each line of the log file is a failing path.
    • It shall mark the job as failed without proceeding, with a short error message stating that (1) the job failed because of invalid input path(s) and (2) the user should check the generated log file on disk.

@kirkrodrigues any comment?
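To make the proposal concrete, here is a rough caller-side sketch; the helper name, logs directory, and message wording are all illustrative rather than the scheduler's actual code:

import logging
from pathlib import Path
from typing import List

logger = logging.getLogger(__name__)


def report_invalid_paths(invalid_paths: List[str], logs_dir: Path, job_id: int) -> str:
    """Writes one failing path per line to a per-job log file and returns a short
    status message that points the user at that file."""
    log_path = logs_dir / f"failed_paths_{job_id}.txt"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    log_path.write_text("\n".join(invalid_paths) + "\n", encoding="utf-8")
    return f"Job failed due to invalid input path(s); see {log_path} for the full list."

The caller would store the returned message as the job's status message (within the 512-character limit), mark the job FAILED, commit, and skip scheduling tasks for that job.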

kirkrodrigues (Member) replied:

Sure

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)

267-278: Revise redundant/ambiguous failure branch when no tasks are created.

Given early failure on invalid paths, this branch likely means “no compressible inputs found” (e.g., only empty dirs). Update the message for clarity, or remove the branch if it’s truly unreachable.

-                    "status_msg": "invalid input path",
+                    "status_msg": "No compressible inputs found",
♻️ Duplicate comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

114-119: Avoid catch-and-rethrow; preserve original traceback and remove duplicate logging.

Catching ValueError, logging, and re-raising a new ValueError loses the original traceback and duplicates logs. Let the original exception propagate; the caller already logs with logger.exception. This mirrors prior feedback.

-        except ValueError as ex:
-            logger.error(str(ex))
-            raise ValueError(str(ex))
+        except ValueError:
+            # Let caller log and handle; preserve original traceback.
+            raise

127-134: Same as above for internal paths.

Propagate the original ValueError without re-wrapping and avoid duplicate logging.

-                except ValueError as ex:
-                    logger.error(str(ex))
-                    raise ValueError(str(ex))
+                except ValueError:
+                    raise
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d60b023 and f8cfae2.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)
  • CompressionJobStatus (27-32)

quinntaylormitchell changed the title from "fix: Immediately fail any compression job that encounters an invalid input path (fixes #308)" to "fix: For compression jobs that encounter at least one invalid input path, report path invalidity to user (fixes #308)" on Sep 2, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

128-137: Also collect invalid internal paths from directory traversal.

Same reasoning as above; don't just flip a flag.

Apply:

-                except ValueError as ex:
-                    logger.error(str(ex))
-                    paths_ok = False
+                except ValueError as ex:
+                    logger.error(str(ex))
+                    invalid_paths.append(str(internal_path))
                     continue

241-254: Bound status_msg to column size; preserve details.

Truncate to avoid DB errors and keep the salient part.

Apply:

-                update_compression_job_metadata(
+                update_compression_job_metadata(
                     db_cursor,
                     job_id,
                     {
                         "status": CompressionJobStatus.FAILED,
-                        "status_msg": f"S3 Failure: {err}",
+                        "status_msg": f"S3 Failure: {err}"[:512],
                     },
                 )
♻️ Duplicate comments (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)

224-241: Fail the job early with a per-job invalid-paths log; don't proceed to schedule tasks.

Current code only sets a note and continues, which contradicts the PR objective and the reviewers’ consensus to fail on invalid inputs and point users to a generated log.

Apply:

-        if input_type == InputType.FS.value:
-            paths_ok = _process_fs_input_paths(input_config, paths_to_compress_buffer)
-            if not paths_ok:
-                error_msg = (
-                    "NOTE: At least one input path could not be processed. "
-                    "Check the compression_scheduler log for more details."
-                )
-                update_compression_job_metadata(
-                    db_cursor,
-                    job_id,
-                    {
-                        "status_msg": error_msg,
-                    },
-                )
-                db_conn.commit()
+        if input_type == InputType.FS.value:
+            invalid_paths = _process_fs_input_paths(input_config, paths_to_compress_buffer)
+            if invalid_paths:
+                logs_dir = Path(os.getenv("CLP_LOGS_DIR", "."))
+                invalid_log = logs_dir / f"compression_job_{job_id}_invalid_paths.txt"
+                try:
+                    invalid_log.write_text("\n".join(invalid_paths), encoding="utf-8")
+                except Exception:
+                    logger.exception("Failed to write invalid paths log")
+                error_msg = f"Invalid input path(s). See {invalid_log.name}."
+                update_compression_job_metadata(
+                    db_cursor,
+                    job_id,
+                    {
+                        "status": CompressionJobStatus.FAILED,
+                        "status_msg": error_msg[:512],
+                    },
+                )
+                db_conn.commit()
+                continue
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f8cfae2 and d12fd3c.

📒 Files selected for processing (2)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (1 hunks)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (4 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)
components/clp-py-utils/clp_py_utils/compression.py (2)
  • validate_path_and_get_info (94-128)
  • add_file (20-25)
components/job-orchestration/job_orchestration/scheduler/compress/partition.py (2)
  • add_file (52-57)
  • add_empty_directory (59-62)
🔇 Additional comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

272-281: Message clarity LGTM; keep as a safety net.

“All provided paths were invalid.” is clear. If you adopt early-fail on first invalid detection, this block still protects against edge cases where none yielded tasks but no explicit invalids were captured.


144-145: Return the list of invalid paths.

Apply:

-    return paths_ok
+    return invalid_paths

Likely an incorrect or invalid review comment.

quinntaylormitchell (Collaborator, Author) commented:

@haiqi96 and @kirkrodrigues, I’ve pushed a new implementation. The core elements of the new strategy are a combination of your proposals, as follows:

  1. Don’t fail compression jobs upon encountering an invalid path; complete the compression job if there is at least one valid path
  2. Log the failing paths somewhere
  3. Inform the user to check the logs for which paths failed

That being said, I felt that part of the solution that Haiqi proposed, i.e.

_process_fs_input_paths() shall return a List[str] of failing path

could instead be fulfilled by the logging infrastructure already used in this function (the failed path is already being recorded to the compression_scheduler log, which can be viewed by the user). With that in mind, it makes more sense to me to return bool from _process_fs_input_paths; if we are simply reporting to the user that at least one path failed, and telling them to check the logs for the failed paths, then the caller of _process_fs_input_paths doesn’t need to know the path(s) that failed, as the failing path(s) have already been logged in a way that the user can access.

Let me know what you think.

haiqi96 (Contributor) commented Sep 2, 2025:


I don't agree with the approach of logging the paths in the compression_scheduler log.

  1. It's easier to point the user to a specific path than to say "check the compression_scheduler log", since most users (and even some people on our team who don't use the package frequently) don't know where the compression_scheduler's log is.
  2. If the compression scheduler has been running for a while and has logged many different things, it would be hard for the user to find the exact portion of the logs that they should read.

Because of the points above, it makes sense to generate a separate log file for clear/efficient error reporting.

quinntaylormitchell (Collaborator, Author) commented:

most users (and even some people on our team who don't use the package frequently) don't know where the compression_scheduler's log is

@haiqi96 that's true. But you also mention:

It's easier to point the user to a specific path

... I could rig it so that it prints out the specific path of the compression_scheduler log for the user?

As for this point:

it would be hard for the user to find the exact portion of the logs that they should read

that's fair.

@haiqi96, I guess my objection to your proposal that "_process_fs_input_paths() shall return a List[str]" lies in the fact that the paths are already being logged in the compression_scheduler logs; my approach just directs the user there.

Also, see my comment re. the "immediate failure" vs "continue compression" discussion.

quinntaylormitchell changed the title from "fix: For compression jobs that encounter at least one invalid input path, report path invalidity to user (fixes #308)" to "fix: For compression jobs that encounter at least one invalid input path, fail job and report failure to user (fixes #308)" on Sep 15, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9f9edda and 0bc05f5.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (6 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-25T06:29:59.610Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1261
File: .github/workflows/clp-core-build.yaml:294-332
Timestamp: 2025-08-25T06:29:59.610Z
Learning: In the CLP project, Bill-hbrhbr prefers a "fail fast" approach for CI workflows - allowing potential command availability issues (like getconf in musllinux) to surface through CI failures rather than preemptively adding fallback logic, as they will fix issues when they occur.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)
components/clp-py-utils/clp_py_utils/compression.py (2)
  • validate_path_and_get_info (94-128)
  • add_file (20-25)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)
  • CompressionJobStatus (27-32)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: package-image
🔇 Additional comments (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)

121-126: Confirm whether validate_path_and_get_info already logs; avoid duplicate error entries.

If validate_path_and_get_info doesn’t log, keeping logger.error(error_msg) here is correct. If it does, this will double-log per invalid path. Please confirm and adjust.

Also applies to: 141-145

Comment on lines +324 to +335

# if len(tasks) == 0:
#     logger.warning(f"No tasks were created for job {job_id}")
#     update_compression_job_metadata(
#         db_cursor,
#         job_id,
#         {
#             "status": CompressionJobStatus.FAILED,
#             "status_msg": "All provided paths were invalid.",
#         },
#     )
#     db_conn.commit()
#     continue
quinntaylormitchell (Collaborator, Author) commented Sep 15, 2025:

@kirkrodrigues do you think this commented-out section can be deleted? I have confirmed (as best I can) that this if statement will only be entered if all paths are invalid (there are no other conditions that would cause len(tasks) to be 0).
