Skip to content

Conversation

Bill-hbrhbr
Copy link
Contributor

@Bill-hbrhbr Bill-hbrhbr commented Apr 11, 2025

Description

Moves the archive metadata update logic out of clp-s and into the compression task worker. Specifically, in clp-s, this PR:

  • Removes the command line arguments and parsing for anything metadata-db-related in clp-s.
  • Removes metadata-db code from ArchiveWriter and JsonParser
  • Removes database related source files from the clp-s target

This is the first step outlined in the dataset feature implementation plan.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Verified correctness by inspecting the SQL database in clp container.

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features

    • Archive compression now provides enhanced statistics by including begin and end timestamps.
    • Automatic metadata updates during compression improve archival tracking.
  • Refactor

    • Legacy metadata handling and associated configuration options have been removed, streamlining the command-line interface and overall processing.
    • Archive writer no longer relies on external metadata database connections, simplifying initialization and usage.
    • Archive metadata database integration has been fully removed from the compression workflow and configuration.

@Bill-hbrhbr Bill-hbrhbr requested review from gibber9809, wraymo and a team as code owners April 11, 2025 21:45
Copy link
Contributor

coderabbitai bot commented Apr 11, 2025

Walkthrough

This pull request removes the dependency on a MySQL metadata database from several C++ components while enhancing archive statistics reporting and updating archive metadata through the Python compression task. The changes include removing the metadata update functions, constructors, and related member variables from the ArchiveWriter class and associated files, along with eliminating command-line options and build configurations related to metadata DB handling. Additionally, a new constant is added to the Python configuration, and a new function in the compression task now inserts archive metadata into the database.

Changes

File(s) Change Summary
components/clp-py-utils/clp_py_utils/clp_config.py Added constant ARCHIVES_TABLE_SUFFIX = "archives" for naming archives.
components/core/src/clp_s/ArchiveWriter.cpp,
components/core/src/clp_s/ArchiveWriter.hpp
Removed update_metadata_db method and its invocations; replaced constructor taking a metadata DB parameter with a default constructor; removed member variable m_metadata_db; added JSON fields (BeginTimestamp and EndTimestamp) in print_archive_stats; updated method signatures and included new constants header.
components/core/src/clp_s/JsonParser.cpp,
components/core/src/clp_s/JsonParser.hpp
Modified instantiation of ArchiveWriter to use the default constructor; removed references to GlobalMySQLMetadataDB and associated member variable from JsonParserOption.
components/job-orchestration/job_orchestration/executor/compress/compression_task.py Introduced update_archive_metadata function to insert archive metadata into the clp_archives table; updated run_clp to call this function; removed db_config_file_path parameter from make_clp_s_command_and_env and related calls; adjusted control flow to update archive metadata during compression.
components/core/src/clp_s/CommandLineArguments.cpp,
components/core/src/clp_s/CommandLineArguments.hpp
Removed command-line option and internal handling for metadata DB configuration; eliminated usage of GlobalMetadataDBConfig and corresponding member variable and method.
components/core/src/clp_s/clp-s.cpp Removed code initializing metadata DB configuration within the compress function.
components/core/src/clp_s/CMakeLists.txt Removed multiple source files related to database utilities and MySQL metadata DB functionality; added header file for streaming archive constants.

Sequence Diagram(s)

sequenceDiagram
    participant Runner as run_clp
    participant ArchUpd as update_archive_metadata
    participant DB as Database Cursor

    Runner->>ArchUpd: Call update_archive_metadata(db_cursor, table_prefix, archive_stats)
    ArchUpd->>DB: Execute SQL INSERT into clp_archives
    DB-->>ArchUpd: Return execution result
    ArchUpd-->>Runner: Return control
Loading

Suggested reviewers

  • gibber9809

Tip

⚡💬 Agentic Chat (Pro Plan, General Availability)
  • We're introducing multi-step agentic chat in review comments and issue comments, within and outside of PR's. This feature enhances review and issue discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments and add commits to existing pull requests.

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03d2b51 and a951bda.

📒 Files selected for processing (1)
  • components/core/src/clp_s/ArchiveWriter.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp_s/ArchiveWriter.cpp
⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build (macos-latest)

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@Bill-hbrhbr Bill-hbrhbr changed the title refactor(clp-s): Remove archive metadata update and add it to compression worker task, refactor(clp-s): Remove archive metadata update and add it to compression worker task. Apr 11, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)

13-13: Recommend using the new constant in the insert query.

You introduced ARCHIVES_TABLE_NAME but still use a literal "clp_archives" in the SQL statement. To maintain consistency, replace literal references with the constant.

- query = f"INSERT INTO clp_archives ({keys}) VALUES ({values})"
+ query = f"INSERT INTO {ARCHIVES_TABLE_NAME} ({keys}) VALUES ({values})"

366-366: Consider handling exceptions in case of metadata update failures.

While calling update_archive_metadata(db_cursor, last_archive_stats) largely appears fine, an exception could disrupt the entire compression run. Including error handling and logging here would improve resilience.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b05781 and 096b81b.

📒 Files selected for processing (6)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (1 hunks)
  • components/core/src/clp_s/ArchiveWriter.cpp (1 hunks)
  • components/core/src/clp_s/ArchiveWriter.hpp (1 hunks)
  • components/core/src/clp_s/JsonParser.cpp (1 hunks)
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/ArchiveWriter.cpp
  • components/core/src/clp_s/ArchiveWriter.hpp
⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (7)
components/clp-py-utils/clp_py_utils/clp_config.py (1)

39-39: New constant added for archives table name.

The addition of ARCHIVES_TABLE_NAME follows the project's naming convention for database table constants and is appropriately placed with other table name constants. This will be used by the compression worker task to update archive metadata as mentioned in the PR objectives.

components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)

64-64: Added tracking of previous timestamp for compression reporting.

The new variable last_current_time will help track the previous timestamp for accurate compression progress reporting.


83-86: Improved compression status reporting logic.

Added conditional logic to use either the current time or the previous time based on whether the data size has increased, ensuring more accurate speed calculations when reporting compression status.


105-105: Added timestamp update at the end of the loop.

Updates the tracked timestamp at the end of each loop iteration to maintain accurate timing information.

components/core/src/clp_s/JsonParser.cpp (1)

130-130: Removed metadata DB parameter from ArchiveWriter instantiation.

This change removes the metadata database parameter when creating the ArchiveWriter, aligning with the PR objective to remove archive metadata update functionality from this component and move it to the compression worker task.

components/core/src/clp_s/ArchiveWriter.cpp (1)

432-433: Enhanced archive statistics with timestamp information.

Added begin and end timestamps to the archive statistics output, providing valuable temporal context about the archive's data. This will help with operations and debugging by clearly showing the time range covered by the archive.

components/core/src/clp_s/ArchiveWriter.hpp (1)

68-68: Ensure no references rely on the removed metadata constructor.

Removing the DB-specific constructor reflects the new design that no longer updates archive metadata within ArchiveWriter. This is acceptable, provided that no remaining code depends on the old constructor.

@Bill-hbrhbr Bill-hbrhbr changed the title refactor(clp-s): Remove archive metadata update and add it to compression worker task. refactor(compression): Move archive metadata update from clp-s to compression task workers. Apr 11, 2025
@Bill-hbrhbr Bill-hbrhbr changed the title refactor(compression): Move archive metadata update from clp-s to compression task workers. refactor(package): Move archive metadata update from clp-s to compression task workers. Apr 11, 2025
@kirkrodrigues kirkrodrigues self-requested a review April 12, 2025 08:49
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)

225-236: Consider removing unnecessary parameter from function signature.

The clp_metadata_db_connection_config parameter is no longer passed to make_clp_s_command_and_env but remains in the function signature of run_clp. Consider removing it if it's only used for the CLP storage engine.

def run_clp(
    worker_config: WorkerConfig,
    clp_config: ClpIoConfig,
    clp_home: pathlib.Path,
    logs_dir: pathlib.Path,
    job_id: int,
    task_id: int,
    tag_ids,
    paths_to_compress: PathsToCompress,
    sql_adapter: SQL_Adapter,
-    clp_metadata_db_connection_config,
):

And update the corresponding calls to this function as needed.


86-101: Consider adding error handling for SQL operations.

The update_archive_metadata function lacks error handling for SQL-related exceptions. Adding a try-except block would allow for more graceful error handling and clearer error messages.

def update_archive_metadata(db_cursor, table_prefix, archive_stats):
    archive_stats_defaults = {
        "begin_timestamp": 0,
        "end_timestamp": 0,
        "creator_id": "",
        "creation_ix": 0,
    }
    for k, v in archive_stats_defaults.items():
        archive_stats.setdefault(k, v)
    keys = ", ".join(archive_stats.keys())
    value_placeholders = ", ".join(["%s"] * len(archive_stats))
    query = (
        f"INSERT INTO {table_prefix}{ARCHIVES_TABLE_SUFFIX} ({keys}) VALUES ({value_placeholders})"
    )
+    try:
        db_cursor.execute(query, list(archive_stats.values()))
+    except Exception as e:
+        logger.error(f"Failed to update archive metadata: {e}")
+        raise
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 80b2956 and 0452577.

📒 Files selected for processing (3)
  • components/clp-py-utils/clp_py_utils/clp_config.py (1 hunks)
  • components/core/src/clp_s/CMakeLists.txt (0 hunks)
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py (3 hunks)
💤 Files with no reviewable changes (1)
  • components/core/src/clp_s/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/clp-py-utils/clp_py_utils/clp_config.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • StorageEngine (48-50)
⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (4)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (4)

13-13: Good use of constant instead of hardcoded string.

Using the ARCHIVES_TABLE_SUFFIX constant from clp_py_utils.clp_config instead of hardcoding the table suffix ensures consistency across the codebase and makes future changes easier to maintain.


86-101: Security improvement: Well-implemented parameter substitution for SQL queries.

The implementation of update_archive_metadata properly addresses the SQL injection concern raised in previous reviews. The function:

  1. Sets sensible defaults for required fields
  2. Uses parameter substitution with placeholders instead of string interpolation
  3. Correctly uses the table prefix and constant suffix for table name construction

This is a secure way to handle SQL queries and aligns with database best practices.


178-183: Appropriate signature update to remove database dependency.

Removing the db_config_file_path parameter from make_clp_s_command_and_env and replacing it with use_single_file_archive aligns with the PR objective of moving archive metadata update from clp-s to compression task workers.


364-367: Clean implementation of metadata update logic.

The code correctly extracts the table prefix once and uses it consistently. The conditional update of archive metadata only when using the CLP_S storage engine aligns with the PR objective.

@Bill-hbrhbr Bill-hbrhbr changed the title refactor(package): Move archive metadata update from clp-s to compression task workers. refactor(clp-package): Move archive metadata update from clp-s to compression task workers. Apr 14, 2025
Copy link
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good to me -- I'm fine with deleting all of the metadata DB code until we start using it again. Just a small comment for print_archive_stats.

Comment on lines 429 to 436
void ArchiveWriter::print_archive_stats() {
nlohmann::json json_msg;
json_msg["id"] = m_id;
json_msg["begin_timestamp"] = m_timestamp_dict.get_begin_timestamp();
json_msg["end_timestamp"] = m_timestamp_dict.get_end_timestamp();
json_msg["uncompressed_size"] = m_uncompressed_size;
json_msg["size"] = m_compressed_size;
std::cout << json_msg.dump(-1, ' ', true, nlohmann::json::error_handler_t::ignore) << std::endl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor things while we're touching this method.

  1. Could we mark print_archive_stats as const?
  2. Could we replace these string literals with some constexpr std::string_view that live in ArchiveWriter.hpp?

@Bill-hbrhbr Bill-hbrhbr requested a review from gibber9809 April 15, 2025 13:01
Copy link
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clp-s part LGTM

@Bill-hbrhbr Bill-hbrhbr merged commit eeb0147 into y-scope:main Apr 15, 2025
21 checks passed
@Bill-hbrhbr Bill-hbrhbr deleted the clp-s-move-metadata-update branch April 15, 2025 15:07
anlowee pushed a commit to anlowee/clp that referenced this pull request Apr 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants