-
Notifications
You must be signed in to change notification settings - Fork 81
refactor(clp-package): Move archive metadata update from clp-s
to compression task workers.
#819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(clp-package): Move archive metadata update from clp-s
to compression task workers.
#819
Conversation
WalkthroughThis pull request removes the dependency on a MySQL metadata database from several C++ components while enhancing archive statistics reporting and updating archive metadata through the Python compression task. The changes include removing the metadata update functions, constructors, and related member variables from the ArchiveWriter class and associated files, along with eliminating command-line options and build configurations related to metadata DB handling. Additionally, a new constant is added to the Python configuration, and a new function in the compression task now inserts archive metadata into the database. Changes
Sequence Diagram(s)sequenceDiagram
participant Runner as run_clp
participant ArchUpd as update_archive_metadata
participant DB as Database Cursor
Runner->>ArchUpd: Call update_archive_metadata(db_cursor, table_prefix, archive_stats)
ArchUpd->>DB: Execute SQL INSERT into clp_archives
DB-->>ArchUpd: Return execution result
ArchUpd-->>Runner: Return control
Suggested reviewers
Tip ⚡💬 Agentic Chat (Pro Plan, General Availability)
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (11)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)
13-13
: Recommend using the new constant in the insert query.You introduced
ARCHIVES_TABLE_NAME
but still use a literal"clp_archives"
in the SQL statement. To maintain consistency, replace literal references with the constant.- query = f"INSERT INTO clp_archives ({keys}) VALUES ({values})" + query = f"INSERT INTO {ARCHIVES_TABLE_NAME} ({keys}) VALUES ({values})"
366-366
: Consider handling exceptions in case of metadata update failures.While calling
update_archive_metadata(db_cursor, last_archive_stats)
largely appears fine, an exception could disrupt the entire compression run. Including error handling and logging here would improve resilience.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
(3 hunks)components/clp-py-utils/clp_py_utils/clp_config.py
(1 hunks)components/core/src/clp_s/ArchiveWriter.cpp
(1 hunks)components/core/src/clp_s/ArchiveWriter.hpp
(1 hunks)components/core/src/clp_s/JsonParser.cpp
(1 hunks)components/job-orchestration/job_orchestration/executor/compress/compression_task.py
(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
components/core/src/clp_s/JsonParser.cpp
components/core/src/clp_s/ArchiveWriter.cpp
components/core/src/clp_s/ArchiveWriter.hpp
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (7)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
39-39
: New constant added for archives table name.The addition of
ARCHIVES_TABLE_NAME
follows the project's naming convention for database table constants and is appropriately placed with other table name constants. This will be used by the compression worker task to update archive metadata as mentioned in the PR objectives.components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)
64-64
: Added tracking of previous timestamp for compression reporting.The new variable
last_current_time
will help track the previous timestamp for accurate compression progress reporting.
83-86
: Improved compression status reporting logic.Added conditional logic to use either the current time or the previous time based on whether the data size has increased, ensuring more accurate speed calculations when reporting compression status.
105-105
: Added timestamp update at the end of the loop.Updates the tracked timestamp at the end of each loop iteration to maintain accurate timing information.
components/core/src/clp_s/JsonParser.cpp (1)
130-130
: Removed metadata DB parameter from ArchiveWriter instantiation.This change removes the metadata database parameter when creating the ArchiveWriter, aligning with the PR objective to remove archive metadata update functionality from this component and move it to the compression worker task.
components/core/src/clp_s/ArchiveWriter.cpp (1)
432-433
: Enhanced archive statistics with timestamp information.Added begin and end timestamps to the archive statistics output, providing valuable temporal context about the archive's data. This will help with operations and debugging by clearly showing the time range covered by the archive.
components/core/src/clp_s/ArchiveWriter.hpp (1)
68-68
: Ensure no references rely on the removed metadata constructor.Removing the DB-specific constructor reflects the new design that no longer updates archive metadata within
ArchiveWriter
. This is acceptable, provided that no remaining code depends on the old constructor.
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
Outdated
Show resolved
Hide resolved
clp-s
to compression task workers.
clp-s
to compression task workers.clp-s
to compression task workers.
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
Show resolved
Hide resolved
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
Outdated
Show resolved
Hide resolved
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)
225-236
: Consider removing unnecessary parameter from function signature.The
clp_metadata_db_connection_config
parameter is no longer passed tomake_clp_s_command_and_env
but remains in the function signature ofrun_clp
. Consider removing it if it's only used for the CLP storage engine.def run_clp( worker_config: WorkerConfig, clp_config: ClpIoConfig, clp_home: pathlib.Path, logs_dir: pathlib.Path, job_id: int, task_id: int, tag_ids, paths_to_compress: PathsToCompress, sql_adapter: SQL_Adapter, - clp_metadata_db_connection_config, ):
And update the corresponding calls to this function as needed.
86-101
: Consider adding error handling for SQL operations.The
update_archive_metadata
function lacks error handling for SQL-related exceptions. Adding a try-except block would allow for more graceful error handling and clearer error messages.def update_archive_metadata(db_cursor, table_prefix, archive_stats): archive_stats_defaults = { "begin_timestamp": 0, "end_timestamp": 0, "creator_id": "", "creation_ix": 0, } for k, v in archive_stats_defaults.items(): archive_stats.setdefault(k, v) keys = ", ".join(archive_stats.keys()) value_placeholders = ", ".join(["%s"] * len(archive_stats)) query = ( f"INSERT INTO {table_prefix}{ARCHIVES_TABLE_SUFFIX} ({keys}) VALUES ({value_placeholders})" ) + try: db_cursor.execute(query, list(archive_stats.values())) + except Exception as e: + logger.error(f"Failed to update archive metadata: {e}") + raise
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
components/clp-py-utils/clp_py_utils/clp_config.py
(1 hunks)components/core/src/clp_s/CMakeLists.txt
(0 hunks)components/job-orchestration/job_orchestration/executor/compress/compression_task.py
(3 hunks)
💤 Files with no reviewable changes (1)
- components/core/src/clp_s/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- components/clp-py-utils/clp_py_utils/clp_config.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
StorageEngine
(48-50)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (4)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (4)
13-13
: Good use of constant instead of hardcoded string.Using the
ARCHIVES_TABLE_SUFFIX
constant fromclp_py_utils.clp_config
instead of hardcoding the table suffix ensures consistency across the codebase and makes future changes easier to maintain.
86-101
: Security improvement: Well-implemented parameter substitution for SQL queries.The implementation of
update_archive_metadata
properly addresses the SQL injection concern raised in previous reviews. The function:
- Sets sensible defaults for required fields
- Uses parameter substitution with placeholders instead of string interpolation
- Correctly uses the table prefix and constant suffix for table name construction
This is a secure way to handle SQL queries and aligns with database best practices.
178-183
: Appropriate signature update to remove database dependency.Removing the
db_config_file_path
parameter frommake_clp_s_command_and_env
and replacing it withuse_single_file_archive
aligns with the PR objective of moving archive metadata update fromclp-s
to compression task workers.
364-367
: Clean implementation of metadata update logic.The code correctly extracts the table prefix once and uses it consistently. The conditional update of archive metadata only when using the CLP_S storage engine aligns with the PR objective.
clp-s
to compression task workers.clp-s
to compression task workers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good to me -- I'm fine with deleting all of the metadata DB code until we start using it again. Just a small comment for print_archive_stats
.
void ArchiveWriter::print_archive_stats() { | ||
nlohmann::json json_msg; | ||
json_msg["id"] = m_id; | ||
json_msg["begin_timestamp"] = m_timestamp_dict.get_begin_timestamp(); | ||
json_msg["end_timestamp"] = m_timestamp_dict.get_end_timestamp(); | ||
json_msg["uncompressed_size"] = m_uncompressed_size; | ||
json_msg["size"] = m_compressed_size; | ||
std::cout << json_msg.dump(-1, ' ', true, nlohmann::json::error_handler_t::ignore) << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor things while we're touching this method.
- Could we mark
print_archive_stats
asconst
? - Could we replace these string literals with some
constexpr std::string_view
that live inArchiveWriter.hpp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clp-s part LGTM
…ompression task workers. (y-scope#819)
Description
Moves the archive metadata update logic out of clp-s and into the compression task worker. Specifically, in
clp-s
, this PR:clp-s
.ArchiveWriter
andJsonParser
clp-s
targetThis is the first step outlined in the dataset feature implementation plan.
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
Summary by CodeRabbit
New Features
Refactor