fix(clp-json): Explicitly specify what fields to update in `update_archive_metadata` (fixes #1039). #1043
Conversation
## Walkthrough
The `update_archive_metadata` function was revised to enforce stricter control over which keys are inserted into the database. Default values are now managed through a dedicated dictionary, and only a fixed set of keys is permitted for insertion, with errors raised if defaults are already present in the input.
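A minimal sketch of the described flow (the default values, the `{table_prefix}archives` table name, and the `%s` placeholder style are assumptions for illustration, not the PR's exact code):

```python
def update_archive_metadata(db_cursor, table_prefix, archive_stats):
    # Defaults are managed through a dedicated dictionary; the real default
    # values aren't shown in this review, so these are illustrative.
    default_creation_stats = {"creation_ix": 0, "creator_id": ""}

    # Raise if clp-s unexpectedly emitted a field this task should default.
    for key in default_creation_stats:
        if key in archive_stats:
            raise ValueError(f"Unexpected key `{key}` in archive_stats")
    archive_stats.update(default_creation_stats)

    # Only this fixed set of keys is permitted for insertion.
    stats_to_update = [
        "begin_timestamp",
        "end_timestamp",
        "id",
        "size",
        "uncompressed_size",
    ]
    stats_to_update.extend(default_creation_stats.keys())

    # Values are extracted in the same fixed order as the key list.
    field_names = ", ".join(stats_to_update)
    placeholders = ", ".join(["%s"] * len(stats_to_update))
    db_cursor.execute(
        f"INSERT INTO {table_prefix}archives ({field_names}) VALUES ({placeholders})",
        [archive_stats[key] for key in stats_to_update],
    )
```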
## Changes
| File(s) | Change Summary |
|-------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
| components/job-orchestration/job_orchestration/executor/compress/compression_task.py | Refactored `update_archive_metadata` to use explicit default keys, raise errors on conflicts, and restrict SQL inserts to a fixed key set. |
## Sequence Diagram(s)
```mermaid
sequenceDiagram
participant Caller
participant CompressionTask
Caller->>CompressionTask: update_archive_metadata(archive_stats, creation_ix, creator_id)
CompressionTask->>CompressionTask: Check for 'creation_ix' and 'creator_id' in archive_stats
alt If present
CompressionTask-->>Caller: Raise error
else Not present
CompressionTask->>CompressionTask: Add 'creation_ix' and 'creator_id' to archive_stats
CompressionTask->>CompressionTask: Prepare fixed key list for insertion
CompressionTask->>CompressionTask: Extract values in fixed order
CompressionTask->>CompressionTask: Execute SQL insert with fixed keys/values
CompressionTask-->>Caller: Return result
end
```

Possibly related issues
Suggested reviewers
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (2)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)
`95-102`: Good implementation of conflict detection for default fields. The explicit validation prevents silent overwrites of fields that should be defaults, which improves data integrity and makes debugging easier when clp-s unexpectedly outputs these fields.
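As a self-contained illustration of that conflict check (the stand-in values below are made up; only the pattern comes from the PR):

```python
archive_stats = {"id": "abc", "size": 1024}  # stand-in clp-s output
default_creation_stats = {"creation_ix": 0, "creator_id": "worker-0"}  # illustrative

# Reject input that already contains a key we intend to default.
for key in default_creation_stats:
    if key in archive_stats:
        raise ValueError(f"clp-s unexpectedly output `{key}`")
archive_stats.update(default_creation_stats)
```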
`113-118`: Approve the parameterised SQL construction. The explicit field ordering and parameterised query construction are secure and maintain consistency between the field names and their values. This approach prevents SQL injection and ensures the correct mapping of values to database columns.
```python
stats_to_update = [
    "begin_timestamp",
    "end_timestamp",
    "id",
    "size",
    "uncompressed_size",
]
stats_to_update.extend(default_creation_stats.keys())
```
🛠️ Refactor suggestion

Consider adding validation for required fields from clp-s.

While explicitly defining the expected fields improves control, the function assumes all keys in `stats_to_update` exist in `archive_stats`. If clp-s fails to output required fields like `begin_timestamp`, `end_timestamp`, `id`, `size`, or `uncompressed_size`, the subsequent database operation will fail with a `KeyError`.

Add validation to ensure required fields exist:
```diff
 stats_to_update = [
     "begin_timestamp",
     "end_timestamp",
     "id",
     "size",
     "uncompressed_size",
 ]
 stats_to_update.extend(default_creation_stats.keys())
+
+# Validate that all required fields from clp-s are present
+required_clp_fields = ["begin_timestamp", "end_timestamp", "id", "size", "uncompressed_size"]
+missing_fields = [field for field in required_clp_fields if field not in archive_stats]
+if missing_fields:
+    raise ValueError(f"Missing required fields from archive stats: {missing_fields}")
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
stats_to_update = [
    "begin_timestamp",
    "end_timestamp",
    "id",
    "size",
    "uncompressed_size",
]
stats_to_update.extend(default_creation_stats.keys())

# Validate that all required fields from clp-s are present
required_clp_fields = ["begin_timestamp", "end_timestamp", "id", "size", "uncompressed_size"]
missing_fields = [field for field in required_clp_fields if field not in archive_stats]
if missing_fields:
    raise ValueError(f"Missing required fields from archive stats: {missing_fields}")
```
🤖 Prompt for AI Agents

In `components/job-orchestration/job_orchestration/executor/compress/compression_task.py` around lines 104 to 111, the code assumes all keys in `stats_to_update` exist in `archive_stats`, which can cause a `KeyError` if any required field is missing. Add validation before using these keys by checking that each required field (`begin_timestamp`, `end_timestamp`, `id`, `size`, `uncompressed_size`) is present in `archive_stats`, and handle missing keys appropriately, such as raising a clear error or providing a default value.
LGTM. I think it's fine to not check that those keys exist like rabbit is suggesting, since part of the intended behaviour of `--print-archive-stats` is for those fields to exist. We currently don't document what fields are guaranteed to exist in the output of `--print-archive-stats`, but that's a separate issue. Granted, I'd also be fine with explicitly checking that the fields exist if we want to be cautious.

I think the PR title is fine by me as well.
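For context, a hypothetical example of what parsed `--print-archive-stats` output might look like, using only the field names this PR relies on (the exact format is undocumented, as noted above, so every value here is made up):

```python
# Hypothetical parsed --print-archive-stats record; field names from this PR.
archive_stats = {
    "begin_timestamp": 1700000000000,  # 0 when no timestamp is found
    "end_timestamp": 1700000060000,
    "id": "0191e9c8-0000-7000-8000-000000000000",
    "size": 1048576,
    "uncompressed_size": 8388608,
}
```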
For the PR title, how about:
fix(clp-json): Explicitly specify what fields to update in `update_archive_metadata` (fixes #1039).
```diff
@@ -92,20 +92,30 @@ def update_job_metadata_and_tags(db_cursor, job_id, table_prefix, tag_ids, archi

 def update_archive_metadata(db_cursor, table_prefix, archive_stats):
```
I think it'd be clearer for future devs if we implement it as follows (roughly sketched below):

- Create `default_creation_stats` with the default values
- Validate the default keys don't exist
- Iterate over `stats_to_update` and add each kv-pair to `default_creation_stats`
- For the rest of the logic, except in the query, use `default_creation_stats`

Obviously, the names wouldn't make sense anymore, so how about renaming:

- `default_creation_stats` -> `stats_to_update`
- `stats_to_update` -> `required_stat_names`
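A rough sketch of that proposal under the suggested renaming (the default values and table name are illustrative, and a DB-API style cursor with `%s` placeholders is assumed):

```python
def update_archive_metadata(db_cursor, table_prefix, archive_stats):
    # After the rename: the dict of stats to update is seeded with defaults.
    stats_to_update = {"creation_ix": 0, "creator_id": ""}  # illustrative defaults

    # Validate the default keys don't exist in clp-s's output.
    for key in stats_to_update:
        if key in archive_stats:
            raise ValueError(f"`{key}` unexpectedly present in archive_stats")

    # After the rename: the fixed list of required stat names.
    required_stat_names = [
        "begin_timestamp",
        "end_timestamp",
        "id",
        "size",
        "uncompressed_size",
    ]
    for name in required_stat_names:
        stats_to_update[name] = archive_stats[name]

    # The rest of the logic works off stats_to_update alone.
    field_names = ", ".join(stats_to_update)
    placeholders = ", ".join(["%s"] * len(stats_to_update))
    db_cursor.execute(
        f"INSERT INTO {table_prefix}archives ({field_names}) VALUES ({placeholders})",
        list(stats_to_update.values()),
    )
```

Relying on dict insertion order (guaranteed in Python 3.7+) keeps the field names and values aligned without maintaining a separate ordered list.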
PR title changed: "… `update_archive_metadata` method. (fixes #1039)" -> "… `update_archive_metadata` (fixes #1039)."
…chive_metadata` (fixes y-scope#1039). (y-scope#1043) Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Description

As described in #1039, the current `update_archive_metadata` relies on clp-s to output the exact fields for the metadata update. However, this assumption is not always true and has been broken by a recent change.

This PR:

- Refactors `update_archive_metadata` so it explicitly specifies the required fields from clp-s's archive_stats.
- Specifies the fields not output by clp-s in `default_fields`. We removed begin_ts and end_ts from the default list since they will be output by clp-s (when a timestamp is not found, those fields will be 0).
- Adds a strict check to ensure the fields to add do not already exist in clp-s's archive stats.

Checklist

- … breaking change.
Validation performed
Ran package with storage_engine = clp-s.
Confirmed that a single input file can be compressed without error.
Summary by CodeRabbit

- Bug Fixes
- Refactor