Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR adds a new --fix-date-format mode to the add_metadata tool that corrects date formats in existing dc.date.issued fields. The mode supports relaxed validation rules that preserve year-only dates (YYYY) and partial dates while normalizing full dates to YYYY-MM-DD format.
Key changes:
- Added
is_year_only(),is_valid_relaxed(), andparse_relaxed()methods to thedateclass for flexible date handling - Introduced
update_existing_metadata_relaxed()method with relaxed validation rules - Added
--fix-date-formatCLI flag that auto-configures the tool to fixdc.date.issuedfield - Enhanced tracking with new info categories:
valid_year_only,null_values, andanomalies
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tools/add_metadata/add_metadata.py | Implements relaxed date parsing, validation methods, and new fix-date-format mode with enhanced tracking |
| tools/add_metadata/README.md | Documents the new fix-date-format mode with usage examples and behavior specifications |
Comments suppressed due to low confidence (1)
tools/add_metadata/add_metadata.py:183
- The function
update_itemreferences a global variabledspace_bethat is not defined at the module level where this function is declared. Thedspace_bevariable is only initialized later in theif __name__ == '__main__'block (line 469). This will cause aNameErrorifupdate_itemis called beforedspace_beis initialized or from a different context. Consider passingdspace_beas a parameter to this function.
if dspace_be.client.update_item(item):
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
tools/add_metadata/add_metadata.py
Outdated
| id_str = f"Item [{uuid}]: [{self._to_mtd_field}]" | ||
|
|
||
| # Check for null/empty values - keep them untouched | ||
| if date_str is None or date_str.strip() == "": |
There was a problem hiding this comment.
This will raise an AttributeError if date_str is None because None doesn't have a strip() method. The condition should be reordered to check for None first: if date_str is None or (date_str and date_str.strip() == \"\") or use if not date_str or date_str.strip() == \"\".
| if date_str is None or date_str.strip() == "": | |
| if not date_str or date_str.strip() == "": |
There was a problem hiding this comment.
If date_str is None evaluates to True, the entire expression is True and .strip() is never called
tools/add_metadata/add_metadata.py
Outdated
| partial_formats = [('%Y-%m', '%Y'), ('%m-%Y', '%Y'), | ||
| ('%Y/%m', '%Y'), ('%m/%Y', '%Y'), | ||
| ('%Y.%m', '%Y'), ('%m.%Y', '%Y')] | ||
| for parse_fmt, output_fmt in partial_formats: |
There was a problem hiding this comment.
The output format parameter '%Y' is repeated for all partial formats. Consider simplifying to partial_formats = ['%Y-%m', '%m-%Y', '%Y/%m', '%m/%Y', '%Y.%m', '%m.%Y'] and always use '%Y' as the output format in the loop.
| partial_formats = [('%Y-%m', '%Y'), ('%m-%Y', '%Y'), | |
| ('%Y/%m', '%Y'), ('%m/%Y', '%Y'), | |
| ('%Y.%m', '%Y'), ('%m.%Y', '%Y')] | |
| for parse_fmt, output_fmt in partial_formats: | |
| partial_formats = ['%Y-%m', '%m-%Y', '%Y/%m', '%m/%Y', '%Y.%m', '%m.%Y'] | |
| for parse_fmt in partial_formats: |
tools/add_metadata/add_metadata.py
Outdated
| ret_empty_meta = 4 | ||
|
|
||
| def __init__(self, dspace_be, from_mtd_fields: list, to_mtd_field: list, dry_run: bool = False): | ||
| def __init__(self, dspace_be, from_mtd_fields: list, to_mtd_field: list, dry_run: bool = False, fix_date_mode: bool = False): |
There was a problem hiding this comment.
We do not need this relaxed mode
…ts while normalizing separators and logging anomalies
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 13 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return updater.ret_invalid_meta | ||
|
|
||
| # Convert date to correct format if necessary | ||
|
|
There was a problem hiding this comment.
Trailing whitespace on line. This should be removed to maintain code cleanliness.
| if date.is_year_only(date_str): | ||
| self._info["valid_year_only"].append((uuid, date_val.input)) | ||
| _logger.info(f"{id_str}: year-only format [{date_str}] - keeping as-is") | ||
| elif len(date_str) == 7 and date_str[4] == '-': # YYYY-MM format |
There was a problem hiding this comment.
Hardcoded check for YYYY-MM format is fragile. This checks for exactly 7 characters with a dash at position 4, which could match invalid dates like "abcd-ef". Consider using a regex pattern or the datetime parsing logic already available to validate the format more robustly.
| uuid = item['uuid'] | ||
| id_str = f"Item [{uuid}]: [{self._to_mtd_field}]" | ||
|
|
||
| # No null/empty handling - let it crash if needed |
There was a problem hiding this comment.
The comment "No null/empty handling - let it crash if needed" indicates intentional lack of defensive programming. However, this could lead to unclear error messages when None or empty values are encountered. Consider at least validating that date_str is not None/empty and providing a clear error message if it is, rather than allowing a generic crash.
| f"Forced metadata change but no value found for [{uuid}]") | ||
| return updater.ret_empty_meta | ||
|
|
||
| # Always use standard validation/parsing (no relaxed mode) |
There was a problem hiding this comment.
The comment "Always use standard validation/parsing (no relaxed mode)" is unclear and adds no value. Either expand it to explain what it means and why it's important, or remove it.
| # Always use standard validation/parsing (no relaxed mode) | |
| # Use strict validation/parsing for metadata values to ensure data integrity. | |
| # Relaxed parsing is intentionally avoided to prevent accepting malformed or ambiguous data. |
| return True | ||
| except ValueError: | ||
| pass | ||
|
|
There was a problem hiding this comment.
Trailing whitespace on line. This should be removed to maintain code cleanliness.
| except ValueError: | ||
| # The test format does not match the input date format | ||
| continue | ||
|
|
There was a problem hiding this comment.
Trailing whitespace on line. This should be removed to maintain code cleanliness.
| """Update existing metadata with hybrid rules: | ||
| - No null/empty handling (will crash on None) | ||
| - YYYY formats kept as-is | ||
| - Partial dates (YYYY-MM) kept as-is (normalized) | ||
| - Invalid dates logged as ANOMALY | ||
| """ |
There was a problem hiding this comment.
Incomplete docstring. The docstring mentions "No null/empty handling (will crash on None)" but doesn't explain the force parameter, the return value, or other important aspects of the method's behavior.
| """Update existing metadata with hybrid rules: | |
| - No null/empty handling (will crash on None) | |
| - YYYY formats kept as-is | |
| - Partial dates (YYYY-MM) kept as-is (normalized) | |
| - Invalid dates logged as ANOMALY | |
| """ | |
| """ | |
| Update existing metadata for an item using hybrid date rules. | |
| Args: | |
| item (dict): The item whose metadata is to be updated. | |
| date_str (str): The date string to validate and update. | |
| force (bool, optional): If True, forces the update even if the date is invalid or not in a preferred format. | |
| If False, only updates if the date is not already valid or in a preferred format. | |
| Behavior: | |
| - No null/empty handling (will crash on None). | |
| - Dates in YYYY format are kept as-is. | |
| - Partial dates (YYYY-MM) are kept as-is (normalized). | |
| - Invalid dates are logged as ANOMALY and not updated unless force=True. | |
| - If force=True, attempts to update the metadata regardless of validity. | |
| Returns: | |
| int: Status code indicating the result: | |
| - updater.ret_already_ok: Metadata was already valid and no update was performed. | |
| - updater.ret_invalid_meta: The date was invalid and could not be updated. | |
| - updater.ret_updated: Metadata was successfully updated. | |
| - updater.ret_failed: An error occurred during the update. | |
| """ |
| self._info["valid"].append((uuid, date_val.input)) | ||
| _logger.info(f"{id_str}: partial date format [{date_str}] - keeping as-is") | ||
| else: | ||
| self._info["valid"].append((uuid, date_val.input)) |
There was a problem hiding this comment.
Inconsistent logging behavior. Year-only and partial date formats log info messages (lines 262, 265), but valid full dates in YYYY-MM-DD format do not (line 267). This creates inconsistent verbosity where some "valid" cases are logged and others are silent, making debugging and monitoring more difficult.
| self._info["valid"].append((uuid, date_val.input)) | |
| self._info["valid"].append((uuid, date_val.input)) | |
| _logger.info(f"{id_str}: full date format [{date_str}] - keeping as-is") |
| pass | ||
|
|
There was a problem hiding this comment.
Trailing whitespace on lines. This should be removed to maintain code cleanliness.
| - Invalid dates logged as ANOMALY | ||
| """ | ||
| uuid = item['uuid'] | ||
| id_str = f"Item [{uuid}]: [{self._to_mtd_field}]" |
There was a problem hiding this comment.
Trailing whitespace on line. This should be removed to maintain code cleanliness.
Problem description
Wrong date formats (e.g., d.m.yyyy, m.yyyy, ...) corrected by new updater option
--fix-date-format.This mode corrects date formats in existing
dc.date.issuedfields without using other metadata fields.Behavior:
YYYY): Kept as-is (e.g.,2020stays2020)YYYY-MM-DD(e.g.,30.5.2025→2025-05-30)5/2025→2025)(Edited the functionality of
add_metadata.pyadded here)