Skip to content

TUL/Script for correcting dc.date.issued formats#239

Open
jr-rk wants to merge 8 commits intomainfrom
tul/wrong-date-format
Open

TUL/Script for correcting dc.date.issued formats#239
jr-rk wants to merge 8 commits intomainfrom
tul/wrong-date-format

Conversation

@jr-rk
Copy link
Contributor

@jr-rk jr-rk commented Nov 3, 2025

Phases MS MM MK JR JM Total
ETA 0 0 0 0 0 0
Developing 0 0 0 0 0 0
Review 0 0 0 0 0 0
Total - - - - - 0
ETA est. 0
ETA cust. - - - - - 0

Problem description

Wrong date formats (e.g., d.m.yyyy, m.yyyy, ...) corrected by new updater option --fix-date-format.
This mode corrects date formats in existing dc.date.issued fields without using other metadata fields.

Behavior:

  • Null/empty values: Kept untouched
  • Year-only values (YYYY): Kept as-is (e.g., 2020 stays 2020)
  • Full dates with wrong format: Converted to YYYY-MM-DD (e.g., 30.5.20252025-05-30)
  • Partial dates: Converted to year-only format (e.g., 5/20252025)
  • Anomalies: Unparseable dates are logged and reported

(Edited the functionality of add_metadata.py added here)

@jr-rk jr-rk requested a review from Copilot November 3, 2025 16:27
@jr-rk jr-rk self-assigned this Nov 3, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a new --fix-date-format mode to the add_metadata tool that corrects date formats in existing dc.date.issued fields. The mode supports relaxed validation rules that preserve year-only dates (YYYY) and partial dates while normalizing full dates to YYYY-MM-DD format.

Key changes:

  • Added is_year_only(), is_valid_relaxed(), and parse_relaxed() methods to the date class for flexible date handling
  • Introduced update_existing_metadata_relaxed() method with relaxed validation rules
  • Added --fix-date-format CLI flag that auto-configures the tool to fix dc.date.issued field
  • Enhanced tracking with new info categories: valid_year_only, null_values, and anomalies

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
tools/add_metadata/add_metadata.py Implements relaxed date parsing, validation methods, and new fix-date-format mode with enhanced tracking
tools/add_metadata/README.md Documents the new fix-date-format mode with usage examples and behavior specifications
Comments suppressed due to low confidence (1)

tools/add_metadata/add_metadata.py:183

  • The function update_item references a global variable dspace_be that is not defined at the module level where this function is declared. The dspace_be variable is only initialized later in the if __name__ == '__main__' block (line 469). This will cause a NameError if update_item is called before dspace_be is initialized or from a different context. Consider passing dspace_be as a parameter to this function.
    if dspace_be.client.update_item(item):

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jr-rk jr-rk requested a review from milanmajchrak November 4, 2025 08:47
Copy link
Collaborator

@milanmajchrak milanmajchrak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copilot

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

id_str = f"Item [{uuid}]: [{self._to_mtd_field}]"

# Check for null/empty values - keep them untouched
if date_str is None or date_str.strip() == "":
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will raise an AttributeError if date_str is None because None doesn't have a strip() method. The condition should be reordered to check for None first: if date_str is None or (date_str and date_str.strip() == \"\") or use if not date_str or date_str.strip() == \"\".

Suggested change
if date_str is None or date_str.strip() == "":
if not date_str or date_str.strip() == "":

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If date_str is None evaluates to True, the entire expression is True and .strip() is never called

Comment on lines +165 to +168
partial_formats = [('%Y-%m', '%Y'), ('%m-%Y', '%Y'),
('%Y/%m', '%Y'), ('%m/%Y', '%Y'),
('%Y.%m', '%Y'), ('%m.%Y', '%Y')]
for parse_fmt, output_fmt in partial_formats:
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output format parameter '%Y' is repeated for all partial formats. Consider simplifying to partial_formats = ['%Y-%m', '%m-%Y', '%Y/%m', '%m/%Y', '%Y.%m', '%m.%Y'] and always use '%Y' as the output format in the loop.

Suggested change
partial_formats = [('%Y-%m', '%Y'), ('%m-%Y', '%Y'),
('%Y/%m', '%Y'), ('%m/%Y', '%Y'),
('%Y.%m', '%Y'), ('%m.%Y', '%Y')]
for parse_fmt, output_fmt in partial_formats:
partial_formats = ['%Y-%m', '%m-%Y', '%Y/%m', '%m/%Y', '%Y.%m', '%m.%Y']
for parse_fmt in partial_formats:

Copilot uses AI. Check for mistakes.
ret_empty_meta = 4

def __init__(self, dspace_be, from_mtd_fields: list, to_mtd_field: list, dry_run: bool = False):
def __init__(self, dspace_be, from_mtd_fields: list, to_mtd_field: list, dry_run: bool = False, fix_date_mode: bool = False):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need this relaxed mode

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 13 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

return updater.ret_invalid_meta

# Convert date to correct format if necessary

Copy link

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace on line. This should be removed to maintain code cleanliness.

Suggested change

Copilot uses AI. Check for mistakes.
if date.is_year_only(date_str):
self._info["valid_year_only"].append((uuid, date_val.input))
_logger.info(f"{id_str}: year-only format [{date_str}] - keeping as-is")
elif len(date_str) == 7 and date_str[4] == '-': # YYYY-MM format
Copy link

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded check for YYYY-MM format is fragile. This checks for exactly 7 characters with a dash at position 4, which could match invalid dates like "abcd-ef". Consider using a regex pattern or the datetime parsing logic already available to validate the format more robustly.

Copilot uses AI. Check for mistakes.
uuid = item['uuid']
id_str = f"Item [{uuid}]: [{self._to_mtd_field}]"

# No null/empty handling - let it crash if needed
Copy link

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment "No null/empty handling - let it crash if needed" indicates intentional lack of defensive programming. However, this could lead to unclear error messages when None or empty values are encountered. Consider at least validating that date_str is not None/empty and providing a clear error message if it is, rather than allowing a generic crash.

Copilot uses AI. Check for mistakes.
f"Forced metadata change but no value found for [{uuid}]")
return updater.ret_empty_meta

# Always use standard validation/parsing (no relaxed mode)
Copy link

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment "Always use standard validation/parsing (no relaxed mode)" is unclear and adds no value. Either expand it to explain what it means and why it's important, or remove it.

Suggested change
# Always use standard validation/parsing (no relaxed mode)
# Use strict validation/parsing for metadata values to ensure data integrity.
# Relaxed parsing is intentionally avoided to prevent accepting malformed or ambiguous data.

Copilot uses AI. Check for mistakes.
return True
except ValueError:
pass

Copy link

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace on line. This should be removed to maintain code cleanliness.

Suggested change

Copilot uses AI. Check for mistakes.
except ValueError:
# The test format does not match the input date format
continue

Copy link

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace on line. This should be removed to maintain code cleanliness.

Suggested change

Copilot uses AI. Check for mistakes.
Comment on lines +246 to +251
"""Update existing metadata with hybrid rules:
- No null/empty handling (will crash on None)
- YYYY formats kept as-is
- Partial dates (YYYY-MM) kept as-is (normalized)
- Invalid dates logged as ANOMALY
"""
Copy link

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete docstring. The docstring mentions "No null/empty handling (will crash on None)" but doesn't explain the force parameter, the return value, or other important aspects of the method's behavior.

Suggested change
"""Update existing metadata with hybrid rules:
- No null/empty handling (will crash on None)
- YYYY formats kept as-is
- Partial dates (YYYY-MM) kept as-is (normalized)
- Invalid dates logged as ANOMALY
"""
"""
Update existing metadata for an item using hybrid date rules.
Args:
item (dict): The item whose metadata is to be updated.
date_str (str): The date string to validate and update.
force (bool, optional): If True, forces the update even if the date is invalid or not in a preferred format.
If False, only updates if the date is not already valid or in a preferred format.
Behavior:
- No null/empty handling (will crash on None).
- Dates in YYYY format are kept as-is.
- Partial dates (YYYY-MM) are kept as-is (normalized).
- Invalid dates are logged as ANOMALY and not updated unless force=True.
- If force=True, attempts to update the metadata regardless of validity.
Returns:
int: Status code indicating the result:
- updater.ret_already_ok: Metadata was already valid and no update was performed.
- updater.ret_invalid_meta: The date was invalid and could not be updated.
- updater.ret_updated: Metadata was successfully updated.
- updater.ret_failed: An error occurred during the update.
"""

Copilot uses AI. Check for mistakes.
self._info["valid"].append((uuid, date_val.input))
_logger.info(f"{id_str}: partial date format [{date_str}] - keeping as-is")
else:
self._info["valid"].append((uuid, date_val.input))
Copy link

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent logging behavior. Year-only and partial date formats log info messages (lines 262, 265), but valid full dates in YYYY-MM-DD format do not (line 267). This creates inconsistent verbosity where some "valid" cases are logged and others are silent, making debugging and monitoring more difficult.

Suggested change
self._info["valid"].append((uuid, date_val.input))
self._info["valid"].append((uuid, date_val.input))
_logger.info(f"{id_str}: full date format [{date_str}] - keeping as-is")

Copilot uses AI. Check for mistakes.
Comment on lines +100 to +101
pass

Copy link

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace on lines. This should be removed to maintain code cleanliness.

Copilot uses AI. Check for mistakes.
- Invalid dates logged as ANOMALY
"""
uuid = item['uuid']
id_str = f"Item [{uuid}]: [{self._to_mtd_field}]"
Copy link

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace on line. This should be removed to maintain code cleanliness.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants