[Port to dtq-dev] Port ItemMetadataQAChecker curation task from v5 to v7#1237
[Port to dtq-dev] Port ItemMetadataQAChecker curation task from v5 to v7#1237kosarko wants to merge 2 commits intodataquest-dev:dtq-devfrom
Conversation
* Add ItemMetadataQAChecker curation task with tests --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: kosarko <1842385+kosarko@users.noreply.github.com> (cherry picked from commit d5517be)
There was a problem hiding this comment.
Pull request overview
This pull request ports the ItemMetadataQAChecker curation task from DSpace v5 to v7 for the dtq-dev branch. The curation task performs quality assurance checks on item metadata, validating various properties such as dc.type values, language codes, duplicate metadata, and other metadata consistency requirements specific to CLARIN/LINDAT repositories.
Changes:
- Adds a new curation task
ItemMetadataQACheckerthat validates item metadata against CLARIN-specific quality requirements - Registers the new task in both production and test configuration files with the task name "metadataqa"
- Includes comprehensive integration tests covering various validation scenarios (invalid types, missing metadata, incorrect language mappings, duplicate fields)
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| dspace/config/modules/curate.cfg | Registers the ItemMetadataQAChecker curation task as "metadataqa" |
| dspace-api/src/test/data/dspaceFolder/config/modules/curate.cfg | Registers the task in test configuration |
| dspace-api/src/main/java/org/dspace/ctask/general/ItemMetadataQAChecker.java | Main implementation of the metadata QA checker with validation methods for types, languages, relations, and other metadata fields |
| dspace-api/src/test/java/org/dspace/curate/ItemMetadataQACheckerIT.java | Integration tests covering various validation scenarios and edge cases |
| .withTitle("Item With Two Available Dates") | ||
| .withMetadata("dc", "type", null, "corpus") | ||
| .withMetadata("dc", "date", "available", "2020-01-01") | ||
| .build(); | ||
|
|
||
| itemService.addMetadata(context, itemWithTwoAvailableDatesAndLang,"dc", "date", | ||
| "available", "en_US", "2021-01-01"); |
There was a problem hiding this comment.
Inconsistent indentation: this block uses 8 spaces for the builder pattern indentation, while the rest of the test file uses 4 spaces. This should be adjusted to match the consistent 4-space indentation used throughout the rest of the file.
| .withTitle("Item With Two Available Dates") | |
| .withMetadata("dc", "type", null, "corpus") | |
| .withMetadata("dc", "date", "available", "2020-01-01") | |
| .build(); | |
| itemService.addMetadata(context, itemWithTwoAvailableDatesAndLang,"dc", "date", | |
| "available", "en_US", "2021-01-01"); | |
| .withTitle("Item With Two Available Dates") | |
| .withMetadata("dc", "type", null, "corpus") | |
| .withMetadata("dc", "date", "available", "2020-01-01") | |
| .build(); | |
| itemService.addMetadata(context, itemWithTwoAvailableDatesAndLang,"dc", "date", | |
| "available", "en_US", "2021-01-01"); |
| .withMetadata("dc", "date", "available", "2020-01-01") | ||
| .build(); | ||
|
|
||
| itemService.addMetadata(context, itemWithTwoAvailableDatesAndLang,"dc", "date", |
There was a problem hiding this comment.
Missing space after comma in method call. Should be "itemWithTwoAvailableDatesAndLang, "dc"" to follow Java code style conventions.
| itemService.addMetadata(context, itemWithTwoAvailableDatesAndLang,"dc", "date", | |
| itemService.addMetadata(context, itemWithTwoAvailableDatesAndLang, "dc", "date", |
| .build(); | ||
|
|
||
| itemService.addMetadata(context, itemWithTwoAvailableDatesAndLang,"dc", "date", | ||
| "available", "en_US", "2021-01-01"); |
There was a problem hiding this comment.
Inconsistent indentation: this line uses 8 spaces for indentation while the rest of the file uses 4 spaces. Should be adjusted to match the consistent 4-space indentation pattern.
| "available", "en_US", "2021-01-01"); | |
| "available", "en_US", "2021-01-01"); |
| // Validate that each language name corresponds to its ISO code | ||
| for (int i = 0; i < dcsLanguageIso.size(); i++) { | ||
| String expectedLangName = IsoLangCodes.getLangForCode(dcsLanguageIso.get(i).getValue()); | ||
| String actualLangName = languageNames.get(i).getValue(); | ||
| if (!expectedLangName.equals(actualLangName)) { | ||
| throw new CurateException( | ||
| String.format("local.language.name [%s] does not match expected name [%s] for ISO code [%s]", | ||
| actualLangName, expectedLangName, dcsLanguageIso.get(i).getValue()), |
There was a problem hiding this comment.
The validation logic assumes that dc.language.iso and local.language.name metadata values are returned in matching order by index, but this assumption is incorrect. According to DSpace metadata documentation, when retrieving metadata fields, values are sorted by MetadataFieldId first and then by place value. Since dc.language.iso and local.language.name are different fields, there's no guarantee that their ordering will match when accessed by index.
The current implementation will produce false failures when metadata values for these fields have different place orderings. For example, if dc.language.iso values are added with places [0,1,2] and local.language.name values are added with places [2,1,0], the validation will incorrectly report mismatches even if the values actually correspond correctly.
The validation should either: (1) explicitly compare metadata values by their place attribute to ensure proper pairing, or (2) use a different approach that doesn't rely on list index ordering assumptions, such as building a set of expected language names and verifying all actual language names are in that set.
| // Validate that each language name corresponds to its ISO code | |
| for (int i = 0; i < dcsLanguageIso.size(); i++) { | |
| String expectedLangName = IsoLangCodes.getLangForCode(dcsLanguageIso.get(i).getValue()); | |
| String actualLangName = languageNames.get(i).getValue(); | |
| if (!expectedLangName.equals(actualLangName)) { | |
| throw new CurateException( | |
| String.format("local.language.name [%s] does not match expected name [%s] for ISO code [%s]", | |
| actualLangName, expectedLangName, dcsLanguageIso.get(i).getValue()), | |
| // Build maps of expected and actual language names keyed by place | |
| Map<Integer, String> expectedLangNamesByPlace = new HashMap<>(); | |
| Map<Integer, String> isoCodesByPlace = new HashMap<>(); | |
| for (MetadataValue langCodeDC : dcsLanguageIso) { | |
| Integer place = langCodeDC.getPlace(); | |
| String isoCode = langCodeDC.getValue(); | |
| String expectedLangName = IsoLangCodes.getLangForCode(isoCode); | |
| expectedLangNamesByPlace.put(place, expectedLangName); | |
| isoCodesByPlace.put(place, isoCode); | |
| } | |
| Map<Integer, String> actualLangNamesByPlace = new HashMap<>(); | |
| for (MetadataValue languageName : languageNames) { | |
| Integer place = languageName.getPlace(); | |
| String actualLangName = languageName.getValue(); | |
| actualLangNamesByPlace.put(place, actualLangName); | |
| } | |
| // Ensure that the sets of places match between ISO codes and language names | |
| Set<Integer> expectedPlaces = expectedLangNamesByPlace.keySet(); | |
| Set<Integer> actualPlaces = actualLangNamesByPlace.keySet(); | |
| if (!expectedPlaces.equals(actualPlaces)) { | |
| throw new CurateException( | |
| String.format("local.language.name places %s do not match dc.language.iso places %s", | |
| actualPlaces, expectedPlaces), | |
| Curator.CURATE_FAIL); | |
| } | |
| // Validate that each language name corresponds to its ISO code for each place | |
| for (Integer place : expectedPlaces) { | |
| String expectedLangName = expectedLangNamesByPlace.get(place); | |
| String actualLangName = actualLangNamesByPlace.get(place); | |
| if (!expectedLangName.equals(actualLangName)) { | |
| String isoCode = isoCodesByPlace.get(place); | |
| throw new CurateException( | |
| String.format( | |
| "local.language.name [%s] at place [%d] does not match expected name [%s] for ISO code [%s]", | |
| actualLangName, place, expectedLangName, isoCode), |
| curator.curate(context, itemWithIncorrectLanguageName.getHandle()); | ||
| int status = curator.getStatus(TASK_NAME); | ||
| String result = curator.getResult(TASK_NAME); | ||
| System.out.println("Test result: " + result); |
There was a problem hiding this comment.
The System.out.println debug statement should be removed or replaced with a proper logging statement. While System.out.println is used in some other test files in the codebase, it's generally not a best practice for test output. Consider using a logger instead, or remove this line entirely as the assertion message already provides diagnostic information when the test fails.
| System.out.println("Test result: " + result); |
dspace-api/src/main/java/org/dspace/ctask/general/ItemMetadataQAChecker.java
Show resolved
Hide resolved
dspace-api/src/main/java/org/dspace/ctask/general/ItemMetadataQAChecker.java
Outdated
Show resolved
Hide resolved
dspace-api/src/main/java/org/dspace/ctask/general/ItemMetadataQAChecker.java
Outdated
Show resolved
Hide resolved
|
@milanmajchrak updated, but haven't tested |
Port of ufal#1312 to
dtq-dev.