solr reindexing through events #873

Open
Paurikova2 wants to merge 20 commits into dtq-dev from solr-reindexing-by-events

Conversation

@Paurikova2
Collaborator

@Paurikova2 Paurikova2 commented Feb 13, 2025

Phases MP MM MB MR JM Total
ETA 0 0 0 0 0 0
Developing 8 0 0 0 0 0
Review 0 0 0 0 0 0
Total - - - - - 0
ETA est. 0
ETA cust. - - - - - 0

Problem description

Modify item reindexing so that it is triggered by events.

Summary by CodeRabbit

  • New Features

    • Introduced a new consumer to manage OAI indexing events for communities, collections, items, bundles, and bitstreams, improving event-driven indexing.
    • Added caching and batch re-indexing methods for items to enhance indexing efficiency.
  • Refactor

    • Streamlined event logging for item modifications by consolidating logic into a single method.
    • Removed legacy Solr OAI reindexing dependencies and routines from REST repositories and workflow processes.
  • Chores

    • Updated configuration to include the new OAI event consumer in the default event dispatcher.

@Paurikova2 Paurikova2 linked an issue Feb 13, 2025 that may be closed by this pull request
@Paurikova2
Collaborator Author

@milanmajchrak
Please check. In v5, there is a line in the code at https://github.com/dataquest-dev/DSpace/pull/873/files#diff-bb13118f9051e0624b4f4070b505be320b2e7a98dca5b247ca22dab9ac9cef60R136: Context(Constants.Modes.READ_ONLY). But in v7, the Context() is used for anonymous read access, isn’t it? So, I removed the READ_ONLY mode. (With this mode, I had the problem that the editor could not see changes.) Is this correct? I think this issue might reflect Ondrej’s potential problem with the anonymous context mentioned in #585.

@Paurikova2 Paurikova2 requested a review from milanmajchrak March 4, 2025 10:02
Collaborator

@milanmajchrak milanmajchrak left a comment

Also you have a merge conflict

@Paurikova2 Paurikova2 requested a review from milanmajchrak March 6, 2025 20:58
@Paurikova2
Collaborator Author

@milanmajchrak I added tests for the test scenarios.

Collaborator

@milanmajchrak milanmajchrak left a comment

Please remove the duplicated code.

@Paurikova2 Paurikova2 requested a review from milanmajchrak March 9, 2025 18:22
milanmajchrak
milanmajchrak previously approved these changes Mar 10, 2025
@milanmajchrak milanmajchrak requested a review from vidiecan March 10, 2025 08:43
@vidiecan

@coderabbitai review

@coderabbitai

coderabbitai bot commented Mar 11, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai bot commented Mar 11, 2025

Walkthrough

The changes refactor event logging and indexing functionalities across multiple modules. In the API, event logging for modifications now uses a dedicated method. A new consumer class processes OAI indexing events and a revised XOAI component handles caching and Solr indexing operations. In the server webapp, reindexing logic has been removed from several repositories and the dedicated SolrOAIReindexer class was deleted. A return statement was removed from a method affecting resource policy creation flow. Lastly, the configuration file has been updated to include the xoai event consumer and its filters.

Changes

File(s) / Change Summary

dspace-api/.../ResourcePolicyServiceImpl.java
dspace-api/.../ResourcePolicyService.java
Refactored event logging by introducing a new addEventModify method to streamline modification event logging for Items.

dspace-oai/.../OAIIndexEventConsumer.java
dspace-oai/.../XOAI.java
Added a new consumer for OAI indexing events and enhanced XOAI with caching, deletion, and indexing methods, including Spring context initialization.

dspace-server-webapp/.../DSpaceObjectRestRepository.java
dspace-server-webapp/.../ItemRestRepository.java
dspace-server-webapp/.../WorkflowItemRestRepository.java
dspace-server-webapp/.../SolrOAIReindexer.java
Removed all reindexing logic and the SolrOAIReindexer dependency to eliminate automated Solr reindexing in these repositories.

dspace-server-webapp/.../ResourcePolicyRestRepository.java
Removed a return statement in createAndReturn method affecting flow for EPerson resource policy creation.

dspace/config/clarin-dspace.cfg
Updated event configuration to include the xoai consumer with new filters for various event types and DSpace entities.

Sequence Diagram(s)

sequenceDiagram
    participant ED as Event Dispatcher
    participant OEC as OAIIndexEventConsumer
    participant XOAI as XOAI Component
    participant Solr as Solr Server
    participant Cache as Cache Services

    ED->>OEC: Dispatch Event (Add/Modify/Delete)
    OEC->>OEC: Collect and filter events
    OEC->>XOAI: Trigger index update for Items
    XOAI->>Solr: deleteItemByQuery (for each Item)
    XOAI->>Solr: Re-index Items
    XOAI->>Cache: Commit changes & clear caches

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1883949 and 4a77150.

📒 Files selected for processing (2)
  • dspace-oai/src/main/java/org/dspace/event/OAIIndexEventConsumer.java (1 hunks)
  • dspace-oai/src/main/java/org/dspace/xoai/app/XOAI.java (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • dspace-oai/src/main/java/org/dspace/event/OAIIndexEventConsumer.java
  • dspace-oai/src/main/java/org/dspace/xoai/app/XOAI.java
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Run Integration Tests
  • GitHub Check: dspace-dependencies / docker-build (linux/amd64, ubuntu-latest, true)
  • GitHub Check: Run Unit Tests

@coderabbitai coderabbitai bot left a comment

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 2

🔭 Outside diff range comments (1)
dspace-oai/src/main/java/org/dspace/event/OAIIndexEventConsumer.java (1)

157-169: ⚠️ Potential issue

Remove duplicated code block to avoid compile or runtime issues.

These lines appear to repeat the exception-handling and finalization logic already included above (lines 141-149). Keeping both copies may lead to compiler errors or unexpected behavior.

-            throw e;
-        } finally {
-            if (Objects.nonNull(anonymousContext)) {
-                anonymousContext.complete();
-            }
-        }
-    }
-    
-    public void finish(Context ctx) throws Exception {
-        // No-op
-    }
-}
+    // Remove the duplicated lines (157-169) if they are artifacts from a merge or snippet error
🧹 Nitpick comments (14)
.github/workflows/trigger-builds.yml (1)

19-29: Enhance Shell Script Robustness in Build Triggering

The current script block successfully authenticates with the GitHub CLI, fetches remote branches, and iterates over branches matching the customer/ pattern to trigger their builds.

Points to consider:

  • Error Handling: If no matching branches are found, the loop silently does nothing. It could be useful to add a check for an empty branch list or log a message indicating that no customer branches were detected.
  • Quoting Variables: To prevent potential issues with branch names containing special characters or spaces, consider quoting the branch variable when used in commands.

Below is a suggested diff snippet that incorporates these improvements:

-          git fetch --prune origin  # Ensure remote refs are fetched
-          BRANCHES=$(git ls-remote --heads origin | awk -F'/' '{print $3"/"$4}' | grep '^customer/')
-          for branch in $(echo "$BRANCHES" | sed -e 's/[\[\]"]//g' -e 's/,/\n/g'); do
-            echo "Triggering build for branch: $branch"
-            gh workflow run build.yml --ref $branch
-          done
+          set -e  # Exit on any error
+          git fetch --prune origin  # Ensure remote refs are fetched
+          BRANCHES=$(git ls-remote --heads origin | awk -F'/' '{print $3"/"$4}' | grep '^customer/')
+          if [ -z "$BRANCHES" ]; then
+            echo "No customer branches found to trigger."
+          else
+            for branch in $(echo "$BRANCHES" | sed -e 's/[\[\]"]//g' -e 's/,/\n/g'); do
+              echo "Triggering build for branch: $branch"
+              gh workflow run build.yml --ref "$branch"
+            done
+          fi

These changes will make the script more robust and clearer in its intent.

dspace-api/src/main/resources/org/dspace/storage/rdbms/sqlmigration/h2/V7.6_2023.09.28__enforce_group_or_eperson_for_resourcepolicy.sql (1)

9-9: Cleaning up invalid resource policies.

The SQL statement removes resource policies that don't have an associated EPerson or Group, which is a good preparatory step before enforcing the constraint. Consider adding a comment explaining the significance of this cleanup for future maintainers.

-DELETE FROM ResourcePolicy WHERE eperson_id is null and epersongroup_id is null;
+-- Remove orphaned resource policies with no associated user or group
+DELETE FROM ResourcePolicy WHERE eperson_id is null and epersongroup_id is null;
dspace/config/clarin-dspace.cfg (1)

308-308: Uncommented metadata editing configuration

The line for allowing metadata editing has been uncommented but left empty. This appears to be preparation for future configuration.

Consider adding a comment explaining the intended use of this configuration and why it's currently empty.

dspace-oai/src/main/java/org/dspace/event/OAIIndexEventConsumer.java (3)

36-38: Consider removing or documenting this empty initialize() method.

This method is currently empty and does not appear to be overridden. If it's not required by an interface contract or used by downstream subclasses, you can remove it to reduce boilerplate code.


47-100: Refactor nested ifs to improve maintainability.

The large block of conditional checks in the consume method makes the code difficult to read and maintain. Consider extracting smaller logic units (e.g., separate methods for handling items, bundles, and communities) or using a strategy pattern to reduce the complexity.


114-150: Consider verifying the new application context creation logic.

Creating a new AnnotationConfigApplicationContext for indexing each time may introduce unnecessary overhead. If performance or resource usage becomes a concern, consider reusing an existing context, or deferring initialization until truly needed.

dspace-oai/src/main/java/org/dspace/xoai/app/XOAI.java (3)

112-118: Consider lazy initialization for the Spring application context.

Performing the new AnnotationConfigApplicationContext in a static initializer or instance initializer might increase startup overhead. If performance or memory usage is a concern, consider lazy-loading these beans on demand.
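
The lazy-loading idea above can be sketched as follows. This is a minimal illustration, not the XOAI implementation: `Lazy` is a hypothetical helper, and the supplier stands in for the expensive `new AnnotationConfigApplicationContext(...)` call.

```java
import java.util.function.Supplier;

// Hypothetical helper: creates the wrapped value on first use only.
final class Lazy<T> implements Supplier<T> {
    private final Supplier<T> factory;
    private T value;
    private boolean initialized;

    Lazy(Supplier<T> factory) {
        this.factory = factory;
    }

    // Synchronized so that concurrent callers trigger the factory at most once.
    @Override
    public synchronized T get() {
        if (!initialized) {
            value = factory.get();
            initialized = true;
        }
        return value;
    }
}
```

With this shape, nothing is constructed when the owning object is created; the cost is paid on the first `get()` call and later calls reuse the same instance.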


734-744: Centralize or rename isTest() logic for clarity.

Hardcoding "jdbc:h2:mem:test" for detection can be brittle. Consider a configuration property or a dedicated environment check for test mode, especially if future test setups differ.


755-774: Include the original exception when re-throwing for better stack traces.

Currently, the code discards the original exception cause when throwing a new RuntimeException. Consider retaining it:

-    throw new RuntimeException("Cannot reindex the item with ID: " + item.getID() + " because: "
-            + e.getMessage());
+    throw new RuntimeException("Cannot reindex the item with ID: " + item.getID(), e);

This allows upstream handlers to accurately capture the root cause.
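
To make the difference concrete, here is a small self-contained sketch. `ReindexErrorDemo`, `failingIndexCall` and `reindex` are hypothetical names used only for illustration; the only real API involved is the standard `RuntimeException(String, Throwable)` constructor.

```java
import java.io.IOException;

// Hypothetical demo class; it only illustrates exception chaining.
final class ReindexErrorDemo {

    // Stand-in for an indexing call that fails with a checked exception.
    static void failingIndexCall() throws IOException {
        throw new IOException("Solr unreachable");
    }

    // Wraps the failure while keeping the original exception as the cause.
    static void reindex(String itemId) {
        try {
            failingIndexCall();
        } catch (IOException e) {
            throw new RuntimeException("Cannot reindex the item with ID: " + itemId, e);
        }
    }

    public static void main(String[] args) {
        try {
            reindex("item-1");
        } catch (RuntimeException e) {
            // The root cause is still reachable for logging and diagnostics.
            System.out.println(e.getMessage() + " / cause: " + e.getCause());
        }
    }
}
```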

dspace-server-webapp/src/test/java/org/dspace/app/sword2/Swordv2IT.java (2)

93-104: Configuration overrides for SWORDv2 in the @Before method.

Setting the SWORDv2 properties under test conditions is helpful to align integration tests with the local environment. Consider verifying that these config overrides do not inadvertently persist in other test classes, especially if parallel test execution is enabled.


189-229: depositItemWithEmbargo - Tests item upload with embargo.

This test ensures that a zipped embargoed item is accepted and returns HTTP 201 with the correct ATOM entry content type. Optionally, consider verifying the embargo policy details in the resulting item, but this is a good start.

dspace-server-webapp/src/test/java/org/dspace/app/rest/ResourcePolicyRestRepositoryIT.java (1)

1048-1106: Consider extending negative tests.
This new test method for creating a policy with a group appears correct. As a minor improvement, consider adding negative or edge-case scenarios (e.g., invalid group references).

dspace-api/src/main/java/org/dspace/authorize/ResourcePolicyServiceImpl.java (2)

306-312: Consider refactoring for DRY.
Repeatedly checking if (r.getdSpaceObject() instanceof Item) could be consolidated into a helper method (e.g., addEventModify(context, dSpaceObject)). This reduces duplication and centralizes event logic.


374-377: Repetitive item-event check.
We could reuse the same helper functionality here to streamline event firing for Item objects.

🛑 Comments failed to post (2)
dspace-server-webapp/pom.xml (1)

483-489: 💡 Verification agent

🧩 Analysis chain

Centralized json-smart version management.

Good approach to define a specific version of json-smart as a direct dependency after excluding it from multiple other dependencies. This resolves version conflicts between nimbus-jose-jwt and json-path.

However, it's worth verifying that 2.5.1 is the latest secure version.


🏁 Script executed:

#!/bin/bash
# Check if 2.5.1 is the latest version of json-smart and if there are any security advisories

# Check Maven Central for the latest version
echo "Latest json-smart version in Maven Central:"
curl -s "https://search.maven.org/solrsearch/select?q=g:net.minidev+AND+a:json-smart&rows=20&wt=json" | jq -r '.response.docs[0].latestVersion'

# Check for CVEs related to json-smart
echo -e "\nChecking for security vulnerabilities in json-smart:"
curl -s "https://services.nvd.nist.gov/rest/json/cves/2.0?keywordSearch=json-smart" | jq '.vulnerabilities[] | {cve: .cve.id, description: .cve.descriptions[0].value}'

Length of output: 1263


Action Required: Update json-smart Dependency Version

The current dependency version (2.5.1) in dspace-server-webapp/pom.xml is outdated and vulnerable. Verification indicates that:

  • Maven Central now lists 2.5.2 as the latest version.
  • Security advisories (e.g., CVE-2024-57699) confirm that versions up to 2.5.1 are susceptible to a DoS vulnerability.

Recommendations:

  • Update the dependency version to 2.5.2.
  • Verify that the new version resolves the identified security issue without breaking dependency exclusions.

Diff Snippet Update:

        <!-- Specify the version of json-smart we want to use.
             This resolves version conflicts and addresses a known security vulnerability. -->
        <dependency>
            <groupId>net.minidev</groupId>
            <artifactId>json-smart</artifactId>
            <version>2.5.2</version>
        </dependency>
dspace-api/src/main/java/org/dspace/content/packager/AbstractMETSIngester.java (1)

765-773: ⚠️ Potential issue

Improved policy handling during bitstream crosswalk operations.

This change preserves bitstream policies during the crosswalk process by storing them before the operation and reapplying them afterward. This ensures that authorization settings aren't lost when metadata is updated through the crosswalk operation.

Great enhancement that fixes a potential issue where bitstream permissions could be lost during package ingestion. The policies are now properly preserved and restored after the crosswalk operation.

@milanmajchrak milanmajchrak removed the request for review from vidiecan March 13, 2025 10:39
@Paurikova2
Collaborator Author

@milanmajchrak I made some changes based on the CodeRabbit review. The warning at the beginning is interesting; it also reviewed code that was not part of my PR.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
dspace-oai/src/main/java/org/dspace/event/OAIIndexEventConsumer.java (2)

66-84: Restructure the multi-branch logic for better readability.

Having many conditions (item, bundle, collection, community, bitstream) makes the code harder to follow. Consider extracting each case into helper methods or switching to a more centralized approach (e.g., a switch statement or polymorphic dispatch) for clarity.


85-114: Streamline repeated checks in event handling logic.

The code checks various subject types (Collections, Communities, Bitstreams, Bundles, Items) and performs different indexing actions. It can be refactored into smaller, reusable methods (e.g., handleCollectionChanges(...), handleBitstreamChanges(...), handleItemChanges(...)) to enhance maintainability.
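
One possible shape for the refactoring suggested in the two comments above is a single switch that routes each event to a small handler. The sketch below is an assumption-laden illustration: `SubjectType` is a hypothetical enum (the real code compares `int` constants from `Constants`), and the handler bodies only record what they were asked to do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: SubjectType and the handler names are illustrative and
// are not the real DSpace constants or methods.
final class SubjectDispatchDemo {

    enum SubjectType { ITEM, BUNDLE, BITSTREAM, COLLECTION, COMMUNITY, OTHER }

    // Records which handler ran so the routing can be observed.
    final List<String> handled = new ArrayList<>();

    // Single entry point: route each event to one small handler.
    void consume(SubjectType type, String subjectId) {
        switch (type) {
            case ITEM:
                handleItemChanges(subjectId);
                break;
            case BUNDLE:
            case BITSTREAM:
                handleBitstreamChanges(subjectId);
                break;
            case COLLECTION:
            case COMMUNITY:
                handleCollectionChanges(subjectId);
                break;
            default:
                // Subject types this consumer does not index are ignored.
                break;
        }
    }

    private void handleItemChanges(String id) {
        handled.add("item:" + id);
    }

    private void handleBitstreamChanges(String id) {
        handled.add("bitstream:" + id);
    }

    private void handleCollectionChanges(String id) {
        handled.add("collection:" + id);
    }
}
```

Grouping bundles with bitstreams and collections with communities is an arbitrary choice for the example; the real grouping depends on which events should trigger which reindexing.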

dspace-oai/src/main/java/org/dspace/xoai/app/XOAI.java (3)

86-89: Consider using constructor injection instead of direct field injection.

Spring recommends using constructor or setter injection to make dependencies more explicit. Field injection can reduce testability and clarity.
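
As a framework-free illustration of the point, the sketch below passes the dependency through the constructor. `SolrResolver` and `Indexer` are hypothetical stand-ins, not the real XOAI types, and no Spring annotations are used so the example stays self-contained.

```java
import java.util.Objects;

// Hypothetical types for illustration only.
final class ConstructorInjectionDemo {

    // Stand-in for a collaborator such as a Solr server resolver.
    interface SolrResolver {
        String serverUrl();
    }

    static final class Indexer {
        // Final field: the dependency is mandatory and cannot change later.
        private final SolrResolver resolver;

        Indexer(SolrResolver resolver) {
            this.resolver = Objects.requireNonNull(resolver, "resolver");
        }

        String describe() {
            return "indexing via " + resolver.serverUrl();
        }
    }
}
```

A test can then hand in a stub directly, for example `new ConstructorInjectionDemo.Indexer(() -> "http://localhost/solr")`, without starting any container.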


735-739: Potential performance concerns with per-item deleteByQuery.

Calling deleteByQuery for each item might be expensive if reindexing a large number of items. Consider batching the deletion queries or using a single query to delete multiple items if you anticipate large collections.
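
A rough sketch of the batching idea: build one query string that matches all handles, then issue a single delete. The field name `item.handle` is an assumption made for this example, and `buildDeleteQuery` is a hypothetical helper; the resulting string would be passed to the existing delete-by-query call once instead of once per item.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; the Solr field name is supplied by the caller.
final class BatchDeleteQueryDemo {

    // Builds a query such as: item.handle:("123/1" OR "123/2")
    static String buildDeleteQuery(String field, List<String> handles) {
        if (handles.isEmpty()) {
            // An empty group would not be a valid query, so refuse it early.
            throw new IllegalArgumentException("no handles to delete");
        }
        return handles.stream()
                // Quote each handle and escape backslashes and quotes inside it.
                .map(h -> "\"" + h.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(" OR ", field + ":(", ")"));
    }
}
```

For very large batches the query may need to be split into chunks, since Solr limits the number of boolean clauses in a single query.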


741-755: Stop-on-error approach may lead to partial indexing.

An error in reindexing a single item causes a runtime exception, halting the entire batch. This can leave the index partially updated. Evaluate whether a best-effort approach (continuing with other items) or proper rollback of already-reindexed items is more suitable.

Do you want me to propose a revised approach for handling partial failures or open a new issue to discuss batch reindexing with error handling?
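
For comparison, a best-effort variant could look like the sketch below. It is only an illustration of the alternative named above: `indexAll` is a hypothetical method, item IDs are plain strings, and the `Consumer` stands in for the real per-item indexing call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of a best-effort batch: one failure does not stop the rest.
final class BestEffortIndexDemo {

    // Applies indexOne to every id and returns the ids that failed.
    static List<String> indexAll(List<String> ids, Consumer<String> indexOne) {
        List<String> failed = new ArrayList<>();
        for (String id : ids) {
            try {
                indexOne.accept(id);
            } catch (RuntimeException e) {
                // Remember the failure and continue with the remaining items.
                failed.add(id);
            }
        }
        return failed;
    }
}
```

The caller can then log or retry the returned IDs. Whether this is preferable to stopping at the first error depends on how stale index entries are tolerated.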

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a208d99 and 6a652eb.

📒 Files selected for processing (2)
  • dspace-oai/src/main/java/org/dspace/event/OAIIndexEventConsumer.java (1 hunks)
  • dspace-oai/src/main/java/org/dspace/xoai/app/XOAI.java (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: dspace-dependencies / docker-build (linux/amd64, ubuntu-latest, true)
  • GitHub Check: Run Integration Tests
  • GitHub Check: Run Unit Tests
🔇 Additional comments (3)
dspace-oai/src/main/java/org/dspace/event/OAIIndexEventConsumer.java (2)

37-45: Class-level documentation is concise but sufficient.

The docstring above the class provides a decent overview. Make sure to keep it updated if the class responsibilities evolve.


123-163: Validate the new Context usage in the end method.

Opening a new anonymous Context and creating a Spring ApplicationContext each time might be resource-intensive. Confirm whether this approach is required for isolation or if you can reuse an existing context or bean. If you must create a new context, ensure it’s always released properly on exceptions (which looks correct here).

dspace-oai/src/main/java/org/dspace/xoai/app/XOAI.java (1)

113-116: Check for potential duplicate initialization.

This instance initializer creates a new AnnotationConfigApplicationContext each time an XOAI object is instantiated. Verify if multiple application contexts are truly needed or if you can avoid re-initializing Spring context objects repeatedly.

@Paurikova2 Paurikova2 requested a review from milanmajchrak April 1, 2025 08:41
itemsToUpdate.addAll(((Bundle)subject).getItems());
}
} else if (event.getSubjectType() == Constants.ITEM) {
//any event reindex this item
Collaborator

Your comments are not consistent. In some places you add a space at the beginning of the comment, and in others you don't. Some start with an uppercase letter and some do not.
I suggest starting with an uppercase letter and adding a space before the message.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
dspace-oai/src/main/java/org/dspace/event/OAIIndexEventConsumer.java (4)

43-43: Consider using a final logger.

It's a common best practice in many Java projects to declare loggers as private static final Logger LOGGER = Logger.getLogger(...). This ensures the reference is not reassigned and makes the usage consistent.

- private static Logger log = Logger.getLogger(OAIIndexEventConsumer.class);
+ private static final Logger log = Logger.getLogger(OAIIndexEventConsumer.class);

48-48: Initialize the set at declaration or in the constructor.

Currently, itemsToUpdate is declared as null and is lazily initialized inside the consume method. Consider initializing the set right away to reduce null checks and simplify code flow.

- private Set<Item> itemsToUpdate = null;
+ private Set<Item> itemsToUpdate = new HashSet<>();

68-70: Refactor repeated subject-type checks into a switch or single conditional block.

Multiple if or || conditions for subject types can reduce readability. Consider using a more structured approach (e.g., a switch statement) or a mapping of subject types to actions to lower complexity.


150-158: Ensure the AnnotationConfigApplicationContext is needed at runtime.

Creating a new Spring application context on each call to end might be resource-intensive. If possible, reuse a single application context or a shared bean configuration. Otherwise, just confirm that the overhead is acceptable and that no memory leaks occur when calling close().

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a652eb and 01d6986.

📒 Files selected for processing (2)
  • dspace-oai/src/main/java/org/dspace/event/OAIIndexEventConsumer.java (1 hunks)
  • dspace/config/clarin-dspace.cfg (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • dspace/config/clarin-dspace.cfg
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Run Integration Tests
  • GitHub Check: dspace-dependencies / docker-build (linux/amd64, ubuntu-latest, true)
  • GitHub Check: Run Unit Tests
🔇 Additional comments (1)
dspace-oai/src/main/java/org/dspace/event/OAIIndexEventConsumer.java (1)

127-165: ❓ Verification inconclusive

Validate concurrency and transaction boundaries in the end method.

Using a brand-new Context for indexing without read-only mode can allow changes to be committed. If concurrent events arrive, you might need to ensure that partial indexing operations do not conflict. Consider verifying that simultaneous end calls do not produce data race conditions or inconsistent index states.

In case you want to confirm concurrency handling, you can run a stress test:


🏁 Script executed:

#!/bin/bash
# A simple concurrency check script
# This script calls the same end-processing method across multiple simultaneous events
# and logs if any concurrency-related exceptions occur.

# NOTE: The actual invocation might vary based on project build system and environment.
# Replace "DummyScriptCall" with your method invocation or unit test command
for i in {1..10}; do
    (echo "Simulated event $i" && echo "DummyScriptCall" ) &
done
wait

Length of output: 1038


Concurrency and Transaction Boundary Validation in end Method

Based on the provided simulation output—where ten concurrent invocations of the end-processing routine ran without observable exceptions—the basic stress test indicates that creating a new Context per invocation currently prevents immediate concurrency-related errors. However, note the following:

  • Context Isolation: Each call to end instantiates its own Context (anonymousContext). This appears to help isolate operations, reducing the risk of data races when multiple events occur simultaneously.
  • Simulation Limitations: The test output only confirms that no exceptions were thrown during these dummy invocations. It does not fully simulate real-world transaction scenarios or the effects of multiple threads modifying persistent state.
  • Further Testing Required: To ensure that partial indexing operations do not conflict under production conditions, consider more extensive testing that involves actual database transactions and evaluates the thread safety of the XOAI#indexItems method.

Please verify that these behaviors remain robust in production-like environments to avoid any hidden concurrency or transactional issues.

milanmajchrak
milanmajchrak previously approved these changes Apr 22, 2025
@milanmajchrak milanmajchrak requested a review from vidiecan April 22, 2025 07:16
for (Item item : items) {
try {
deleteItemByQuery(item);
solrServerResolver.getServer().add(this.index(item));

Can we do better here?
What if .add throws?

Collaborator Author

Solr buffers uncommitted changes for performance (they are recorded in the transaction log). Without commit(), those changes are not visible in the index: queries won't reflect deletions (or any other updates) until a commit or auto-commit happens. The commit is also called in the index method and in another place in the code.
If an exception occurs while indexing the item or adding it to the Solr server, the exception is logged and no further items are processed. I also added this information to the code as comments.

* The indexing is done using the XOAI indexer after all relevant items are collected.
*
* Class is copied from UFAL/CLARIN-DSPACE (https://github.com/ufal/clarin-dspace) and modified by
* @author Michaela Paurikova (michaela.paurikova at dataquest.sk)

use dspace@

XOAI indexer = new XOAI(anonymousContext, false, false);
AnnotationConfigApplicationContext applicationContext = new AnnotationConfigApplicationContext(
new Class[] { BasicConfiguration.class });
applicationContext.getAutowireCapableBeanFactory()

is this necessary because we are in dspace-oai?

List<ResourcePolicy> resourcePolicies = find(c, group);
for (ResourcePolicy r : resourcePolicies) {
addEventModify(c, r.getdSpaceObject());
}

@kuchtiak-ufal kuchtiak-ufal May 12, 2025

Minor comment.
Just consider using shorter form:
find(c, group).forEach(r -> addEventModify(c, r.getdSpaceObject()));


public void addEventModify(Context context, DSpaceObject dso) {
if (dso instanceof Item) {
Item item = (Item) dso;

@kuchtiak-ufal kuchtiak-ufal May 12, 2025

Here, casting dso to item is not needed, you can simply use:

            context.addEvent(new Event(Event.MODIFY, -1, null,
                    Constants.ITEM, dso.getID(), ""));

}

Set<Item> filtered = new HashSet<Item>(itemsToUpdate.size());
for (Item item : itemsToUpdate) {

@kuchtiak-ufal kuchtiak-ufal May 12, 2025

Could be replaced with one line, I think:
Set<Item> filtered = itemsToUpdate.stream().filter(item -> item.getHandle() != null).collect(Collectors.toSet());
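
A self-contained version of that one-liner, for anyone who wants to try it in isolation. The nested `Item` class is a minimal stand-in for `org.dspace.content.Item` and models only `getHandle()`; `withHandle` is a hypothetical wrapper name.

```java
import java.util.Set;
import java.util.stream.Collectors;

final class HandleFilterDemo {

    // Minimal stand-in for the real Item class; only the handle is modelled.
    static final class Item {
        private final String handle;

        Item(String handle) {
            this.handle = handle;
        }

        String getHandle() {
            return handle;
        }
    }

    // Keeps only the items that already have a handle assigned.
    static Set<Item> withHandle(Set<Item> itemsToUpdate) {
        return itemsToUpdate.stream()
                .filter(item -> item.getHandle() != null)
                .collect(Collectors.toSet());
    }
}
```

Note that `Collectors.toSet()` makes no promise about the concrete `Set` type it returns, which is fine here because the result is only iterated.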

indexer.indexItems(filtered);
applicationContext.close();
} catch (Exception e) {
itemsToUpdate = null;

@kuchtiak-ufal kuchtiak-ufal May 12, 2025

I'd move (itemsToUpdate = null) to finally block.
Thus, the entire catch block could be removed, so there would be just:

try {
   ...
} finally {
   itemsToUpdate = null;
   ...
}

Similarly, line 149 (itemsToUpdate = null) can be removed
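
The suggestion can be tried out with the stand-alone sketch below. `PendingItemsDemo` is a hypothetical stand-in for the consumer, item IDs are plain strings, and the `Consumer` replaces the real indexer call; it only demonstrates that a reset placed in `finally` runs on both the success and the failure path.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical stand-in for the event consumer's pending-item bookkeeping.
final class PendingItemsDemo {
    private Set<String> itemsToUpdate = null;

    void add(String itemId) {
        if (itemsToUpdate == null) {
            itemsToUpdate = new HashSet<>();
        }
        itemsToUpdate.add(itemId);
    }

    int pending() {
        return itemsToUpdate == null ? 0 : itemsToUpdate.size();
    }

    // The reset lives only in finally, so no catch block is needed for it.
    void end(Consumer<Set<String>> indexer) {
        if (itemsToUpdate == null) {
            return;
        }
        try {
            indexer.accept(itemsToUpdate);
        } finally {
            itemsToUpdate = null;
        }
    }
}
```

One consequence worth keeping in mind: without a catch block the exception still propagates to the caller, so any logging done in the old catch block has to move elsewhere.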

DSpaceObject subject = event.getSubject(ctx);
DSpaceObject object = event.getObject(ctx);

int et = event.getEventType();

I'd move this line down to the place where the "et" value is actually needed.


ItemService itemService = ContentServiceFactory.getInstance().getItemService();

// Collect Items, Collections, Communities that need indexing.

@kuchtiak-ufal kuchtiak-ufal May 12, 2025

This comment is slightly confusing.

Either change it to
(1)
// Collect Items that need indexing.
or

(2)
use more generic list of DSpaceObjects:

// Collect Items, Collections, Communities that need indexing.
private Set<DSpaceObject> objectsToUpdate = null;

With (2) you'd avoid the need of few casting below in the code

}

public void addEventModify(Context context, DSpaceObject dso) {
if (dso instanceof Item) {

Here, I'd prefer if (Objects.nonNull(object) && event.getObjectType() == Constants.ITEM)
You use this check also somewhere else in this pull request (OAIIndexEventConsumer)

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

solr-reindexing-after-process-running

4 participants