Conversation


@aduffeck aduffeck commented Aug 7, 2025

Description

This PR refactors the batch processing implementation within the search service to address concurrency issues and operational inconsistencies present in the previous design. The core change is that each individual operation now receives its own batch, ensuring that operations are isolated and controlled within their context, including child operations that occur during recursive traversals. The orchestration of batch execution is now automatic, but still allows external control for specific scenarios such as space indexing.

Motivation and Context

Previously, the search service's batching mechanism had several weaknesses:

  • The search engine (currently Bleve) maintained a single batch and a single index, which could be started externally by the service, e.g., during a recursive walk for space indexing.
  • Once a batch was started externally, all engine-internal operations also operated on that same batch, leading to unpredictable request timing and potential for concurrency issues. Since the service could be triggered by events (e.g., via NATS), this resulted in multiple goroutines writing to the same batch without control over whether instructions were executed immediately or batched for later flushing.
  • Operations such as Move required updating the parent and all nested children, which previously resulted in either many individual requests or, once batching was introduced, uncontrolled grouping into the shared batch.
  • The lack of clear batch boundaries made it difficult to reason about when updates were actually sent to the search backend, causing both performance and reliability issues.
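The shared-batch hazard described above can be made concrete with a small sketch (simplified, hypothetical types; this is not the actual OpenCloud code): a single engine-level batch that every event-triggered goroutine writes into, so unrelated operations get grouped together and flushed at an unpredictable time.

```go
package main

import (
	"fmt"
	"sync"
)

// engine models the OLD design: one shared batch for the whole engine.
// (Hypothetical stand-in types; a []string plays the role of a bleve batch.)
type engine struct {
	mu    sync.Mutex
	batch []string
}

// upsert funnels every operation into the same shared batch, regardless
// of which event or goroutine triggered it.
func (e *engine) upsert(id string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.batch = append(e.batch, "upsert:"+id)
}

// flush sends the accumulated batch and resets it.
func (e *engine) flush() []string {
	e.mu.Lock()
	defer e.mu.Unlock()
	out := e.batch
	e.batch = nil
	return out
}

func main() {
	e := &engine{}
	var wg sync.WaitGroup
	// Three independent events (e.g. arriving via NATS) in parallel.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			e.upsert(fmt.Sprintf("doc-%d", i))
		}(i)
	}
	wg.Wait()
	// All three unrelated operations ended up in one shared batch, and
	// none of the callers controls when that batch is actually flushed.
	fmt.Println(len(e.flush())) // prints 3
}
```

Even with a mutex the grouping is uncontrolled: whichever caller flushes first sends everyone else's pending instructions along with its own.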

The motivation for this refactor is to:

  • Eliminate accidental sharing of batch state between independent operations.
  • Prevent concurrency bugs arising from multiple goroutines operating on the same batch.
  • Allow child operations (e.g., during recursive updates) to be grouped and sent together in a single, well-defined batch.
  • Provide explicit control for cases where the service needs to manage batching manually (such as large space indexing).

Implementation Details

  • Each instruction that operates on the engine (Upsert, Move, Delete, Restore, Purge) now creates and uses its own batch.
  • All child operations triggered during a recursive process (such as Move or space indexing) are added to the same batch as their parent operation, ensuring atomicity and performance.
  • The batch lifecycle is scoped to the operation, and batches are flushed automatically at the completion of the instruction unless external control is requested.
  • For special cases like space indexing, the service can explicitly take ownership of a batch, perform multiple operations on it, and decide when to flush.
  • The internal orchestration ensures that concurrency is managed safely, and no two operations inadvertently share batch state.
  • This design increases reliability and predictability, and significantly improves performance for bulk operations by reducing the number of requests sent to the backend.
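The orchestration described above can be sketched roughly as follows (hypothetical names, not the actual OpenCloud API): each operation runs against its own batch and flushes it on completion, unless the service has explicitly taken ownership of a batch, in which case all operations join it and the service decides when to flush.

```go
package main

import "fmt"

// batch is a stand-in for a bleve batch.
type batch struct{ ops []string }

// engine models the NEW design: no shared batch state between operations.
type engine struct {
	external *batch     // set while the service owns a batch (e.g. space indexing)
	flushed  [][]string // what was sent to the backend, for illustration
}

// withBatch runs fn against the externally owned batch if one exists;
// otherwise it creates a fresh batch scoped to this operation and
// flushes it automatically when fn returns.
func (e *engine) withBatch(fn func(*batch)) {
	if e.external != nil {
		fn(e.external) // join the caller's batch; no flush here
		return
	}
	b := &batch{}
	fn(b)
	e.flush(b)
}

func (e *engine) flush(b *batch) { e.flushed = append(e.flushed, b.ops) }

func (e *engine) Upsert(id string) {
	e.withBatch(func(b *batch) { b.ops = append(b.ops, "upsert:"+id) })
}

// Move updates the moved node and all nested children in ONE batch.
func (e *engine) Move(id string, children []string) {
	e.withBatch(func(b *batch) {
		b.ops = append(b.ops, "move:"+id)
		for _, c := range children {
			b.ops = append(b.ops, "move:"+c)
		}
	})
}

// BeginBatch/EndBatch give the service explicit control for bulk work.
func (e *engine) BeginBatch() { e.external = &batch{} }
func (e *engine) EndBatch()   { b := e.external; e.external = nil; e.flush(b) }

func main() {
	e := &engine{}
	e.Upsert("a")                       // own batch, flushed on completion
	e.Move("dir", []string{"f1", "f2"}) // parent + children, one batch

	e.BeginBatch() // space indexing: service owns the batch
	e.Upsert("b")
	e.Upsert("c")
	e.EndBatch() // single flush for the whole walk

	fmt.Println(len(e.flushed)) // prints 3
}
```

The key property is that two concurrent operations can never observe each other's batch unless one explicitly owns it: the per-operation batch is a local variable, not engine state.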

How Has This Been Tested?

  • Existing unit tests

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Technical debt
  • Tests only (no source changes)

Checklist:

  • Code changes
  • Unit tests added
  • Acceptance tests added
  • Documentation added

@aduffeck aduffeck marked this pull request as ready for review August 7, 2025 13:43
@aduffeck aduffeck marked this pull request as draft August 7, 2025 14:09

fschade commented Aug 14, 2025

@fschade fschade self-assigned this Aug 14, 2025
@github-project-automation github-project-automation bot moved this to Qualification in OpenCloud Team Board Aug 14, 2025
@fschade fschade moved this from Qualification to In Progress in OpenCloud Team Board Aug 14, 2025
@fschade fschade marked this pull request as ready for review August 14, 2025 15:04

@rhafer rhafer left a comment

Quite a mouthful, but lgtm as far as I understand.

@fschade fschade merged commit 2706796 into opencloud-eu:main Sep 2, 2025
54 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in OpenCloud Team Board Sep 2, 2025
@openclouders openclouders mentioned this pull request Sep 2, 2025