Conversation


@aduffeck aduffeck commented Aug 7, 2025

Description

This PR refactors the batch processing implementation within the search service to address concurrency issues and operational inconsistencies present in the previous design. The core change is that each individual operation now receives its own batch, ensuring that operations are isolated and controlled within their context, including child operations that occur during recursive traversals. The orchestration of batch execution is now automatic, but still allows external control for specific scenarios such as space indexing.

Motivation and Context

Previously, the search service's batching mechanism had several weaknesses:

  • The search engine (currently Bleve) maintained a single batch and a single index, which could be started externally by the service, e.g., during a recursive walk for space indexing.
  • Once a batch was started externally, all engine-internal operations also operated on that same batch, leading to unpredictable request timing and potential for concurrency issues. Since the service could be triggered by events (e.g., via NATS), this resulted in multiple goroutines writing to the same batch without control over whether instructions were executed immediately or batched for later flushing.
  • Operations such as Move required updating the parent and all nested children, which previously resulted in either many individual requests or, once batching was introduced, uncontrolled grouping into the shared batch.
  • The lack of clear batch boundaries made it difficult to reason about when updates were actually sent to the search backend, causing both performance and reliability issues.
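The shared-batch hazard described above can be made concrete with a small sketch (simplified, hypothetical types; this is not the actual OpenCloud code): a single engine-level batch that every event-triggered goroutine writes into, so unrelated operations get grouped together and flushed at an unpredictable time.

```go
package main

import (
	"fmt"
	"sync"
)

// engine models the OLD design: one shared batch for the whole engine.
// (Hypothetical stand-in types; a []string plays the role of a bleve batch.)
type engine struct {
	mu    sync.Mutex
	batch []string
}

// upsert funnels every operation into the same shared batch, regardless
// of which event or goroutine triggered it.
func (e *engine) upsert(id string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.batch = append(e.batch, "upsert:"+id)
}

// flush sends the accumulated batch and resets it.
func (e *engine) flush() []string {
	e.mu.Lock()
	defer e.mu.Unlock()
	out := e.batch
	e.batch = nil
	return out
}

func main() {
	e := &engine{}
	var wg sync.WaitGroup
	// Three independent events (e.g. arriving via NATS) in parallel.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			e.upsert(fmt.Sprintf("doc-%d", i))
		}(i)
	}
	wg.Wait()
	// All three unrelated operations ended up in one shared batch, and
	// none of the callers controls when that batch is actually flushed.
	fmt.Println(len(e.flush())) // prints 3
}
```

Even with a mutex the grouping is uncontrolled: whichever caller flushes first sends everyone else's pending instructions along with its own.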

The motivation for this refactor is to:

  • Eliminate accidental sharing of batch state between independent operations.
  • Prevent concurrency bugs arising from multiple goroutines operating on the same batch.
  • Allow child operations (e.g., during recursive updates) to be grouped and sent together in a single, well-defined batch.
  • Provide explicit control for cases where the service needs to manage batching manually (such as large space indexing).

Implementation Details

  • Each instruction that operates on the engine (Upsert, Move, Delete, Restore, Purge) now creates and uses its own batch.
  • All child operations triggered during a recursive process (such as Move or space indexing) are added to the same batch as their parent operation, ensuring atomicity and performance.
  • The batch lifecycle is scoped to the operation, and batches are flushed automatically at the completion of the instruction unless external control is requested.
  • For special cases like space indexing, the service can explicitly take ownership of a batch, perform multiple operations on it, and decide when to flush.
  • The internal orchestration ensures that concurrency is managed safely, and no two operations inadvertently share batch state.
  • This design increases reliability and predictability, and significantly improves performance for bulk operations by reducing the number of requests sent to the backend.
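The orchestration described above can be sketched roughly as follows (hypothetical names, not the actual OpenCloud API): each operation runs against its own batch and flushes it on completion, unless the service has explicitly taken ownership of a batch, in which case all operations join it and the service decides when to flush.

```go
package main

import "fmt"

// batch is a stand-in for a bleve batch.
type batch struct{ ops []string }

// engine models the NEW design: no shared batch state between operations.
type engine struct {
	external *batch     // set while the service owns a batch (e.g. space indexing)
	flushed  [][]string // what was sent to the backend, for illustration
}

// withBatch runs fn against the externally owned batch if one exists;
// otherwise it creates a fresh batch scoped to this operation and
// flushes it automatically when fn returns.
func (e *engine) withBatch(fn func(*batch)) {
	if e.external != nil {
		fn(e.external) // join the caller's batch; no flush here
		return
	}
	b := &batch{}
	fn(b)
	e.flush(b)
}

func (e *engine) flush(b *batch) { e.flushed = append(e.flushed, b.ops) }

func (e *engine) Upsert(id string) {
	e.withBatch(func(b *batch) { b.ops = append(b.ops, "upsert:"+id) })
}

// Move updates the moved node and all nested children in ONE batch.
func (e *engine) Move(id string, children []string) {
	e.withBatch(func(b *batch) {
		b.ops = append(b.ops, "move:"+id)
		for _, c := range children {
			b.ops = append(b.ops, "move:"+c)
		}
	})
}

// BeginBatch/EndBatch give the service explicit control for bulk work.
func (e *engine) BeginBatch() { e.external = &batch{} }
func (e *engine) EndBatch()   { b := e.external; e.external = nil; e.flush(b) }

func main() {
	e := &engine{}
	e.Upsert("a")                       // own batch, flushed on completion
	e.Move("dir", []string{"f1", "f2"}) // parent + children, one batch

	e.BeginBatch() // space indexing: service owns the batch
	e.Upsert("b")
	e.Upsert("c")
	e.EndBatch() // single flush for the whole walk

	fmt.Println(len(e.flushed)) // prints 3
}
```

The key property is that two concurrent operations can never observe each other's batch unless one explicitly owns it: the per-operation batch is a local variable, not engine state.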

How Has This Been Tested?

  • Existing unit tests

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Technical debt
  • Tests only (no source changes)

Checklist:

  • Code changes
  • Unit tests added
  • Acceptance tests added
  • Documentation added

@aduffeck aduffeck marked this pull request as ready for review August 7, 2025 13:43
@aduffeck aduffeck marked this pull request as draft August 7, 2025 14:09

fschade commented Aug 14, 2025

@fschade fschade self-assigned this Aug 14, 2025
@github-project-automation github-project-automation bot moved this to Qualification in OpenCloud Team Board Aug 14, 2025
@fschade fschade moved this from Qualification to In Progress in OpenCloud Team Board Aug 14, 2025
@fschade fschade marked this pull request as ready for review August 14, 2025 15:04

@rhafer rhafer left a comment

Quite a mouthful, but lgtm as far as I understand.

@fschade fschade merged commit 2706796 into opencloud-eu:main Sep 2, 2025
54 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in OpenCloud Team Board Sep 2, 2025
@openclouders openclouders mentioned this pull request Sep 2, 2025