GPU codec: fall back to CPU graph build on flush when GPU is busy #149373

Open
ChrisHegarty wants to merge 7 commits into elastic:main from ChrisHegarty:gpu-cpu-fallback-on-flush
@ChrisHegarty
Contributor

During heavy indexing, multiple flush operations can compete for GPU resources at the same time. Previously, flush blocked until a GPU resource became available, which stalled the indexing thread and could cause cascading latency. This is particularly problematic when the GPU is already saturated with merges or other flushes: the thread sits idle waiting for its turn.

I've changed the flush path to use a non-blocking tryAcquire instead of a blocking acquire. If the GPU is busy (all resources locked or insufficient memory), flush now builds the HNSW graph on CPU using Lucene's HnswGraphBuilder. The resulting graph is written in the same Lucene99 format, so it's fully searchable by the standard reader. This means flush never blocks on GPU availability: it always makes progress, though the CPU build may be slower for that one, relatively small, new segment.
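
For readers skimming the description, the flush decision above can be pictured roughly as follows. This is a minimal sketch, not code from this PR: `GpuPoolSketch` and `FlushSketch` are hypothetical names, a `java.util.concurrent.Semaphore` stands in for the real resource manager, and the two build steps are reduced to placeholder return values.

```java
import java.util.concurrent.Semaphore;

// Hypothetical stand-in for a GPU resource pool; NOT the real CuVSResourceManager.
final class GpuPoolSketch {
    private final Semaphore slots;

    GpuPoolSketch(int gpuSlots) {
        this.slots = new Semaphore(gpuSlots);
    }

    // Non-blocking: returns true only if a slot is free right now.
    boolean tryAcquire() {
        return slots.tryAcquire();
    }

    void release() {
        slots.release();
    }
}

final class FlushSketch {
    // Returns which path built the graph for this flush: "GPU" or "CPU".
    static String flush(GpuPoolSketch pool) {
        if (pool.tryAcquire()) {
            try {
                return "GPU"; // placeholder for the GPU graph build
            } finally {
                pool.release(); // always hand the slot back
            }
        }
        // GPU busy: do not wait, build on CPU instead
        // (placeholder for an HnswGraphBuilder-based build).
        return "CPU";
    }
}
```

The point of the sketch is only that the flush thread never parks: either it gets a slot immediately or it moves straight to the CPU path.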

To support this, I added tryAcquire to the CuVSResourceManager interface. Both acquire and tryAcquire now delegate to a shared doAcquire implementation with a nonBlocking flag, and I've added a reason parameter for diagnostics so we can see in logs which operation is acquiring or waiting for resources.
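
One way to picture the shared-implementation shape described above is sketched below. It is an illustration under assumptions: `SharedAcquireSketch`, its boolean return values, and the in-memory log are all hypothetical, and the actual `CuVSResourceManager` signatures and return types may well differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;

// Hypothetical sketch; not the real CuVSResourceManager API.
final class SharedAcquireSketch {
    private final Semaphore slots;
    private final List<String> log = new ArrayList<>();

    SharedAcquireSketch(int gpuSlots) {
        this.slots = new Semaphore(gpuSlots);
    }

    // Blocking variant: waits until a slot is available.
    boolean acquire(String reason) throws InterruptedException {
        return doAcquire(reason, false);
    }

    // Non-blocking variant: returns false immediately if no slot is free.
    boolean tryAcquire(String reason) {
        try {
            return doAcquire(reason, true);
        } catch (InterruptedException e) {
            // Not expected on the non-blocking path; preserve the interrupt flag.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Single implementation shared by both entry points.
    private boolean doAcquire(String reason, boolean nonBlocking) throws InterruptedException {
        if (nonBlocking) {
            boolean acquired = slots.tryAcquire();
            record(reason + (acquired ? ": acquired" : ": busy"));
            return acquired;
        }
        slots.acquire();
        record(reason + ": acquired");
        return true;
    }

    void release() {
        slots.release();
    }

    private synchronized void record(String entry) {
        log.add(entry);
    }

    // Snapshot of the diagnostics recorded so far.
    synchronized List<String> log() {
        return new ArrayList<>(log);
    }
}
```

Carrying the `reason` string through the single `doAcquire` path means every acquisition attempt, blocking or not, is recorded in one place.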

I've added tests at three levels: unit tests for the tryAcquire mechanics (including a concurrent contention test), a WriteGraphTests class that validates that the CPU fallback produces output byte-identical to Lucene's, and two mixed-path format tests that exercise both GPU and CPU paths within the same index on GPU nodes via a randomly-failing resource manager.
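
The randomly-failing resource manager idea can be illustrated with a small sketch. Everything here is hypothetical (`RandomlyFailingAcquireSketch`, `MixedPathSketch`, the `failureRate` parameter) and only mimics the shape of such a test double; the test classes in this PR are not reproduced.

```java
import java.util.Random;
import java.util.function.BooleanSupplier;

// Hypothetical test double: a "tryAcquire" that fails at a configurable rate.
final class RandomlyFailingAcquireSketch implements BooleanSupplier {
    private final Random random;
    private final double failureRate; // 0.0 = never fail, 1.0 = always fail

    RandomlyFailingAcquireSketch(long seed, double failureRate) {
        this.random = new Random(seed); // seeded so a failing run can be replayed
        this.failureRate = failureRate;
    }

    @Override
    public boolean getAsBoolean() {
        // nextDouble() is in [0.0, 1.0), so the two extreme rates are deterministic.
        return random.nextDouble() >= failureRate;
    }
}

final class MixedPathSketch {
    // Simulates flushing several segments; returns {gpuBuilt, cpuBuilt}.
    static int[] flushSegments(BooleanSupplier tryAcquire, int segments) {
        int gpu = 0;
        int cpu = 0;
        for (int i = 0; i < segments; i++) {
            if (tryAcquire.getAsBoolean()) {
                gpu++; // placeholder for the GPU build path
            } else {
                cpu++; // placeholder for the CPU fallback path
            }
        }
        return new int[] { gpu, cpu };
    }
}
```

With a failure rate strictly between 0 and 1, a single index ends up with segments from both paths, which is the situation the mixed-path format tests are meant to cover.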

When flushing, use tryAcquire (non-blocking) to attempt GPU resource
acquisition. If the GPU is busy, fall back to building the HNSW graph
on CPU using HnswGraphBuilder. This avoids blocking flush threads
waiting for GPU resources during heavy indexing.

Also adds a `reason` parameter to acquire/tryAcquire for improved
diagnostics, and refactors both methods to share a common doAcquire
implementation.

@ChrisHegarty added the >bug, :Search Relevance/Vectors, Team:Search Relevance, v9.5.0, v9.3.5, and v9.4.2 labels on May 19, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine
Collaborator

Hi @ChrisHegarty, I've created a changelog YAML for you.

@github-actions
Contributor

github-actions Bot commented May 19, 2026

🔍 Preview links for changed docs

⏳ Building and deploying preview... View progress

This comment will be updated with preview links when the build is complete.

@github-actions
Contributor

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.

When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

@ChrisHegarty added the test-gpu label on May 19, 2026

2 participants