fix: stabilize queue and move file locations to _locations index#135

Merged
kapral18 merged 1 commit into elastic:main from kapral18:fix/queue-stability-and-collisions
Jan 13, 2026

Conversation


@kapral18 kapral18 commented Dec 15, 2025

Summary

This PR fixes several queue/worker correctness issues and introduces a locations-first Elasticsearch storage model.

Elasticsearch data model (new)

Given a base index name <index>:

  • <index>: content-deduplicated chunk documents (semantic search + rich metadata)
  • <index>_locations: one document per chunk occurrence (file path + line range + directory/git metadata)
  • <index>_settings: per-index state (e.g. last indexed commit per branch)

Chunk docs are keyed by a stable chunk_id (the document _id in <index>). Each occurrence is keyed independently in <index>_locations and references chunk_id.

Fixes

How it works (complete, code-level mental model)

0) Identifiers (what is stable and why)

Chunk document id (chunk_id)

  • chunk_id is the Elasticsearch _id used in <index>.
  • It is computed as a SHA256 over content identity only:
chunk_id = sha256(type + ":" + language + ":" + (kind||"") + ":" + (containerPath||"") + ":" + content)

Important invariants:

  • It intentionally does not include file-specific metadata (path, branch, line numbers).
  • Identical content in multiple files maps to the same chunk_id.

Location document id (idempotency key)

  • Each occurrence is stored as one doc in <index>_locations.
  • The _id is another stable SHA256 so retries overwrite the same doc (no duplicates):
location_id = sha256(chunk_id + ":" + filePath + ":" + startLine + ":" + endLine + ":" + (git_branch||""))
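The two hash computations above can be sketched as follows (function and field names are illustrative, not the actual implementation; the concatenation order follows the formulas in this description):

```typescript
import { createHash } from "node:crypto";

const sha256 = (s: string): string =>
  createHash("sha256").update(s).digest("hex");

// Content-identity id: deliberately excludes path/branch/line numbers,
// so identical content in different files collapses to one chunk doc.
function chunkId(c: {
  type: string;
  language: string;
  kind?: string;
  containerPath?: string;
  content: string;
}): string {
  return sha256(
    `${c.type}:${c.language}:${c.kind ?? ""}:${c.containerPath ?? ""}:${c.content}`,
  );
}

// Occurrence id: includes the file-specific fields, so each occurrence
// gets its own stable _id in <index>_locations and retries overwrite
// the same doc instead of creating duplicates.
function locationId(l: {
  chunkId: string;
  filePath: string;
  startLine: number;
  endLine: number;
  gitBranch?: string;
}): string {
  return sha256(
    `${l.chunkId}:${l.filePath}:${l.startLine}:${l.endLine}:${l.gitBranch ?? ""}`,
  );
}
```

Because `chunkId` hashes content identity only, the same snippet in two files yields one chunk doc but two distinct location docs.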

1) Index creation / lifecycle

  • Full index creates: <index>, <index>_settings, <index>_locations.
  • Incremental index ensures <index>_locations exists before doing deletes/enqueues (so existing deployments can upgrade without requiring a clean reindex just to create the new index).
  • Standalone worker also ensures <index> and <index>_locations exist before dequeuing work (so it can’t leave rows stuck in processing due to missing indices).

2) Indexing flow (two-phase write)

Parser -> sqlite queue.db -> IndexerWorker -> Elasticsearch

For each input chunk occurrence (one queued row):
  - compute chunk_id (content identity)
  - create chunk doc in <index> (id=chunk_id)
  - create/overwrite location doc in <index>_locations (id=location_id)

2.1 Chunk docs (<index>) are written via bulk create

  • The worker groups input chunks by chunk_id (content dedupe).
  • It sends a bulk request with create operations for those ids.
  • 409 conflicts are treated as success (expected when the chunk already exists).

Why create (not index/update):

  • Writing the same chunk doc again can cause re-inference / unnecessary work for semantic_text.
  • create makes “first writer wins” explicit; later occurrences don’t mutate chunk docs.
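The create-and-tolerate-409 convention can be sketched as a small predicate over Elasticsearch bulk response items (the item shape follows the bulk API; the helper name is illustrative):

```typescript
// One item of an Elasticsearch bulk response for a `create` operation.
interface BulkCreateItem {
  create: {
    _id: string;
    status: number;
    error?: { type: string; reason: string };
  };
}

// A chunk create is "ok" if it succeeded (2xx) or the doc already
// existed (409 conflict): first writer wins, and later occurrences must
// not rewrite the chunk doc, which could re-trigger semantic_text
// inference. Any other status is a real failure for that chunk_id.
function chunkCreateSucceeded(item: BulkCreateItem): boolean {
  return item.create.status < 300 || item.create.status === 409;
}
```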

2.2 Location docs (<index>_locations) are written via bulk index

  • One location doc per occurrence (filePath + range + metadata).
  • Uses index (overwrite) with the stable location_id so retries are idempotent.
  • If the exact same location_id appears multiple times within a single batch (e.g. duplicate queue rows), it is de-duplicated and the error (if any) is mapped to all affected inputs.
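The in-batch dedupe with error fan-out might look like this (a minimal sketch; the real worker carries more per-row state):

```typescript
interface LocationInput {
  inputIndex: number; // position of the queued row in the batch
  locationId: string; // stable occurrence id (see section 0)
  doc: Record<string, unknown>;
}

// Collapse inputs that share a location_id so the bulk body contains
// each id once, but remember every input index that maps to it, so a
// per-item bulk error can be fanned out to all affected inputs.
function dedupeByLocationId(inputs: LocationInput[]) {
  const byId = new Map<
    string,
    { doc: Record<string, unknown>; inputIndexes: number[] }
  >();
  for (const input of inputs) {
    const entry = byId.get(input.locationId);
    if (entry) entry.inputIndexes.push(input.inputIndex);
    else byId.set(input.locationId, { doc: input.doc, inputIndexes: [input.inputIndex] });
  }
  return byId;
}
```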

3) Failure semantics (what gets retried)

Indexing returns a per-input result:

  • succeeded[]: input rows that were fully persisted (chunk doc exists or already existed and location doc was indexed)
  • failed[]: input rows that hit an Elasticsearch error on either phase

Key behaviors:

  • If a chunk create item fails for a given chunk_id, all inputs that map to that chunk_id are marked failed for that batch.
  • Location indexing is skipped for any input already marked failed in chunk phase.
  • If a location index item fails for a given location_id, all inputs that map to that location_id are marked failed.
  • If the entire bulk request throws (network/cluster failure), all affected inputs are marked failed.

The worker uses these results to:

  • commit successful queue rows
  • requeue failed queue rows (with retry tracking)
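The commit/requeue settlement reduces to a small partitioning step; a sketch with the SQLite calls abstracted as callbacks (the real worker also tracks retry counts on the requeued rows):

```typescript
// Per-batch result from the indexer: queue row ids that were fully
// persisted vs. ones that hit an Elasticsearch error in either phase.
interface BatchResult {
  succeeded: number[];
  failed: number[];
}

// Commit fully persisted rows; requeue failed ones for retry.
function settleBatch(
  result: BatchResult,
  commit: (rowIds: number[]) => void,
  requeue: (rowIds: number[]) => void,
): void {
  if (result.succeeded.length > 0) commit(result.succeeded);
  if (result.failed.length > 0) requeue(result.failed);
}
```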

4) Read/query patterns (how consumers should query)

Semantic search

  • Query <index>.semantic_text.
  • Results are chunk docs (chunk_id is the hit _id).

File reconstruction / file-level filtering

  • Query <index>_locations by filePath (and optionally directoryPath, line ranges, branch).
  • Join to chunk docs via mget on <index> using chunk_id.
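The two-step join might be assembled like this (request-body builders only; field names follow this description and the exact mapping may differ):

```typescript
// Step 1: find occurrences for a file in <index>_locations.
function locationsQuery(index: string, filePath: string) {
  return {
    index: `${index}_locations`,
    query: { term: { filePath } },
    _source: ["chunk_id", "startLine", "endLine"],
  };
}

// Step 2: fetch the content-deduplicated chunk docs by _id. Multiple
// occurrences of the same content share one chunk_id, so dedupe first.
function chunksMget(index: string, chunkIds: string[]) {
  return { index, ids: Array.from(new Set(chunkIds)) };
}
```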

Directory aggregations

  • Aggregate on <index>_locations.directoryPath.
  • If you need language/kind/symbol breakdowns, join the top chunk_ids back to <index>.

5) Incremental deletion (changed/deleted files)

changed/deleted file paths
  -> scan <index>_locations for those filePath values (PIT + search_after + _shard_doc)
  -> bulk delete matching location docs
  -> refresh <index>_locations (visibility for orphan check)
  -> collect affected chunk_id values
  -> delete orphan chunk docs in <index> (chunk_id with zero remaining locations)

Implementation details:

  • Location deletion uses PIT + _shard_doc to paginate safely and keep memory bounded.
  • After deleting location docs, the index is refreshed so the orphan check is correct.
  • Orphan cleanup checks which chunk_ids still exist in <index>_locations via a terms aggregation, then bulk deletes chunk docs for ids that are absent.
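The bounded-memory pagination at the heart of the deletion scan can be sketched like this (the search call is abstracted so the sketch runs without a cluster; the real code opens a PIT and sorts on `_shard_doc` for a stable order):

```typescript
type Hit = { sort: unknown[]; chunkId: string };

// One page of hits, resuming after the given sort cursor.
type SearchPage = (after: unknown[] | undefined, size: number) => Hit[];

// Walk <index>_locations page by page via search_after, collecting the
// affected chunk_ids for the orphan check. Memory stays bounded by the
// page size regardless of how many location docs match.
function collectChunkIds(search: SearchPage, pageSize = 500): Set<string> {
  const ids = new Set<string>();
  let after: unknown[] | undefined;
  for (;;) {
    const hits = search(after, pageSize);
    if (hits.length === 0) break;
    for (const h of hits) ids.add(h.chunkId);
    after = hits[hits.length - 1].sort; // resume cursor for the next page
  }
  return ids;
}
```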

Why this design (vs alternatives)

This PR needs to solve two constraints at once:

  • Correctness at scale: a chunk can appear in many files; we must not lose occurrences.
  • Operational stability: writes/updates must stay predictable on very large repos.

Alternatives we considered:

  1. Keep one doc per occurrence in <index> (e.g. make _id = hash(filePath + content))
  • Pros: single-index, simple read model.
  • Cons:
    • duplicates semantic_text / vectors for the same content (index bloat)
    • increases semantic inference work proportionally to the number of occurrences
    • makes “show me the canonical chunk for this content” harder
  2. Store all occurrences on the chunk doc (e.g. keep filePaths[]/fileCount on <index> and append/merge)
  • Pros: a single query can return the chunk + its locations.
  • Cons:
    • produces write hot-spots and very large documents for popular chunks
    • update contention (multiple writers touching the same doc)
    • forces artificial guardrails (caps) to keep documents bounded
  3. Store occurrences in <index>_settings instead of a dedicated index
  • Pros: avoids introducing a new index name.
  • Cons:
    • <index>_settings is intended to be a small, low-churn state index (e.g. commit hashes)
    • mixing high-cardinality, high-write occurrence documents into it increases the blast radius
    • prevents tuning lifecycle/refresh/retention separately from settings
  4. Use join/parent-child or nested “occurrence docs” inside <index>
  • Cons: more complex mappings/queries, and still ties occurrence scaling to the primary index.

Why <index>_locations wins here:

  • Unbounded locations without mega-documents (one occurrence = one doc).
  • Predictable write behavior: no single chunk doc grows with repo size.
  • Independent operational control: <index>_locations can be refreshed/queried/retained independently of settings and chunk content.

Drawbacks of this solution (and follow-up plan)

This two-index model introduces real trade-offs:

  • Synchronization surface: chunk writes and location writes can partially fail.
  • Query joins: some tools require a two-step query (locations → mget chunks).
  • Migration complexity: introducing a new index name requires careful cutovers.

Follow-up in #132 (alias-first architecture) will improve these:

  • Use versioned indices (e.g. <alias> -> <index>_vN, <alias>_locations -> <index>_locations_vN) with atomic alias cutover, enabling safer migrations and reindexes.
  • Make it easier to run backfill / repair strategies when one side (chunks vs locations) is temporarily out of sync.
  • Reduce downtime risk by allowing parallel build + switch instead of in-place mutation.

Breaking change (clean reindex required)

  • <index> no longer stores file-level fields (filePath, directory fields, startLine/endLine, git_*).
  • File-level metadata lives exclusively in <index>_locations.
  • A clean reindex is required when migrating existing indices.

MCP impact

Test plan

  • npm run test:unit
  • npm run test:integration

Follow-up

@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch from 3b47fee to f7fb1da Compare December 15, 2025 02:34
Coolomina
Coolomina previously approved these changes Dec 15, 2025
@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch 2 times, most recently from 1aef7ff to 64ccd34 Compare December 15, 2025 19:14
@kapral18 kapral18 changed the title fix: resolve queue crashes, stuck tasks, and content collisions fix: resolve queue crashes, collisions, and incremental index bugs Dec 15, 2025
@kapral18 kapral18 marked this pull request as draft December 15, 2025 20:17
@kapral18 kapral18 changed the title fix: resolve queue crashes, collisions, and incremental index bugs fix: resolve queue crashes, collisions, and incremental index bugs [DONT MERGE} Dec 15, 2025
@kapral18 kapral18 changed the title fix: resolve queue crashes, collisions, and incremental index bugs [DONT MERGE} fix: resolve queue crashes, collisions, and incremental index bugs [DONT MERGE] Dec 15, 2025
@kapral18 kapral18 changed the title fix: resolve queue crashes, collisions, and incremental index bugs [DONT MERGE] fix: stabilize queue and aggregate identical chunks across files Dec 15, 2025
@kapral18 kapral18 marked this pull request as ready for review December 15, 2025 22:48

@simianhacker simianhacker left a comment

I get this error when I try and index locally:

[2025-12-16T21:19:16.202Z] [INFO] Bulk operation completed for 1 chunks
[2025-12-16T21:19:16.202Z] [ERROR] Partial bulk failure: 1/1 documents failed {"errors":"[\n  {\n    \"chunk_hash\": \"55ef41cf5036f0991e2febeca14e4663ddac04a20bd76ac584cb6cfe9d314c6f\",\n    \"inputIndex\": 0,\n    \"error\": {\n      \"type\": \"status_exception\",\n      \"reason\": \"Cannot apply update with a script on indices that contain [semantic_text] field(s)\"\n    }\n  }\n]"}

@simianhacker

I'm a little uneasy about the filePaths and filePath change. I think this is what is causing us to try and do an updateByQuery. I would rather generate an ID based off the chunk_hash + filePath and have duplicate documents in the index to keep things simple. Also filePaths being a nested field comes with caveats that are difficult for LLMs when using KQL for filtering. The filter goes from filePath: *esql* to filePaths: { path: *esql* }. LLMs typically will try and do filePaths.path: *esql* which is invalid. Even with clear instructions, this always gets messed up.


kapral18 commented Dec 17, 2025

@simianhacker sorry, this wasn't done yet; the painless script approach was an iteration. I should've marked this as a draft.

@kapral18 kapral18 marked this pull request as draft December 17, 2025 10:18
@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch from dee9298 to fe3d091 Compare January 1, 2026 16:07
@kapral18 kapral18 marked this pull request as ready for review January 1, 2026 17:10
@kapral18 kapral18 requested a review from simianhacker January 1, 2026 17:10
@kapral18 kapral18 marked this pull request as draft January 1, 2026 18:50
kapral18 added a commit to kapral18/semantic-code-search-mcp-server that referenced this pull request Jan 1, 2026
…covery

- Reconstruct files using nested filePaths[].{path,startLine,endLine}
- Aggregate directories via nested filePaths.directoryPath for correct discovery
- Widen filePath typing to string | string[]
- Add tests covering aggregated index behavior

Blocked by: elastic/semantic-code-search-indexer#135
@kapral18 kapral18 marked this pull request as ready for review January 1, 2026 22:41

kapral18 commented Jan 1, 2026

Hey @simianhacker, the PR has changed a lot since your review, and I also updated the description. Please take another look, especially at the Why this design (vs alternatives) section, as I anticipate there will be concerns since the change is pretty invasive.

I also added the complementary change in elastic/semantic-code-search-mcp-server#36. You will need it to test end-to-end on the MCP side. I tested locally and everything checks out so far.

This architecture is, to me, a perfect long-term middle ground that plays really well into our next change with the alias-first architecture. Can you give this new structure a try with a clean reindex and share your thoughts/approve?

@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch from ef4d1a6 to a154f88 Compare January 2, 2026 23:56
@kapral18 kapral18 changed the title fix: stabilize queue and aggregate identical chunks across files fix!: stabilize queue and move file locations to _locations index Jan 4, 2026
@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch from a154f88 to 0a326c4 Compare January 4, 2026 02:32
@kapral18 kapral18 requested a review from Coolomina January 4, 2026 10:15
@kapral18 kapral18 changed the title fix!: stabilize queue and move file locations to _locations index fix: stabilize queue and move file locations to _locations index Jan 4, 2026
@kapral18 kapral18 requested a review from Copilot January 5, 2026 10:12


@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch from 0a326c4 to f26da79 Compare January 5, 2026 10:25
@kapral18 kapral18 requested a review from Copilot January 5, 2026 11:08


@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch from f26da79 to 7bd99d6 Compare January 5, 2026 11:27
@kapral18 kapral18 requested a review from Copilot January 5, 2026 11:47


- Store file paths and line ranges in <index>_locations (one doc per
occurrence)
- Keep <index> as content-deduplicated chunk docs; remove
filePaths/fileCount from primary mapping
- Rewrite indexing + deletion flows to join via chunk_id and clean up
orphan chunk docs

BREAKING CHANGE: primary chunk documents no longer store file-level
metadata; clean reindex required.
@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch from 7bd99d6 to 6c79e8f Compare January 12, 2026 23:37

@simianhacker simianhacker left a comment


@kapral18 kapral18 merged commit 56b5ccc into elastic:main Jan 13, 2026
7 checks passed
@kapral18 kapral18 deleted the fix/queue-stability-and-collisions branch January 13, 2026 23:57