fix: stabilize queue and move file locations to _locations index#135

Merged
kapral18 merged 1 commit into elastic:main from kapral18:fix/queue-stability-and-collisions
Jan 13, 2026

Conversation


@kapral18 kapral18 commented Dec 15, 2025

Summary

This PR fixes several queue/worker correctness issues and introduces a locations-first Elasticsearch storage model.

Elasticsearch data model (new)

Given a base index name <index>:

  • <index>: content-deduplicated chunk documents (semantic search + rich metadata)
  • <index>_locations: one document per chunk occurrence (file path + line range + directory/git metadata)
  • <index>_settings: per-index state (e.g. last indexed commit per branch)

Chunk docs are keyed by a stable chunk_id (the document _id in <index>). Each occurrence is keyed independently in <index>_locations and references chunk_id.

Fixes

How it works (complete, code-level mental model)

0) Identifiers (what is stable and why)

Chunk document id (chunk_id)

  • chunk_id is the Elasticsearch _id used in <index>.
  • It is computed as a SHA256 over content identity only:
chunk_id = sha256(type + ":" + language + ":" + (kind||"") + ":" + (containerPath||"") + ":" + content)

Important invariants:

  • It intentionally does not include file-specific metadata (path, branch, line numbers).
  • Identical content in multiple files maps to the same chunk_id.

Location document id (idempotency key)

  • Each occurrence is stored as one doc in <index>_locations.
  • The _id is another stable SHA256 so retries overwrite the same doc (no duplicates):
location_id = sha256(chunk_id + ":" + filePath + ":" + startLine + ":" + endLine + ":" + (git_branch||""))
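The two hash computations above can be sketched as follows (function and field names are illustrative, not the actual implementation; the concatenation order follows the formulas in this description):

```typescript
import { createHash } from "node:crypto";

const sha256 = (s: string): string =>
  createHash("sha256").update(s).digest("hex");

// Content-identity id: deliberately excludes path/branch/line numbers,
// so identical content in different files collapses to one chunk doc.
function chunkId(c: {
  type: string;
  language: string;
  kind?: string;
  containerPath?: string;
  content: string;
}): string {
  return sha256(
    `${c.type}:${c.language}:${c.kind ?? ""}:${c.containerPath ?? ""}:${c.content}`,
  );
}

// Occurrence id: includes the file-specific fields, so each occurrence
// gets its own stable _id in <index>_locations and retries overwrite
// the same doc instead of creating duplicates.
function locationId(l: {
  chunkId: string;
  filePath: string;
  startLine: number;
  endLine: number;
  gitBranch?: string;
}): string {
  return sha256(
    `${l.chunkId}:${l.filePath}:${l.startLine}:${l.endLine}:${l.gitBranch ?? ""}`,
  );
}
```

Because `chunkId` hashes content identity only, the same snippet in two files yields one chunk doc but two distinct location docs.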

1) Index creation / lifecycle

  • Full index creates: <index>, <index>_settings, <index>_locations.
  • Incremental index ensures <index>_locations exists before doing deletes/enqueues (so existing deployments can upgrade without requiring a clean reindex just to create the new index).
  • Standalone worker also ensures <index> and <index>_locations exist before dequeuing work (so it can’t leave rows stuck in processing due to missing indices).

2) Indexing flow (two-phase write)

Parser -> sqlite queue.db -> IndexerWorker -> Elasticsearch

For each input chunk occurrence (one queued row):
  - compute chunk_id (content identity)
  - create chunk doc in <index> (id=chunk_id)
  - create/overwrite location doc in <index>_locations (id=location_id)

2.1 Chunk docs (<index>) are written via bulk create

  • The worker groups input chunks by chunk_id (content dedupe).
  • It sends a bulk request with create operations for those ids.
  • 409 conflicts are treated as success (expected when the chunk already exists).

Why create (not index/update):

  • Writing the same chunk doc again can cause re-inference / unnecessary work for semantic_text.
  • create makes “first writer wins” explicit; later occurrences don’t mutate chunk docs.
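The create-and-tolerate-409 convention can be sketched as a small predicate over Elasticsearch bulk response items (the item shape follows the bulk API; the helper name is illustrative):

```typescript
// One item of an Elasticsearch bulk response for a `create` operation.
interface BulkCreateItem {
  create: {
    _id: string;
    status: number;
    error?: { type: string; reason: string };
  };
}

// A chunk create is "ok" if it succeeded (2xx) or the doc already
// existed (409 conflict): first writer wins, and later occurrences must
// not rewrite the chunk doc, which could re-trigger semantic_text
// inference. Any other status is a real failure for that chunk_id.
function chunkCreateSucceeded(item: BulkCreateItem): boolean {
  return item.create.status < 300 || item.create.status === 409;
}
```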

2.2 Location docs (<index>_locations) are written via bulk index

  • One location doc per occurrence (filePath + range + metadata).
  • Uses index (overwrite) with the stable location_id so retries are idempotent.
  • If the exact same location_id appears multiple times within a single batch (e.g. duplicate queue rows), it is de-duplicated and the error (if any) is mapped to all affected inputs.
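The in-batch dedupe with error fan-out might look like this (a minimal sketch; the real worker carries more per-row state):

```typescript
interface LocationInput {
  inputIndex: number; // position of the queued row in the batch
  locationId: string; // stable occurrence id (see section 0)
  doc: Record<string, unknown>;
}

// Collapse inputs that share a location_id so the bulk body contains
// each id once, but remember every input index that maps to it, so a
// per-item bulk error can be fanned out to all affected inputs.
function dedupeByLocationId(inputs: LocationInput[]) {
  const byId = new Map<
    string,
    { doc: Record<string, unknown>; inputIndexes: number[] }
  >();
  for (const input of inputs) {
    const entry = byId.get(input.locationId);
    if (entry) entry.inputIndexes.push(input.inputIndex);
    else byId.set(input.locationId, { doc: input.doc, inputIndexes: [input.inputIndex] });
  }
  return byId;
}
```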

3) Failure semantics (what gets retried)

Indexing returns a per-input result:

  • succeeded[]: input rows that were fully persisted (chunk doc exists or already existed and location doc was indexed)
  • failed[]: input rows that hit an Elasticsearch error on either phase

Key behaviors:

  • If a chunk create item fails for a given chunk_id, all inputs that map to that chunk_id are marked failed for that batch.
  • Location indexing is skipped for any input already marked failed in chunk phase.
  • If a location index item fails for a given location_id, all inputs that map to that location_id are marked failed.
  • If the entire bulk request throws (network/cluster failure), all affected inputs are marked failed.

The worker uses these results to:

  • commit successful queue rows
  • requeue failed queue rows (with retry tracking)
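The commit/requeue settlement reduces to a small partitioning step; a sketch with the SQLite calls abstracted as callbacks (the real worker also tracks retry counts on the requeued rows):

```typescript
// Per-batch result from the indexer: queue row ids that were fully
// persisted vs. ones that hit an Elasticsearch error in either phase.
interface BatchResult {
  succeeded: number[];
  failed: number[];
}

// Commit fully persisted rows; requeue failed ones for retry.
function settleBatch(
  result: BatchResult,
  commit: (rowIds: number[]) => void,
  requeue: (rowIds: number[]) => void,
): void {
  if (result.succeeded.length > 0) commit(result.succeeded);
  if (result.failed.length > 0) requeue(result.failed);
}
```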

4) Read/query patterns (how consumers should query)

Semantic search

  • Query <index>.semantic_text.
  • Results are chunk docs (chunk_id is the hit _id).

File reconstruction / file-level filtering

  • Query <index>_locations by filePath (and optionally directoryPath, line ranges, branch).
  • Join to chunk docs via mget on <index> using chunk_id.
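The two-step join might be assembled like this (request-body builders only; field names follow this description and the exact mapping may differ):

```typescript
// Step 1: find occurrences for a file in <index>_locations.
function locationsQuery(index: string, filePath: string) {
  return {
    index: `${index}_locations`,
    query: { term: { filePath } },
    _source: ["chunk_id", "startLine", "endLine"],
  };
}

// Step 2: fetch the content-deduplicated chunk docs by _id. Multiple
// occurrences of the same content share one chunk_id, so dedupe first.
function chunksMget(index: string, chunkIds: string[]) {
  return { index, ids: Array.from(new Set(chunkIds)) };
}
```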

Directory aggregations

  • Aggregate on <index>_locations.directoryPath.
  • If you need language/kind/symbol breakdowns, join the top chunk_ids back to <index>.

5) Incremental deletion (changed/deleted files)

changed/deleted file paths
  -> scan <index>_locations for those filePath values (PIT + search_after + _shard_doc)
  -> bulk delete matching location docs
  -> refresh <index>_locations (visibility for orphan check)
  -> collect affected chunk_id values
  -> delete orphan chunk docs in <index> (chunk_id with zero remaining locations)

Implementation details:

  • Location deletion uses PIT + _shard_doc to paginate safely and keep memory bounded.
  • After deleting location docs, the index is refreshed so the orphan check is correct.
  • Orphan cleanup checks which chunk_ids still exist in <index>_locations via a terms aggregation, then bulk deletes chunk docs for ids that are absent.
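The bounded-memory pagination at the heart of the deletion scan can be sketched like this (the search call is abstracted so the sketch runs without a cluster; the real code opens a PIT and sorts on `_shard_doc` for a stable order):

```typescript
type Hit = { sort: unknown[]; chunkId: string };

// One page of hits, resuming after the given sort cursor.
type SearchPage = (after: unknown[] | undefined, size: number) => Hit[];

// Walk <index>_locations page by page via search_after, collecting the
// affected chunk_ids for the orphan check. Memory stays bounded by the
// page size regardless of how many location docs match.
function collectChunkIds(search: SearchPage, pageSize = 500): Set<string> {
  const ids = new Set<string>();
  let after: unknown[] | undefined;
  for (;;) {
    const hits = search(after, pageSize);
    if (hits.length === 0) break;
    for (const h of hits) ids.add(h.chunkId);
    after = hits[hits.length - 1].sort; // resume cursor for the next page
  }
  return ids;
}
```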

Why this design (vs alternatives)

This PR needs to solve two constraints at once:

  • Correctness at scale: a chunk can appear in many files; we must not lose occurrences.
  • Operational stability: writes/updates must stay predictable on very large repos.

Alternatives we considered:

  1. Keep one doc per occurrence in <index> (e.g. make _id = hash(filePath + content))
  • Pros: single-index, simple read model.
  • Cons:
    • duplicates semantic_text / vectors for the same content (index bloat)
    • increases semantic inference work proportionally to the number of occurrences
    • makes “show me the canonical chunk for this content” harder
  2. Store all occurrences on the chunk doc (e.g. keep filePaths[]/fileCount on <index> and append/merge)
  • Pros: a single query can return the chunk + its locations.
  • Cons:
    • produces write hot-spots and very large documents for popular chunks
    • update contention (multiple writers touching the same doc)
    • forces artificial guardrails (caps) to keep documents bounded
  3. Store occurrences in <index>_settings instead of a dedicated index
  • Pros: avoids introducing a new index name.
  • Cons:
    • <index>_settings is intended to be a small, low-churn state index (e.g. commit hashes)
    • mixing high-cardinality, high-write occurrence documents into it increases the blast radius
    • prevents tuning lifecycle/refresh/retention separately from settings
  4. Use join/parent-child or nested “occurrence docs” inside <index>
  • Cons: more complex mappings/queries, and still ties occurrence scaling to the primary index.

Why <index>_locations wins here:

  • Unbounded locations without mega-documents (one occurrence = one doc).
  • Predictable write behavior: no single chunk doc grows with repo size.
  • Independent operational control: <index>_locations can be refreshed/queried/retained independently of settings and chunk content.

Drawbacks of this solution (and follow-up plan)

This two-index model introduces real trade-offs:

  • Synchronization surface: chunk writes and location writes can partially fail.
  • Query joins: some tools require a two-step query (locations → mget chunks).
  • Migration complexity: introducing a new index name requires careful cutovers.

Follow-up in #132 (alias-first architecture) will improve these:

  • Use versioned indices (e.g. <alias> -> <index>_vN, <alias>_locations -> <index>_locations_vN) with atomic alias cutover, enabling safer migrations and reindexes.
  • Make it easier to run backfill / repair strategies when one side (chunks vs locations) is temporarily out of sync.
  • Reduce downtime risk by allowing parallel build + switch instead of in-place mutation.

Breaking change (clean reindex required)

  • <index> no longer stores file-level fields (filePath, directory fields, startLine/endLine, git_*).
  • File-level metadata lives exclusively in <index>_locations.
  • A clean reindex is required when migrating existing indices.

MCP impact

Test plan

  • npm run test:unit
  • npm run test:integration

Follow-up

@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch from 3b47fee to f7fb1da Compare December 15, 2025 02:34
Coolomina
Coolomina previously approved these changes Dec 15, 2025
@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch 2 times, most recently from 1aef7ff to 64ccd34 Compare December 15, 2025 19:14
@kapral18 kapral18 changed the title fix: resolve queue crashes, stuck tasks, and content collisions fix: resolve queue crashes, collisions, and incremental index bugs Dec 15, 2025
@kapral18 kapral18 marked this pull request as draft December 15, 2025 20:17
@kapral18 kapral18 changed the title fix: resolve queue crashes, collisions, and incremental index bugs fix: resolve queue crashes, collisions, and incremental index bugs [DONT MERGE} Dec 15, 2025
@kapral18 kapral18 changed the title fix: resolve queue crashes, collisions, and incremental index bugs [DONT MERGE} fix: resolve queue crashes, collisions, and incremental index bugs [DONT MERGE] Dec 15, 2025
@kapral18 kapral18 changed the title fix: resolve queue crashes, collisions, and incremental index bugs [DONT MERGE] fix: stabilize queue and aggregate identical chunks across files Dec 15, 2025
@kapral18 kapral18 marked this pull request as ready for review December 15, 2025 22:48

@simianhacker simianhacker left a comment

I get this error when I try and index locally:

[2025-12-16T21:19:16.202Z] [INFO] Bulk operation completed for 1 chunks
[2025-12-16T21:19:16.202Z] [ERROR] Partial bulk failure: 1/1 documents failed {"errors":"[\n  {\n    \"chunk_hash\": \"55ef41cf5036f0991e2febeca14e4663ddac04a20bd76ac584cb6cfe9d314c6f\",\n    \"inputIndex\": 0,\n    \"error\": {\n      \"type\": \"status_exception\",\n      \"reason\": \"Cannot apply update with a script on indices that contain [semantic_text] field(s)\"\n    }\n  }\n]"}

@simianhacker

I'm a little uneasy about the filePaths and filePath change. I think this is what is causing us to try and do an updateByQuery. I would rather generate an ID based off the chunk_hash + filePath and have duplicate documents in the index to keep things simple. Also filePaths being a nested field comes with caveats that are difficult for LLMs when using KQL for filtering. The filter goes from filePath: *esql* to filePaths: { path: *esql* }. LLMs typically will try and do filePaths.path: *esql* which is invalid. Even with clear instructions, this always gets messed up.


kapral18 commented Dec 17, 2025

@simianhacker sorry, this wasn't done yet; the painless script approach was an iteration. I should've marked this as a draft.

@kapral18 kapral18 marked this pull request as draft December 17, 2025 10:18
@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch from dee9298 to fe3d091 Compare January 1, 2026 16:07
@kapral18 kapral18 marked this pull request as ready for review January 1, 2026 17:10
@kapral18 kapral18 requested a review from simianhacker January 1, 2026 17:10
@kapral18 kapral18 marked this pull request as draft January 1, 2026 18:50
kapral18 added a commit to kapral18/semantic-code-search-mcp-server that referenced this pull request Jan 1, 2026
…covery

- Reconstruct files using nested filePaths[].{path,startLine,endLine}
- Aggregate directories via nested filePaths.directoryPath for correct discovery
- Widen filePath typing to string | string[]
- Add tests covering aggregated index behavior

Blocked by: elastic/semantic-code-search-indexer#135
@kapral18 kapral18 marked this pull request as ready for review January 1, 2026 22:41

kapral18 commented Jan 1, 2026

Hey @simianhacker, the PR has changed a lot since your review, and I also updated the description. Please take another look, especially at the Why this design (vs alternatives) section, as I anticipate there will be concerns since the change is pretty invasive.

I also added the complementary change in elastic/semantic-code-search-mcp-server#36. You will need it to test end-to-end on the MCP side. I tested locally and everything checks out so far.

This architecture is, to me, a perfect long-term middle ground that plays really well into our next change with the alias-first architecture. Can you give this new structure a try with a clean reindex and share your thoughts/approve?

@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch from ef4d1a6 to a154f88 Compare January 2, 2026 23:56
@kapral18 kapral18 changed the title fix: stabilize queue and aggregate identical chunks across files fix!: stabilize queue and move file locations to _locations index Jan 4, 2026
@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch from a154f88 to 0a326c4 Compare January 4, 2026 02:32
@kapral18 kapral18 requested a review from Coolomina January 4, 2026 10:15
@kapral18 kapral18 changed the title fix!: stabilize queue and move file locations to _locations index fix: stabilize queue and move file locations to _locations index Jan 4, 2026
@kapral18 kapral18 requested a review from Copilot January 5, 2026 10:12


@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch from 0a326c4 to f26da79 Compare January 5, 2026 10:25
@kapral18 kapral18 requested a review from Copilot January 5, 2026 11:08


@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch from f26da79 to 7bd99d6 Compare January 5, 2026 11:27
@kapral18 kapral18 requested a review from Copilot January 5, 2026 11:47


- Store file paths and line ranges in <index>_locations (one doc per
occurrence)
- Keep <index> as content-deduplicated chunk docs; remove
filePaths/fileCount from primary mapping
- Rewrite indexing + deletion flows to join via chunk_id and clean up
orphan chunk docs

BREAKING CHANGE: primary chunk documents no longer store file-level
metadata; clean reindex required.
@kapral18 kapral18 force-pushed the fix/queue-stability-and-collisions branch from 7bd99d6 to 6c79e8f Compare January 12, 2026 23:37

@simianhacker simianhacker left a comment


@kapral18 kapral18 merged commit 56b5ccc into elastic:main Jan 13, 2026
7 checks passed
@kapral18 kapral18 deleted the fix/queue-stability-and-collisions branch January 13, 2026 23:57