fix: adopt locations-first index model (_locations join) #36

Merged: kapral18 merged 1 commit into elastic:main (Jan 13, 2026)
Conversation
Pull request overview
This PR updates the MCP server to support the new aggregated document structure introduced by indexer #135, where identical chunk content from multiple files is stored in a single Elasticsearch document with multiple file locations.
Key changes:
- Modified `read_file_from_chunks` to use the nested `filePaths[]` array for per-file line ranges instead of top-level fields
- Updated `discover_directories` aggregation to query nested `filePaths.directoryPath` with `reverse_nested` for accurate directory discovery
- Updated type definitions to reflect that `filePath` can now be multi-valued (`string | string[]`)
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| tests/mcp_server/read_file.test.ts | New test file validating reconstruction from aggregated documents with nested filePaths |
| tests/mcp_server/discover_directories.test.ts | Updated mock responses to match new nested aggregation structure |
| tests/elasticsearch/directory_discovery.test.ts | Updated mock responses and query expectations for nested filePaths aggregation |
| src/utils/elasticsearch.ts | Extended CodeChunk interface to support multi-valued filePath and nested filePaths array |
| src/mcp_server/tools/semantic_code_search.ts | Updated filePath type to allow string or string[] |
| src/mcp_server/tools/semantic_code_search.md | Added note about aggregated indices and multi-valued filePath |
| src/mcp_server/tools/read_file.ts | Complete refactor to query and reconstruct files using nested filePaths array |
| src/mcp_server/tools/read_file.md | Added note about aggregated indices |
| src/elasticsearch/directory_discovery.ts | Updated aggregation to use nested filePaths.directoryPath with reverse_nested |
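The `nested` + `reverse_nested` aggregation mentioned in the review summary can be sketched as follows. This is an illustrative shape only (the field names come from the review above; the actual query in `directory_discovery.ts` may differ):

```typescript
// Hypothetical sketch of a directory-discovery aggregation over a
// nested filePaths[] array. `reverse_nested` steps back to the root
// document so counts reflect chunk documents, not nested entries.
const directoryAggregation = {
  aggs: {
    files: {
      nested: { path: "filePaths" },
      aggs: {
        directories: {
          terms: { field: "filePaths.directoryPath", size: 1000 },
          aggs: {
            // Without reverse_nested, doc_count would count nested
            // filePaths entries rather than parent chunk documents.
            chunk_docs: { reverse_nested: {} },
          },
        },
      },
    },
  },
};
```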
abee667 to 2e1a02d
- Update all tool + prompt docs for locations-first indices and KQL split behavior
- Clarify list_indices file counting uses <index>_locations and filePath filters are location-scoped
2e1a02d to 251d9cc
simianhacker approved these changes (Jan 13, 2026)
kapral18 added a commit to elastic/semantic-code-search-indexer that referenced this pull request (Jan 13, 2026):
## Summary

This PR fixes several queue/worker correctness issues and introduces a **locations-first** Elasticsearch storage model.

### Elasticsearch data model (new)

Given a base index name `<index>`:

- `<index>`: **content-deduplicated chunk documents** (semantic search + rich metadata)
- `<index>_locations`: **one document per chunk occurrence** (file path + line range + directory/git metadata)
- `<index>_settings`: per-index state (e.g. last indexed commit per branch)

Chunk docs are keyed by a stable **`chunk_id`** (the document `_id` in `<index>`). Each occurrence is keyed independently in `<index>_locations` and references `chunk_id`.

## Fixes

- Fixes #121: documents with identical content overwrite each other
  - **Problem on `main`**: bulk indexing uses `_id = chunk_hash` in `<index>`. When two files produce the same `chunk_hash`, one document overwrites the other, so only one file is discoverable.
  - **Fix in this PR**: keep `<index>` as **content-deduplicated** chunk docs (one per `chunk_id`) and store **all file occurrences** in `<index>_locations` (one doc per occurrence). Consumers join locations → chunks by `chunk_id`.
- Fixes #133: `SqliteError: too many SQL variables`
  - **Problem on `main`**: large `IN (...)` operations during commit/requeue/stale recovery exceed SQLite's bound-parameter limit.
  - **Fix in this PR**: batch queue DB operations into safe chunks so very large batches and stale recovery don't exceed SQLite limits.
- Fixes #134: IndexerWorker exits prematurely while in-flight tasks are still running
  - **Problem on `main`**: the worker can observe an empty queue and exit while async indexing work is still in progress, leaving rows stuck.
  - **Fix in this PR**: implement correct drain semantics: exit only when dequeue is empty **and** there are no active in-flight tasks.
- Fixes #136: duplicate `chunk_hash` can leave rows stuck in processing
  - **Problem on `main`**: partial bulk failures can't be mapped reliably back to queued documents when multiple inputs share the same `chunk_hash`.
  - **Fix in this PR**: map bulk results back to input queue rows by **input index**, and defensively requeue any remaining `processing` rows if a batch throws.

## How it works (complete, code-level mental model)

### 0) Identifiers (what is stable and why)

**Chunk document id (`chunk_id`)**

- `chunk_id` is the Elasticsearch `_id` used in `<index>`.
- It is computed as a SHA256 over **content identity only**:

```text
chunk_id = sha256(type + ":" + language + ":" + (kind||"") + ":" + (containerPath||"") + ":" + content)
```

Important invariants:

- It intentionally **does not include** file-specific metadata (path, branch, line numbers).
- Identical content in multiple files maps to the same `chunk_id`.

**Location document id (idempotency key)**

- Each occurrence is stored as one doc in `<index>_locations`.
- The `_id` is another stable SHA256 so retries overwrite the same doc (no duplicates):

```text
location_id = sha256(chunk_id + ":" + filePath + ":" + startLine + ":" + endLine + ":" + (git_branch||""))
```

### 1) Index creation / lifecycle

- Full index creates: `<index>`, `<index>_settings`, `<index>_locations`.
- Incremental index ensures `<index>_locations` exists before doing deletes/enqueues (so existing deployments can upgrade without requiring a clean reindex just to create the new index).
- Standalone worker also ensures `<index>` and `<index>_locations` exist before dequeuing work (so it can't leave rows stuck in `processing` due to missing indices).
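The identifier derivations in §0 can be sketched as below. This is a minimal sketch: the field names and the exact serialization are assumptions for illustration, not the indexer's actual code.

```typescript
import { createHash } from "node:crypto";

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// Content identity only: no filePath, branch, or line numbers, so
// identical content in multiple files maps to the same chunk_id.
function chunkId(c: {
  type: string;
  language: string;
  kind?: string;
  containerPath?: string;
  content: string;
}): string {
  return sha256([c.type, c.language, c.kind ?? "", c.containerPath ?? "", c.content].join(":"));
}

// Stable per occurrence, so a retried write overwrites the same
// location doc instead of creating a duplicate.
function locationId(loc: {
  chunkId: string;
  filePath: string;
  startLine: number;
  endLine: number;
  gitBranch?: string;
}): string {
  return sha256(
    [loc.chunkId, loc.filePath, loc.startLine, loc.endLine, loc.gitBranch ?? ""].join(":"),
  );
}
```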
### 2) Indexing flow (two-phase write)

```text
Parser -> sqlite queue.db -> IndexerWorker -> Elasticsearch

For each input chunk occurrence (one queued row):
  - compute chunk_id (content identity)
  - create chunk doc in <index> (id=chunk_id)
  - create/overwrite location doc in <index>_locations (id=location_id)
```

**2.1 Chunk docs (`<index>`) are written via bulk `create`**

- The worker groups input chunks by `chunk_id` (content dedupe).
- It sends a bulk request with `create` operations for those ids.
- `409` conflicts are treated as success (expected when the chunk already exists).

Why `create` (not `index`/`update`):

- Writing the same chunk doc again can cause re-inference / unnecessary work for `semantic_text`.
- `create` makes "first writer wins" explicit; later occurrences don't mutate chunk docs.

**2.2 Location docs (`<index>_locations`) are written via bulk `index`**

- One location doc per *occurrence* (filePath + range + metadata).
- Uses `index` (overwrite) with the stable `location_id` so retries are idempotent.
- If the exact same `location_id` appears multiple times within a single batch (e.g. duplicate queue rows), it is de-duplicated and the error (if any) is mapped to all affected inputs.

### 3) Failure semantics (what gets retried)

Indexing returns a per-input result:

- `succeeded[]`: input rows that were fully persisted (chunk doc exists or already existed **and** location doc was indexed)
- `failed[]`: input rows that hit an Elasticsearch error on either phase

Key behaviors:

- If a chunk `create` item fails for a given `chunk_id`, **all inputs** that map to that `chunk_id` are marked failed for that batch.
- Location indexing is skipped for any input already marked failed in chunk phase.
- If a location `index` item fails for a given `location_id`, **all inputs** that map to that `location_id` are marked failed.
- If the entire bulk request throws (network/cluster failure), all affected inputs are marked failed.
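The chunk-phase mapping described above (409-as-success, and fanning one bulk item's outcome out to every input row sharing its `chunk_id`) can be sketched as follows. Types and names here are illustrative stand-ins, not the worker's real implementation:

```typescript
// Hypothetical shapes: one queued row per chunk occurrence, and one
// bulk `create` result item per unique chunk_id.
type QueueRow = { rowId: number; chunkId: string };
type BulkCreateItem = { _id: string; status: number };

function mapChunkPhase(
  rows: QueueRow[],
  items: BulkCreateItem[],
): { succeeded: number[]; failed: number[] } {
  const statusById = new Map(items.map((it) => [it._id, it.status]));
  const succeeded: number[] = [];
  const failed: number[] = [];
  for (const row of rows) {
    const status = statusById.get(row.chunkId);
    // 2xx = created; 409 = already exists, which is expected under
    // content dedupe and therefore treated as success.
    if (status !== undefined && (status < 300 || status === 409)) {
      succeeded.push(row.rowId);
    } else {
      // A failed create for one chunk_id fails every input row that
      // maps to it in this batch.
      failed.push(row.rowId);
    }
  }
  return { succeeded, failed };
}
```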
The worker uses these results to:

- commit successful queue rows
- requeue failed queue rows (with retry tracking)

### 4) Read/query patterns (how consumers should query)

**Semantic search**

- Query `<index>.semantic_text`.
- Results are chunk docs (`chunk_id` is the hit `_id`).

**File reconstruction / file-level filtering**

- Query `<index>_locations` by `filePath` (and optionally `directoryPath`, line ranges, branch).
- Join to chunk docs via `mget` on `<index>` using `chunk_id`.

**Directory aggregations**

- Aggregate on `<index>_locations.directoryPath`.
- If you need language/kind/symbol breakdowns, join the top `chunk_id`s back to `<index>`.

### 5) Incremental deletion (changed/deleted files)

```text
changed/deleted file paths
  -> scan <index>_locations for those filePath values (PIT + search_after + _shard_doc)
  -> bulk delete matching location docs
  -> refresh <index>_locations (visibility for orphan check)
  -> collect affected chunk_id values
  -> delete orphan chunk docs in <index> (chunk_id with zero remaining locations)
```

Implementation details:

- Location deletion uses PIT + `_shard_doc` to paginate safely and keep memory bounded.
- After deleting location docs, the index is refreshed so the orphan check is correct.
- Orphan cleanup checks which `chunk_id`s still exist in `<index>_locations` via a terms aggregation, then bulk deletes chunk docs for ids that are absent.

## Why this design (vs alternatives)

This PR needs to solve two constraints at once:

- **Correctness at scale**: a chunk can appear in many files; we must not lose occurrences.
- **Operational stability**: writes/updates must stay predictable on very large repos.

Alternatives we considered:

1) **Keep one doc per occurrence in `<index>`** (e.g. make `_id = hash(filePath + content)`)
   - **Pros**: single-index, simple read model.
   - **Cons**:
     - duplicates `semantic_text` / vectors for the same content (index bloat)
     - increases semantic inference work proportionally to number of occurrences
     - makes "show me the canonical chunk for this content" harder
2) **Store all occurrences on the chunk doc** (e.g. keep `filePaths[]`/`fileCount` on `<index>` and append/merge)
   - **Pros**: single query can return chunk + its locations.
   - **Cons**:
     - produces write hot-spots and very large documents for popular chunks
     - update contention (multiple writers touching the same doc)
     - forces artificial guardrails (caps) to keep documents bounded
3) **Store occurrences in `<index>_settings` instead of a dedicated index**
   - **Pros**: avoids introducing a new index name.
   - **Cons**:
     - `<index>_settings` is intended to be a small, low-churn state index (e.g. commit hashes)
     - mixing high-cardinality, high-write occurrence documents into it increases blast radius
     - prevents tuning lifecycle/refresh/retention separately from settings
4) **Use join/parent-child or nested "occurrence docs" inside `<index>`**
   - **Cons**: more complex mappings/queries and still ties occurrence scaling to the primary index.

Why `<index>_locations` wins here:

- **Unbounded locations without mega-documents** (one occurrence = one doc).
- **Predictable write behavior**: no single chunk doc grows with repo size.
- **Independent operational control**: `<index>_locations` can be refreshed/queried/retained independently of settings and chunk content.

## Drawbacks of this solution (and follow-up plan)

This two-index model introduces real trade-offs:

- **Synchronization surface**: chunk writes and location writes can partially fail.
- **Query joins**: some tools require a two-step query (locations → `mget` chunks).
- **Migration complexity**: introducing a new index name requires careful cutovers.

Follow-up in #132 (alias-first architecture) will improve these:

- Use **versioned indices** (e.g. `<alias> -> <index>_vN`, `<alias>_locations -> <index>_locations_vN`) with **atomic alias cutover**, enabling safer migrations and reindexes.
- Make it easier to run **backfill / repair** strategies when one side (chunks vs locations) is temporarily out of sync.
- Reduce downtime risk by allowing **parallel build + switch** instead of in-place mutation.

## Breaking change (clean reindex required)

- `<index>` no longer stores file-level fields (`filePath`, directory fields, `startLine`/`endLine`, `git_*`).
- File-level metadata lives exclusively in `<index>_locations`.
- A clean reindex is required when migrating existing indices.

## MCP impact

- Companion MCP PR (required): elastic/semantic-code-search-mcp-server#36

## Test plan

- [x] `npm run test:unit`
- [x] `npm run test:integration`

## Follow-up

- Alias-first / zero-downtime migration architecture: #132
## Summary
This PR updates the MCP server to match the locations-first Elasticsearch storage model introduced in elastic/semantic-code-search-indexer#135.
## Elasticsearch data model expected by this MCP server

Given a base index name `<index>`:

- `<index>`: content-deduplicated chunk documents (semantic search + chunk-level metadata)
- `<index>_locations`: one document per chunk occurrence (file path + line range + directory/git metadata) referencing the chunk by `chunk_id`

All file-level metadata is resolved from `<index>_locations` and joined back to `<index>` via `chunk_id` (typically using `mget`).

## Problem (with indexer #135)

Indexer #135 makes a deliberate breaking change: chunk documents in `<index>` no longer store file-level fields (e.g. `filePath`, `startLine`, `endLine`, aggregated file lists).

Without this PR, the MCP server's file-level tools would keep reading those fields from `<index>`, where they no longer exist.

## Fix in this PR

This PR moves MCP's file-level reads and KQL evaluation to the correct storage location (`<index>_locations`) while keeping chunk-level reads on `<index>`.

Key tool updates:

- `read_file_from_chunks`: reconstructs files by querying `<index>_locations` for `(filePath, startLine, endLine, chunk_id)` and then fetching chunk content via `mget` from `<index>`; warns when `<index>_locations` references `chunk_id`s missing from `<index>` (visibility into partial-failure / eventual-consistency scenarios).
- `discover_directories`: discovers directories by aggregating on `<index>_locations.directoryPath` and enriching results using chunk docs from `<index>`.
- `map_symbols_by_query`: maps file → symbols/imports/exports by aggregating `filePath -> chunk_id` in `<index>_locations` and joining chunk docs from `<index>`.
- `symbol_analysis`: finds candidate chunk ids in `<index>`, then aggregates per-file presence via `<index>_locations`, then joins to chunk docs for enrichment.
- `semantic_code_search`: queries `<index>` (`semantic_text`) to get a bounded candidate set of `chunk_id`s, preserving KQL `and`/`or`/`not` semantics when location fields are involved.
- `list_indices`: computes file counts via `<index>_locations` (cardinality of `filePath`).

## How it works (ASCII)
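The reconstruction step of `read_file_from_chunks` can be sketched as follows. This is an illustrative sketch only: the names and shapes are assumptions, not the tool's actual types, but it shows the ordering by `startLine` and the warning for `chunk_id`s missing from `<index>`:

```typescript
// Hypothetical shape of a location doc for a single file, plus chunk
// content already fetched via mget keyed by chunk_id.
type FileLocation = { chunkId: string; startLine: number; endLine: number };

function reconstructFile(
  locations: FileLocation[],
  chunkContent: Map<string, string>,
): { text: string; warnings: string[] } {
  const warnings: string[] = [];
  const parts: string[] = [];
  // Locations carry the line ranges, so ordering by startLine restores
  // the original file order of the chunks.
  for (const loc of [...locations].sort((a, b) => a.startLine - b.startLine)) {
    const content = chunkContent.get(loc.chunkId);
    if (content === undefined) {
      // A location referencing a missing chunk doc indicates a
      // partial-failure / eventual-consistency scenario; surface it.
      warnings.push(`missing chunk ${loc.chunkId} for lines ${loc.startLine}-${loc.endLine}`);
      continue;
    }
    parts.push(content);
  }
  return { text: parts.join("\n"), warnings };
}
```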
## Why this design (vs alternatives)

This PR is intentionally aligned to the indexer's data model. The goal is correct results at scale, without reintroducing the behaviors that forced caps/timeouts in a single-index, location-aggregating design.

Alternatives considered:

- Keep querying `<index>` and "pretend" file metadata lives there
- Duplicate file metadata into `<index>` for MCP convenience
- Apply only `and`-style KQL splits and drop correct `or`/`not` semantics when location fields are involved
- Have `semantic_code_search` support unbounded KQL-only queries that reference `filePath`

## About complexity
This PR does add some complexity to MCP, but it is the minimum necessary complexity imposed by the correct storage model.
In short: the MCP isn’t “getting fancy”; it’s querying the right source of truth and preserving correct semantics.
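The kind of KQL split involved can be sketched as below. The field lists and function names are assumptions for illustration, not the server's actual implementation; the point is only that each predicate is routed to the index that actually stores its field:

```typescript
// Hypothetical split: location-scoped fields live on <index>_locations,
// everything else is evaluated against chunk docs on <index>.
const LOCATION_FIELDS = new Set([
  "filePath",
  "directoryPath",
  "startLine",
  "endLine",
  "git_branch",
]);

type Filter = { field: string; value: string };

function splitFilters(filters: Filter[]): {
  chunkScoped: Filter[];
  locationScoped: Filter[];
} {
  const chunkScoped: Filter[] = [];
  const locationScoped: Filter[] = [];
  for (const f of filters) {
    (LOCATION_FIELDS.has(f.field) ? locationScoped : chunkScoped).push(f);
  }
  return { chunkScoped, locationScoped };
}
```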
## Drawbacks / mitigations

- Two-step reads (locations → `mget` chunks): joins are bounded (`mget`) and only done for the ids needed.
- Locations can reference chunk docs that are missing (partial failures): `read_file_from_chunks` emits an explicit warning header to make this visible.
- Unbounded location-only KQL is not supported in `semantic_code_search`: use `query` to establish a bounded candidate set, or use tools that are naturally location-driven (`map_symbols_by_query`, `discover_directories`).

## Breaking

- No more `filePath` / aggregated file lists / per-file fields on chunk documents.

## Blocked

- elastic/semantic-code-search-indexer#135 (requires a reindex using the new mapping/model)

## Test plan

- `npm run build`
- `npm run lint`
- `npm test`