fix: adopt locations-first index model (_locations join) #36

Merged: kapral18 merged 1 commit into elastic:main (Jan 13, 2026)
Conversation
Pull request overview
This PR updates the MCP server to support the new aggregated document structure introduced by indexer #135, where identical chunk content from multiple files is stored in a single Elasticsearch document with multiple file locations.
Key changes:
- Modified `read_file_from_chunks` to use the nested `filePaths[]` array for per-file line ranges instead of top-level fields
- Updated `discover_directories` aggregation to query nested `filePaths.directoryPath` with `reverse_nested` for accurate directory discovery
- Updated type definitions to reflect that `filePath` can now be multi-valued (`string | string[]`)
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| tests/mcp_server/read_file.test.ts | New test file validating reconstruction from aggregated documents with nested filePaths |
| tests/mcp_server/discover_directories.test.ts | Updated mock responses to match new nested aggregation structure |
| tests/elasticsearch/directory_discovery.test.ts | Updated mock responses and query expectations for nested filePaths aggregation |
| src/utils/elasticsearch.ts | Extended CodeChunk interface to support multi-valued filePath and nested filePaths array |
| src/mcp_server/tools/semantic_code_search.ts | Updated filePath type to allow string or string[] |
| src/mcp_server/tools/semantic_code_search.md | Added note about aggregated indices and multi-valued filePath |
| src/mcp_server/tools/read_file.ts | Complete refactor to query and reconstruct files using nested filePaths array |
| src/mcp_server/tools/read_file.md | Added note about aggregated indices |
| src/elasticsearch/directory_discovery.ts | Updated aggregation to use nested filePaths.directoryPath with reverse_nested |
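The `nested` + `reverse_nested` aggregation mentioned in the review summary can be sketched as follows. This is an illustrative shape only (the field names come from the review above; the actual query in `directory_discovery.ts` may differ):

```typescript
// Hypothetical sketch of a directory-discovery aggregation over a
// nested filePaths[] array. `reverse_nested` steps back to the root
// document so counts reflect chunk documents, not nested entries.
const directoryAggregation = {
  aggs: {
    files: {
      nested: { path: "filePaths" },
      aggs: {
        directories: {
          terms: { field: "filePaths.directoryPath", size: 1000 },
          aggs: {
            // Without reverse_nested, doc_count would count nested
            // filePaths entries rather than parent chunk documents.
            chunk_docs: { reverse_nested: {} },
          },
        },
      },
    },
  },
};
```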
abee667 to 2e1a02d
- Update all tool + prompt docs for locations-first indices and KQL split behavior
- Clarify list_indices file counting uses <index>_locations and filePath filters are location-scoped
2e1a02d to 251d9cc
simianhacker approved these changes (Jan 13, 2026)
kapral18 added a commit to elastic/semantic-code-search-indexer that referenced this pull request (Jan 13, 2026):
## Summary

This PR fixes several queue/worker correctness issues and introduces a **locations-first** Elasticsearch storage model.

### Elasticsearch data model (new)

Given a base index name `<index>`:

- `<index>`: **content-deduplicated chunk documents** (semantic search + rich metadata)
- `<index>_locations`: **one document per chunk occurrence** (file path + line range + directory/git metadata)
- `<index>_settings`: per-index state (e.g. last indexed commit per branch)

Chunk docs are keyed by a stable **`chunk_id`** (the document `_id` in `<index>`). Each occurrence is keyed independently in `<index>_locations` and references `chunk_id`.

## Fixes

- Fixes #121: documents with identical content overwrite each other
  - **Problem on `main`**: bulk indexing uses `_id = chunk_hash` in `<index>`. When two files produce the same `chunk_hash`, one document overwrites the other, so only one file is discoverable.
  - **Fix in this PR**: keep `<index>` as **content-deduplicated** chunk docs (one per `chunk_id`) and store **all file occurrences** in `<index>_locations` (one doc per occurrence). Consumers join locations → chunks by `chunk_id`.
- Fixes #133: `SqliteError: too many SQL variables`
  - **Problem on `main`**: large `IN (...)` operations during commit/requeue/stale recovery exceed SQLite's bound-parameter limit.
  - **Fix in this PR**: batch queue DB operations into safe chunks so very large batches and stale recovery don't exceed SQLite limits.
- Fixes #134: IndexerWorker exits prematurely while in-flight tasks are still running
  - **Problem on `main`**: the worker can observe an empty queue and exit while async indexing work is still in progress, leaving rows stuck.
  - **Fix in this PR**: implement correct drain semantics: exit only when dequeue is empty **and** there are no active in-flight tasks.
- Fixes #136: duplicate `chunk_hash` can leave rows stuck in processing
  - **Problem on `main`**: partial bulk failures can't be mapped reliably back to queued documents when multiple inputs share the same `chunk_hash`.
  - **Fix in this PR**: map bulk results back to input queue rows by **input index**, and defensively requeue any remaining `processing` rows if a batch throws.

## How it works (complete, code-level mental model)

### 0) Identifiers (what is stable and why)

**Chunk document id (`chunk_id`)**

- `chunk_id` is the Elasticsearch `_id` used in `<index>`.
- It is computed as a SHA256 over **content identity only**:

```text
chunk_id = sha256(type + ":" + language + ":" + (kind||"") + ":" + (containerPath||"") + ":" + content)
```

Important invariants:

- It intentionally **does not include** file-specific metadata (path, branch, line numbers).
- Identical content in multiple files maps to the same `chunk_id`.

**Location document id (idempotency key)**

- Each occurrence is stored as one doc in `<index>_locations`.
- The `_id` is another stable SHA256 so retries overwrite the same doc (no duplicates):

```text
location_id = sha256(chunk_id + ":" + filePath + ":" + startLine + ":" + endLine + ":" + (git_branch||""))
```

### 1) Index creation / lifecycle

- Full index creates: `<index>`, `<index>_settings`, `<index>_locations`.
- Incremental index ensures `<index>_locations` exists before doing deletes/enqueues (so existing deployments can upgrade without requiring a clean reindex just to create the new index).
- Standalone worker also ensures `<index>` and `<index>_locations` exist before dequeuing work (so it can't leave rows stuck in `processing` due to missing indices).
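The identifier derivations in §0 can be sketched as below. This is a minimal sketch: the field names and the exact serialization are assumptions for illustration, not the indexer's actual code.

```typescript
import { createHash } from "node:crypto";

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// Content identity only: no filePath, branch, or line numbers, so
// identical content in multiple files maps to the same chunk_id.
function chunkId(c: {
  type: string;
  language: string;
  kind?: string;
  containerPath?: string;
  content: string;
}): string {
  return sha256([c.type, c.language, c.kind ?? "", c.containerPath ?? "", c.content].join(":"));
}

// Stable per occurrence, so a retried write overwrites the same
// location doc instead of creating a duplicate.
function locationId(loc: {
  chunkId: string;
  filePath: string;
  startLine: number;
  endLine: number;
  gitBranch?: string;
}): string {
  return sha256(
    [loc.chunkId, loc.filePath, loc.startLine, loc.endLine, loc.gitBranch ?? ""].join(":"),
  );
}
```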
### 2) Indexing flow (two-phase write)

```text
Parser -> sqlite queue.db -> IndexerWorker -> Elasticsearch

For each input chunk occurrence (one queued row):
  - compute chunk_id (content identity)
  - create chunk doc in <index> (id=chunk_id)
  - create/overwrite location doc in <index>_locations (id=location_id)
```

**2.1 Chunk docs (`<index>`) are written via bulk `create`**

- The worker groups input chunks by `chunk_id` (content dedupe).
- It sends a bulk request with `create` operations for those ids.
- `409` conflicts are treated as success (expected when the chunk already exists).

Why `create` (not `index`/`update`):

- Writing the same chunk doc again can cause re-inference / unnecessary work for `semantic_text`.
- `create` makes "first writer wins" explicit; later occurrences don't mutate chunk docs.

**2.2 Location docs (`<index>_locations`) are written via bulk `index`**

- One location doc per *occurrence* (filePath + range + metadata).
- Uses `index` (overwrite) with the stable `location_id` so retries are idempotent.
- If the exact same `location_id` appears multiple times within a single batch (e.g. duplicate queue rows), it is de-duplicated and the error (if any) is mapped to all affected inputs.

### 3) Failure semantics (what gets retried)

Indexing returns a per-input result:

- `succeeded[]`: input rows that were fully persisted (chunk doc exists or already existed **and** location doc was indexed)
- `failed[]`: input rows that hit an Elasticsearch error on either phase

Key behaviors:

- If a chunk `create` item fails for a given `chunk_id`, **all inputs** that map to that `chunk_id` are marked failed for that batch.
- Location indexing is skipped for any input already marked failed in chunk phase.
- If a location `index` item fails for a given `location_id`, **all inputs** that map to that `location_id` are marked failed.
- If the entire bulk request throws (network/cluster failure), all affected inputs are marked failed.
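The chunk-phase mapping described above (409-as-success, and fanning one bulk item's outcome out to every input row sharing its `chunk_id`) can be sketched as follows. Types and names here are illustrative stand-ins, not the worker's real implementation:

```typescript
// Hypothetical shapes: one queued row per chunk occurrence, and one
// bulk `create` result item per unique chunk_id.
type QueueRow = { rowId: number; chunkId: string };
type BulkCreateItem = { _id: string; status: number };

function mapChunkPhase(
  rows: QueueRow[],
  items: BulkCreateItem[],
): { succeeded: number[]; failed: number[] } {
  const statusById = new Map(items.map((it) => [it._id, it.status]));
  const succeeded: number[] = [];
  const failed: number[] = [];
  for (const row of rows) {
    const status = statusById.get(row.chunkId);
    // 2xx = created; 409 = already exists, which is expected under
    // content dedupe and therefore treated as success.
    if (status !== undefined && (status < 300 || status === 409)) {
      succeeded.push(row.rowId);
    } else {
      // A failed create for one chunk_id fails every input row that
      // maps to it in this batch.
      failed.push(row.rowId);
    }
  }
  return { succeeded, failed };
}
```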
The worker uses these results to:

- commit successful queue rows
- requeue failed queue rows (with retry tracking)

### 4) Read/query patterns (how consumers should query)

**Semantic search**

- Query `<index>.semantic_text`.
- Results are chunk docs (`chunk_id` is the hit `_id`).

**File reconstruction / file-level filtering**

- Query `<index>_locations` by `filePath` (and optionally `directoryPath`, line ranges, branch).
- Join to chunk docs via `mget` on `<index>` using `chunk_id`.

**Directory aggregations**

- Aggregate on `<index>_locations.directoryPath`.
- If you need language/kind/symbol breakdowns, join the top `chunk_id`s back to `<index>`.

### 5) Incremental deletion (changed/deleted files)

```text
changed/deleted file paths
  -> scan <index>_locations for those filePath values (PIT + search_after + _shard_doc)
  -> bulk delete matching location docs
  -> refresh <index>_locations (visibility for orphan check)
  -> collect affected chunk_id values
  -> delete orphan chunk docs in <index> (chunk_id with zero remaining locations)
```

Implementation details:

- Location deletion uses PIT + `_shard_doc` to paginate safely and keep memory bounded.
- After deleting location docs, the index is refreshed so the orphan check is correct.
- Orphan cleanup checks which `chunk_id`s still exist in `<index>_locations` via a terms aggregation, then bulk deletes chunk docs for ids that are absent.

## Why this design (vs alternatives)

This PR needs to solve two constraints at once:

- **Correctness at scale**: a chunk can appear in many files; we must not lose occurrences.
- **Operational stability**: writes/updates must stay predictable on very large repos.

Alternatives we considered:

1) **Keep one doc per occurrence in `<index>`** (e.g. make `_id = hash(filePath + content)`)
   - **Pros**: single-index, simple read model.
   - **Cons**:
     - duplicates `semantic_text` / vectors for the same content (index bloat)
     - increases semantic inference work proportionally to number of occurrences
     - makes "show me the canonical chunk for this content" harder
2) **Store all occurrences on the chunk doc** (e.g. keep `filePaths[]`/`fileCount` on `<index>` and append/merge)
   - **Pros**: single query can return chunk + its locations.
   - **Cons**:
     - produces write hot-spots and very large documents for popular chunks
     - update contention (multiple writers touching the same doc)
     - forces artificial guardrails (caps) to keep documents bounded
3) **Store occurrences in `<index>_settings` instead of a dedicated index**
   - **Pros**: avoids introducing a new index name.
   - **Cons**:
     - `<index>_settings` is intended to be a small, low-churn state index (e.g. commit hashes)
     - mixing high-cardinality, high-write occurrence documents into it increases blast radius
     - prevents tuning lifecycle/refresh/retention separately from settings
4) **Use join/parent-child or nested "occurrence docs" inside `<index>`**
   - **Cons**: more complex mappings/queries and still ties occurrence scaling to the primary index.

Why `<index>_locations` wins here:

- **Unbounded locations without mega-documents** (one occurrence = one doc).
- **Predictable write behavior**: no single chunk doc grows with repo size.
- **Independent operational control**: `<index>_locations` can be refreshed/queried/retained independently of settings and chunk content.

## Drawbacks of this solution (and follow-up plan)

This two-index model introduces real trade-offs:

- **Synchronization surface**: chunk writes and location writes can partially fail.
- **Query joins**: some tools require a two-step query (locations → `mget` chunks).
- **Migration complexity**: introducing a new index name requires careful cutovers.

Follow-up in #132 (alias-first architecture) will improve these:

- Use **versioned indices** (e.g. `<alias> -> <index>_vN`, `<alias>_locations -> <index>_locations_vN`) with **atomic alias cutover**, enabling safer migrations and reindexes.
- Make it easier to run **backfill / repair** strategies when one side (chunks vs locations) is temporarily out of sync.
- Reduce downtime risk by allowing **parallel build + switch** instead of in-place mutation.

## Breaking change (clean reindex required)

- `<index>` no longer stores file-level fields (`filePath`, directory fields, `startLine`/`endLine`, `git_*`).
- File-level metadata lives exclusively in `<index>_locations`.
- A clean reindex is required when migrating existing indices.

## MCP impact

- Companion MCP PR (required): elastic/semantic-code-search-mcp-server#36

## Test plan

- [x] `npm run test:unit`
- [x] `npm run test:integration`

## Follow-up

- Alias-first / zero-downtime migration architecture: #132
## Summary
This PR updates the MCP server to match the locations-first Elasticsearch storage model introduced in elastic/semantic-code-search-indexer#135.
## Elasticsearch data model expected by this MCP server

Given a base index name `<index>`:

- `<index>`: content-deduplicated chunk documents (semantic search + chunk-level metadata)
- `<index>_locations`: one document per chunk occurrence (file path + line range + directory/git metadata) referencing the chunk by `chunk_id`

All file-level metadata is resolved from `<index>_locations` and joined back to `<index>` via `chunk_id` (typically using `mget`).

## Problem (with indexer #135)

Indexer #135 makes a deliberate breaking change: chunk documents in `<index>` no longer store file-level fields (e.g. `filePath`, `startLine`, `endLine`, aggregated file lists).

Without this PR, the MCP server's file-level tools would keep reading those fields from `<index>`, where they no longer exist.

## Fix in this PR

This PR moves MCP's file-level reads and KQL evaluation to the correct storage location (`<index>_locations`) while keeping chunk-level reads on `<index>`.

Key tool updates:

- `read_file_from_chunks`: reconstructs files by querying `<index>_locations` for `(filePath, startLine, endLine, chunk_id)` and then fetching chunk content via `mget` from `<index>`; warns when `<index>_locations` references `chunk_id`s missing from `<index>` (visibility into partial-failure / eventual-consistency scenarios).
- `discover_directories`: discovers directories by aggregating on `<index>_locations.directoryPath` and enriching results using chunk docs from `<index>`.
- `map_symbols_by_query`: maps file → symbols/imports/exports by aggregating `filePath -> chunk_id` in `<index>_locations` and joining chunk docs from `<index>`.
- `symbol_analysis`: finds candidate chunk ids in `<index>`, then aggregates per-file presence via `<index>_locations`, then joins to chunk docs for enrichment.
- `semantic_code_search`: queries `<index>` (`semantic_text`) to get a bounded candidate set of `chunk_id`s, preserving KQL `and`/`or`/`not` semantics when location fields are involved.
- `list_indices`: computes file counts via `<index>_locations` (cardinality of `filePath`).

## How it works (ASCII)
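The reconstruction step of `read_file_from_chunks` can be sketched as follows. This is an illustrative sketch only: the names and shapes are assumptions, not the tool's actual types, but it shows the ordering by `startLine` and the warning for `chunk_id`s missing from `<index>`:

```typescript
// Hypothetical shape of a location doc for a single file, plus chunk
// content already fetched via mget keyed by chunk_id.
type FileLocation = { chunkId: string; startLine: number; endLine: number };

function reconstructFile(
  locations: FileLocation[],
  chunkContent: Map<string, string>,
): { text: string; warnings: string[] } {
  const warnings: string[] = [];
  const parts: string[] = [];
  // Locations carry the line ranges, so ordering by startLine restores
  // the original file order of the chunks.
  for (const loc of [...locations].sort((a, b) => a.startLine - b.startLine)) {
    const content = chunkContent.get(loc.chunkId);
    if (content === undefined) {
      // A location referencing a missing chunk doc indicates a
      // partial-failure / eventual-consistency scenario; surface it.
      warnings.push(`missing chunk ${loc.chunkId} for lines ${loc.startLine}-${loc.endLine}`);
      continue;
    }
    parts.push(content);
  }
  return { text: parts.join("\n"), warnings };
}
```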
## Why this design (vs alternatives)

This PR is intentionally aligned to the indexer's data model. The goal is correct results at scale, without reintroducing the behaviors that forced caps/timeouts in a single-index, location-aggregating design.

Alternatives considered:

- Keep querying `<index>` and "pretend" file metadata lives there
- Duplicate file metadata into `<index>` for MCP convenience
- Apply only `and`-style KQL splits and drop correct `or`/`not` semantics when location fields are involved
- Have `semantic_code_search` support unbounded KQL-only queries that reference `filePath`

## About complexity
This PR does add some complexity to MCP, but it is the minimum necessary complexity imposed by the correct storage model.
In short: the MCP isn’t “getting fancy”; it’s querying the right source of truth and preserving correct semantics.
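The kind of KQL split involved can be sketched as below. The field lists and function names are assumptions for illustration, not the server's actual implementation; the point is only that each predicate is routed to the index that actually stores its field:

```typescript
// Hypothetical split: location-scoped fields live on <index>_locations,
// everything else is evaluated against chunk docs on <index>.
const LOCATION_FIELDS = new Set([
  "filePath",
  "directoryPath",
  "startLine",
  "endLine",
  "git_branch",
]);

type Filter = { field: string; value: string };

function splitFilters(filters: Filter[]): {
  chunkScoped: Filter[];
  locationScoped: Filter[];
} {
  const chunkScoped: Filter[] = [];
  const locationScoped: Filter[] = [];
  for (const f of filters) {
    (LOCATION_FIELDS.has(f.field) ? locationScoped : chunkScoped).push(f);
  }
  return { chunkScoped, locationScoped };
}
```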
## Drawbacks / mitigations

- Two-step reads (locations → `mget` chunks): joins are bounded (`mget`) and only done for the ids needed.
- Locations can reference chunk docs that are missing (partial failures): `read_file_from_chunks` emits an explicit warning header to make this visible.
- Unbounded location-only KQL is not supported in `semantic_code_search`: use `query` to establish a bounded candidate set, or use tools that are naturally location-driven (`map_symbols_by_query`, `discover_directories`).

## Breaking

- No more `filePath` / aggregated file lists / per-file fields on chunk documents.

## Blocked

- elastic/semantic-code-search-indexer#135 (requires a reindex using the new mapping/model)

## Test plan

- `npm run build`
- `npm run lint`
- `npm test`