
fix: adopt locations-first index model (_locations join)#36

Merged
kapral18 merged 1 commit into elastic:main from kapral18:chore/adopt-new-filepath-format on Jan 13, 2026

Conversation


@kapral18 kapral18 commented Jan 1, 2026

Summary

This PR updates the MCP server to match the locations-first Elasticsearch storage model introduced in elastic/semantic-code-search-indexer#135.

Elasticsearch data model expected by this MCP server

Given a base index name <index>:

  • <index>: content-deduplicated chunk documents (semantic search + chunk-level metadata)
  • <index>_locations: one document per chunk occurrence (file path + line range + directory/git metadata) referencing the chunk by chunk_id

All file-level metadata is resolved from <index>_locations and joined back to <index> via chunk_id (typically using mget).
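
The join can be sketched as a pure function over simplified document shapes. This is a hypothetical illustration — the interfaces and helper name below are not the server's actual code; in practice the locations come from a search on `<index>_locations` and the chunks from an mget on `<index>`:

```typescript
// Hypothetical document shapes; field names follow this PR's description.
interface LocationDoc {
  chunk_id: string;
  filePath: string;
  startLine: number;
  endLine: number;
}

interface ChunkDoc {
  chunk_id: string;
  content: string;
}

/** Join location docs to chunk docs by chunk_id (mget-style lookup). */
function joinLocationsToChunks(
  locations: LocationDoc[],
  chunksById: Map<string, ChunkDoc>,
): Array<LocationDoc & { content: string | null }> {
  return locations.map((loc) => ({
    ...loc,
    // null mirrors an mget miss: the location references a missing chunk
    content: chunksById.get(loc.chunk_id)?.content ?? null,
  }));
}
```
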

Problem (with indexer #135)

Indexer #135 makes a deliberate breaking change: chunk documents in <index> no longer store file-level fields (e.g. filePath, startLine, endLine, aggregated file lists).

Without this PR:

  • Tools that need file context (directory discovery, file reconstruction, filePath-filtered mapping) either fail outright or silently produce incomplete results.
  • KQL that references file-level fields cannot be evaluated correctly if the MCP only queries <index>.

Fix in this PR

This PR moves MCP’s file-level reads and KQL evaluation to the correct storage location (<index>_locations) while keeping chunk-level reads on <index>.

Key tool updates:

  • read_file_from_chunks: reconstructs files by querying <index>_locations for (filePath, startLine, endLine, chunk_id) and then fetching chunk content via mget from <index>.
    • Adds a small warning header if <index>_locations references chunk_ids missing from <index> (visibility into partial-failure / eventual-consistency scenarios).
  • discover_directories: discovers directories by aggregating on <index>_locations.directoryPath and enriching results using chunk docs from <index>.
  • map_symbols_by_query: maps file → symbols/imports/exports by aggregating filePath -> chunk_id in <index>_locations and joining chunk docs from <index>.
  • symbol_analysis: finds candidate chunk ids in <index>, then aggregates per-file presence via <index>_locations, then joins to chunk docs for enrichment.
  • semantic_code_search:
    • Searches <index> (semantic_text) to get a bounded candidate set of chunk_ids.
    • Evaluates KQL with correct boolean semantics across the split model within that bounded set (supports and/or/not).
    • Returns chunk hits plus a sample of matching locations.
  • list_indices: computes file counts via <index>_locations (cardinality of filePath).

How it works (ASCII)

read_file_from_chunks(filePaths[])
  -> search <index>_locations (filePath) => [(chunk_id, startLine, endLine)...]
  -> mget <index> (chunk_id) => chunk content
  -> sort + reconstruct
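
The final "sort + reconstruct" step can be sketched as follows, assuming the locations → mget join has already produced per-file tuples. Shapes and names are illustrative, not the server's actual helpers:

```typescript
interface ResolvedChunk {
  startLine: number;
  endLine: number;
  content: string | null; // null = chunk_id in _locations but missing from <index>
}

function reconstructFile(chunks: ResolvedChunk[]): { text: string; missing: number } {
  // order chunks by their position in the original file
  const ordered = [...chunks].sort((a, b) => a.startLine - b.startLine);
  const missing = ordered.filter((c) => c.content === null).length;
  const text = ordered
    .map((c) => c.content)
    .filter((c): c is string => c !== null)
    .join("\n");
  // missing > 0 is what would trigger the tool's warning header
  return { text, missing };
}
```
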

semantic_code_search(query, kql)
  -> search <index> (semantic_text [+ optional chunk-only KQL]) => candidate chunk_id[]
  -> evaluate full KQL across split storage within that universe:
     - chunk field clauses -> query <index> restricted to candidate ids
     - location field clauses -> query <index>_locations restricted to candidate ids
     - combine via set logic (and/or/not)
  -> fetch location samples from <index>_locations for resulting chunk_id[]
  -> return chunk hits + locations
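
The "combine via set logic" step reduces to plain set algebra once each KQL leaf clause has been resolved (against `<index>` or `<index>_locations`) to the subset of candidate chunk_ids it matches. A minimal sketch, with illustrative names:

```typescript
type IdSet = Set<string>;

const and = (a: IdSet, b: IdSet): IdSet =>
  new Set([...a].filter((id) => b.has(id)));

const or = (a: IdSet, b: IdSet): IdSet => new Set([...a, ...b]);

// "not" stays bounded: the complement is taken within the candidate
// universe, never against the whole index.
const not = (universe: IdSet, a: IdSet): IdSet =>
  new Set([...universe].filter((id) => !a.has(id)));
```
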

map_symbols_by_query(kql)
  -> split KQL by storage (chunk vs location where possible)
  -> aggregate <index>_locations by filePath -> chunk_id (optionally restricted by chunk ids)
  -> mget <index> to enrich => grouped symbols/imports/exports per file

Why this design (vs alternatives)

This PR is intentionally aligned to the indexer’s data model. The goal is correct results at scale, without reintroducing the behaviors that forced caps/timeouts in a single-index, location-aggregating design.

Alternatives considered:

  1. Keep MCP querying only <index> and “pretend” file metadata lives there
    • Rejected: file-level fields are not present on chunk docs in the locations-first model.
  2. Teach the indexer to denormalize file-level fields back onto <index> for MCP convenience
    • Rejected: that reintroduces the scaling failure mode (hot, growing chunk docs) that the locations-first model is explicitly avoiding.
  3. Only support and-style KQL splits and drop correct or/not semantics when location fields are involved
    • Rejected: it produces incorrect filtering semantics (surprising to users) and is hard to detect from the outside.
  4. Try to make semantic_code_search support unbounded KQL-only queries that reference filePath
    • Rejected: without a bounded candidate universe, this becomes “scan everything” behavior, which is both expensive and unpredictable at large scale.

About complexity

This PR does add some complexity to MCP, but it’s the minimum necessary complexity imposed by the correct storage model:

  • The MCP’s user-facing tool behavior stays conceptually the same (same inputs/outputs, same mental model: “search chunks, map them to files”).
  • The additional complexity is localized to utilities that (a) split KQL by storage and (b) evaluate KQL across the two indices within a bounded candidate universe.
  • The alternative to handling this complexity in MCP is to push it back into Elasticsearch document shape (denormalization / aggregation), which is exactly what breaks at scale.

In short: the MCP isn’t “getting fancy”; it’s querying the right source of truth and preserving correct semantics.

Drawbacks / mitigations

  • Join required for some tools (locations → mget chunks)
    • Mitigation: joins are batched (mget) and only done for the ids needed.
  • Temporary inconsistency possible (locations reference a chunk that is missing)
    • Mitigation: read_file_from_chunks emits an explicit warning header to make this visible.
  • KQL-only queries with location fields are not allowed in semantic_code_search
    • Mitigation: callers can add a semantic query to establish a bounded candidate set, or use tools that are naturally location-driven (map_symbols_by_query, discover_directories).

Breaking

  • MCP tools no longer expect filePath / aggregated file lists / per-file fields on chunk documents.

Blocked

  • Blocked by: elastic/semantic-code-search-indexer#135 (requires a reindex using the new mapping/model)

Test plan

  • npm run build
  • npm run lint
  • npm test

Copilot AI review requested due to automatic review settings January 1, 2026 22:36

Copilot AI left a comment


Pull request overview

This PR updates the MCP server to support the new aggregated document structure introduced by indexer #135, where identical chunk content from multiple files is stored in a single Elasticsearch document with multiple file locations.

Key changes:

  • Modified read_file_from_chunks to use nested filePaths[] array for per-file line ranges instead of top-level fields
  • Updated discover_directories aggregation to query nested filePaths.directoryPath with reverse_nested for accurate directory discovery
  • Updated type definitions to reflect that filePath can now be multi-valued (string | string[])

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/mcp_server/read_file.test.ts | New test file validating reconstruction from aggregated documents with nested filePaths |
| tests/mcp_server/discover_directories.test.ts | Updated mock responses to match new nested aggregation structure |
| tests/elasticsearch/directory_discovery.test.ts | Updated mock responses and query expectations for nested filePaths aggregation |
| src/utils/elasticsearch.ts | Extended CodeChunk interface to support multi-valued filePath and nested filePaths array |
| src/mcp_server/tools/semantic_code_search.ts | Updated filePath type to allow string or string[] |
| src/mcp_server/tools/semantic_code_search.md | Added note about aggregated indices and multi-valued filePath |
| src/mcp_server/tools/read_file.ts | Complete refactor to query and reconstruct files using nested filePaths array |
| src/mcp_server/tools/read_file.md | Added note about aggregated indices |
| src/elasticsearch/directory_discovery.ts | Updated aggregation to use nested filePaths.directoryPath with reverse_nested |


@kapral18 kapral18 changed the title from “fix(mcp): support aggregated filePaths from indexer #135” to “fix!: adopt locations-first index model (_locations join)” on Jan 4, 2026
@kapral18 kapral18 force-pushed the chore/adopt-new-filepath-format branch 4 times, most recently from abee667 to 2e1a02d on January 4, 2026 19:33
@kapral18 kapral18 changed the title from “fix!: adopt locations-first index model (_locations join)” to “fix: adopt locations-first index model (_locations join)” on Jan 4, 2026


@kapral18 kapral18 requested a review from Copilot January 5, 2026 12:02


@kapral18 kapral18 requested a review from Copilot January 5, 2026 12:05


- Update all tool + prompt docs for locations-first indices and KQL split behavior
- Clarify list_indices file counting uses <index>_locations and filePath filters are location-scoped
@kapral18 kapral18 force-pushed the chore/adopt-new-filepath-format branch from 2e1a02d to 251d9cc on January 5, 2026 12:22

@simianhacker simianhacker left a comment


kapral18 added a commit to elastic/semantic-code-search-indexer that referenced this pull request Jan 13, 2026
## Summary

This PR fixes several queue/worker correctness issues and introduces a
**locations-first** Elasticsearch storage model.

### Elasticsearch data model (new)

Given a base index name `<index>`:

- `<index>`: **content-deduplicated chunk documents** (semantic search +
rich metadata)
- `<index>_locations`: **one document per chunk occurrence** (file path
+ line range + directory/git metadata)
- `<index>_settings`: per-index state (e.g. last indexed commit per
branch)

Chunk docs are keyed by a stable **`chunk_id`** (the document `_id` in
`<index>`). Each occurrence is keyed independently in
`<index>_locations` and references `chunk_id`.

## Fixes

- Fixes #121: documents with identical content overwrite each other
  - **Problem on `main`**: bulk indexing uses `_id = chunk_hash` in `<index>`. When two files produce the same `chunk_hash`, one document overwrites the other, so only one file is discoverable.
  - **Fix in this PR**: keep `<index>` as **content-deduplicated** chunk docs (one per `chunk_id`) and store **all file occurrences** in `<index>_locations` (one doc per occurrence). Consumers join locations → chunks by `chunk_id`.

- Fixes #133: `SqliteError: too many SQL variables`
  - **Problem on `main`**: large `IN (...)` operations during commit/requeue/stale recovery exceed SQLite’s bound-parameter limit.
  - **Fix in this PR**: batch queue DB operations into safe chunks so very large batches and stale recovery don’t exceed SQLite limits.
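
The batching fix can be sketched as below. Note the limit constant and the commented statement are assumptions for illustration (999 is the historical SQLite default for `SQLITE_MAX_VARIABLE_NUMBER`), not the indexer's actual code:

```typescript
const SQLITE_MAX_VARIABLES = 999; // assumed ceiling; build-dependent in practice

/** Split a parameter list into batches that fit under the bound-variable limit. */
function chunkParams<T>(params: T[], limit = SQLITE_MAX_VARIABLES): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < params.length; i += limit) {
    batches.push(params.slice(i, i + limit));
  }
  return batches;
}

// Usage sketch: one statement per batch instead of a single giant IN (...):
// for (const batch of chunkParams(ids)) {
//   const placeholders = batch.map(() => "?").join(",");
//   db.prepare(`DELETE FROM queue WHERE id IN (${placeholders})`).run(...batch);
// }
```
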

- Fixes #134: IndexerWorker exits prematurely while in-flight tasks are still running
  - **Problem on `main`**: the worker can observe an empty queue and exit while async indexing work is still in progress, leaving rows stuck.
  - **Fix in this PR**: implement correct drain semantics: exit only when the dequeue is empty **and** there are no active in-flight tasks.
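
The drain semantics can be sketched as a loop that exits only when the queue yields nothing AND no task is still in flight. The dequeue/task shapes below are hypothetical stand-ins for the indexer's actual worker:

```typescript
async function drain(
  dequeue: () => Promise<(() => Promise<void>) | undefined>,
): Promise<void> {
  let inFlight = 0;
  for (;;) {
    const task = await dequeue();
    if (task) {
      inFlight++;
      void task().finally(() => { inFlight--; });
    } else if (inFlight === 0) {
      return; // queue empty and nothing running: safe to exit
    } else {
      // queue looks empty but work is still running: wait, then re-check
      await new Promise((resolve) => setTimeout(resolve, 10));
    }
  }
}
```
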

- Fixes #136: duplicate `chunk_hash` can leave rows stuck in processing
  - **Problem on `main`**: partial bulk failures can’t be mapped reliably back to queued documents when multiple inputs share the same `chunk_hash`.
  - **Fix in this PR**: map bulk results back to input queue rows by **input index**, and defensively requeue any remaining `processing` rows if a batch throws.
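
Mapping by input index works because the i-th item in an Elasticsearch bulk response corresponds to the i-th submitted operation, so the mapping is unambiguous even when several rows share a `chunk_hash`. A sketch with illustrative shapes:

```typescript
interface BulkItemResult {
  error?: { type: string; reason?: string };
}

/** Partition queued rows into succeeded/failed using positional bulk results. */
function partitionByInputIndex<T>(
  inputs: T[],
  items: BulkItemResult[], // items[i] corresponds to inputs[i]
): { succeeded: T[]; failed: T[] } {
  const succeeded: T[] = [];
  const failed: T[] = [];
  items.forEach((item, i) => {
    (item.error ? failed : succeeded).push(inputs[i]);
  });
  return { succeeded, failed };
}
```
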

## How it works (complete, code-level mental model)

### 0) Identifiers (what is stable and why)

**Chunk document id (`chunk_id`)**

- `chunk_id` is the Elasticsearch `_id` used in `<index>`.
- It is computed as a SHA256 over **content identity only**:

```text
chunk_id = sha256(type + ":" + language + ":" + (kind||"") + ":" + (containerPath||"") + ":" + content)
```

Important invariants:
- It intentionally **does not include** file-specific metadata (path,
branch, line numbers).
- Identical content in multiple files maps to the same `chunk_id`.

**Location document id (idempotency key)**

- Each occurrence is stored as one doc in `<index>_locations`.
- The `_id` is another stable SHA256 so retries overwrite the same doc
(no duplicates):

```text
location_id = sha256(chunk_id + ":" + filePath + ":" + startLine + ":" + endLine + ":" + (git_branch||""))
```
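
The two id formulas above transcribe almost directly into code. This is a sketch — the indexer's exact field serialization may differ in detail:

```typescript
import { createHash } from "node:crypto";

const sha256 = (s: string): string =>
  createHash("sha256").update(s).digest("hex");

// chunk_id: content identity only (no path/branch/line numbers)
function chunkId(c: {
  type: string;
  language: string;
  kind?: string;
  containerPath?: string;
  content: string;
}): string {
  return sha256(
    `${c.type}:${c.language}:${c.kind ?? ""}:${c.containerPath ?? ""}:${c.content}`,
  );
}

// location_id: one stable id per occurrence, so retries overwrite in place
function locationId(l: {
  chunk_id: string;
  filePath: string;
  startLine: number;
  endLine: number;
  git_branch?: string;
}): string {
  return sha256(
    `${l.chunk_id}:${l.filePath}:${l.startLine}:${l.endLine}:${l.git_branch ?? ""}`,
  );
}
```
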

### 1) Index creation / lifecycle

- Full index creates: `<index>`, `<index>_settings`,
`<index>_locations`.
- Incremental index ensures `<index>_locations` exists before doing
deletes/enqueues (so existing deployments can upgrade without requiring
a clean reindex just to create the new index).
- Standalone worker also ensures `<index>` and `<index>_locations` exist
before dequeuing work (so it can’t leave rows stuck in `processing` due
to missing indices).

### 2) Indexing flow (two-phase write)

```text
Parser -> sqlite queue.db -> IndexerWorker -> Elasticsearch

For each input chunk occurrence (one queued row):
  - compute chunk_id (content identity)
  - create chunk doc in <index> (id=chunk_id)
  - create/overwrite location doc in <index>_locations (id=location_id)
```
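
The content-dedupe step before phase 2.1 can be sketched as a grouping pass: many queued occurrences can share one `chunk_id`, but only one `create` op per unique id is sent, and every row in a group shares that op's outcome. The input shape is illustrative:

```typescript
function groupByChunkId<T extends { chunk_id: string }>(
  inputs: T[],
): Map<string, T[]> {
  const groups = new Map<string, T[]>();
  for (const input of inputs) {
    const group = groups.get(input.chunk_id) ?? [];
    group.push(input);
    groups.set(input.chunk_id, group);
  }
  // one bulk `create` per key; a failure for that key fails the whole group
  return groups;
}
```
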

**2.1 Chunk docs (`<index>`) are written via bulk `create`**

- The worker groups input chunks by `chunk_id` (content dedupe).
- It sends a bulk request with `create` operations for those ids.
- `409` conflicts are treated as success (expected when the chunk
already exists).

Why `create` (not `index`/`update`):
- Writing the same chunk doc again can cause re-inference / unnecessary
work for `semantic_text`.
- `create` makes “first writer wins” explicit; later occurrences don’t
mutate chunk docs.

**2.2 Location docs (`<index>_locations`) are written via bulk `index`**

- One location doc per *occurrence* (filePath + range + metadata).
- Uses `index` (overwrite) with the stable `location_id` so retries are
idempotent.
- If the exact same `location_id` appears multiple times within a single
batch (e.g. duplicate queue rows), it is de-duplicated and the error (if
any) is mapped to all affected inputs.

### 3) Failure semantics (what gets retried)

Indexing returns a per-input result:
- `succeeded[]`: input rows that were fully persisted (chunk doc exists
or already existed **and** location doc was indexed)
- `failed[]`: input rows that hit an Elasticsearch error on either phase

Key behaviors:
- If a chunk `create` item fails for a given `chunk_id`, **all inputs**
that map to that `chunk_id` are marked failed for that batch.
- Location indexing is skipped for any input already marked failed in
chunk phase.
- If a location `index` item fails for a given `location_id`, **all
inputs** that map to that `location_id` are marked failed.
- If the entire bulk request throws (network/cluster failure), all
affected inputs are marked failed.

The worker uses these results to:
- commit successful queue rows
- requeue failed queue rows (with retry tracking)

### 4) Read/query patterns (how consumers should query)

**Semantic search**
- Query `<index>.semantic_text`.
- Results are chunk docs (`chunk_id` is the hit `_id`).

**File reconstruction / file-level filtering**
- Query `<index>_locations` by `filePath` (and optionally
`directoryPath`, line ranges, branch).
- Join to chunk docs via `mget` on `<index>` using `chunk_id`.

**Directory aggregations**
- Aggregate on `<index>_locations.directoryPath`.
- If you need language/kind/symbol breakdowns, join the top `chunk_id`s
back to `<index>`.

### 5) Incremental deletion (changed/deleted files)

```text
changed/deleted file paths
  -> scan <index>_locations for those filePath values (PIT + search_after + _shard_doc)
  -> bulk delete matching location docs
  -> refresh <index>_locations (visibility for orphan check)
  -> collect affected chunk_id values
  -> delete orphan chunk docs in <index> (chunk_id with zero remaining locations)
```

Implementation details:
- Location deletion uses PIT + `_shard_doc` to paginate safely and keep
memory bounded.
- After deleting location docs, the index is refreshed so the orphan
check is correct.
- Orphan cleanup checks which `chunk_id`s still exist in
`<index>_locations` via a terms aggregation, then bulk deletes chunk
docs for ids that are absent.
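
The orphan check reduces to a set difference: a chunk doc is orphaned iff none of its locations remain. In Elasticsearch the "still referenced" set comes from the terms aggregation over `<index>_locations.chunk_id`; the sketch below models it over plain arrays:

```typescript
function findOrphanChunkIds(
  affectedChunkIds: string[],        // chunk_ids of the deleted location docs
  stillReferencedChunkIds: string[], // chunk_ids with at least one remaining location
): string[] {
  const alive = new Set(stillReferencedChunkIds);
  // dedupe the affected ids, then keep only those with zero remaining locations
  return [...new Set(affectedChunkIds)].filter((id) => !alive.has(id));
}
```
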

## Why this design (vs alternatives)

This PR needs to solve two constraints at once:

- **Correctness at scale**: a chunk can appear in many files; we must
not lose occurrences.
- **Operational stability**: writes/updates must stay predictable on
very large repos.

Alternatives we considered:

1) **Keep one doc per occurrence in `<index>`** (e.g. make `_id = hash(filePath + content)`)
   - **Pros**: single-index, simple read model.
   - **Cons**:
     - duplicates `semantic_text` / vectors for the same content (index bloat)
     - increases semantic inference work proportionally to number of occurrences
     - makes “show me the canonical chunk for this content” harder

2) **Store all occurrences on the chunk doc** (e.g. keep `filePaths[]`/`fileCount` on `<index>` and append/merge)
   - **Pros**: single query can return chunk + its locations.
   - **Cons**:
     - produces write hot-spots and very large documents for popular chunks
     - update contention (multiple writers touching the same doc)
     - forces artificial guardrails (caps) to keep documents bounded

3) **Store occurrences in `<index>_settings` instead of a dedicated index**
   - **Pros**: avoids introducing a new index name.
   - **Cons**:
     - `<index>_settings` is intended to be a small, low-churn state index (e.g. commit hashes)
     - mixing high-cardinality, high-write occurrence documents into it increases blast radius
     - prevents tuning lifecycle/refresh/retention separately from settings

4) **Use join/parent-child or nested “occurrence docs” inside `<index>`**
   - **Cons**: more complex mappings/queries and still ties occurrence scaling to the primary index.

Why `<index>_locations` wins here:

- **Unbounded locations without mega-documents** (one occurrence = one
doc).
- **Predictable write behavior**: no single chunk doc grows with repo
size.
- **Independent operational control**: `<index>_locations` can be
refreshed/queried/retained independently of settings and chunk content.

## Drawbacks of this solution (and follow-up plan)

This two-index model introduces real trade-offs:

- **Synchronization surface**: chunk writes and location writes can
partially fail.
- **Query joins**: some tools require a two-step query (locations →
`mget` chunks).
- **Migration complexity**: introducing a new index name requires
careful cutovers.

Follow-up in #132 (alias-first architecture) will improve these:

- Use **versioned indices** (e.g. `<alias> -> <index>_vN`,
`<alias>_locations -> <index>_locations_vN`) with **atomic alias
cutover**, enabling safer migrations and reindexes.
- Make it easier to run **backfill / repair** strategies when one side
(chunks vs locations) is temporarily out of sync.
- Reduce downtime risk by allowing **parallel build + switch** instead
of in-place mutation.

## Breaking change (clean reindex required)

- `<index>` no longer stores file-level fields (`filePath`, directory
fields, `startLine`/`endLine`, `git_*`).
- File-level metadata lives exclusively in `<index>_locations`.
- A clean reindex is required when migrating existing indices.

## MCP impact

- Companion MCP PR (required):
elastic/semantic-code-search-mcp-server#36

## Test plan

- [x] `npm run test:unit`
- [x] `npm run test:integration`

## Follow-up

- Alias-first / zero-downtime migration architecture: #132
@kapral18 kapral18 merged commit b948f17 into elastic:main Jan 13, 2026
3 checks passed
@kapral18 kapral18 deleted the chore/adopt-new-filepath-format branch January 13, 2026 23:58