fix: stabilize queue and move file locations to _locations index (#135)
Conversation
simianhacker left a comment:
I get this error when I try and index locally:
[2025-12-16T21:19:16.202Z] [INFO] Bulk operation completed for 1 chunks
[2025-12-16T21:19:16.202Z] [ERROR] Partial bulk failure: 1/1 documents failed {"errors":"[\n {\n \"chunk_hash\": \"55ef41cf5036f0991e2febeca14e4663ddac04a20bd76ac584cb6cfe9d314c6f\",\n \"inputIndex\": 0,\n \"error\": {\n \"type\": \"status_exception\",\n \"reason\": \"Cannot apply update with a script on indices that contain [semantic_text] field(s)\"\n }\n }\n]"}
I'm a little uneasy about the
@simianhacker sorry, this wasn't done yet; the painless script approach was an iteration. I should've marked this as a draft.
…covery
- Reconstruct files using nested filePaths[].{path,startLine,endLine}
- Aggregate directories via nested filePaths.directoryPath for correct discovery
- Widen filePath typing to string | string[]
- Add tests covering aggregated index behavior
Blocked by: elastic/semantic-code-search-indexer#135
Hey @simianhacker, the PR has changed a lot since your review, and I also updated the description, so please take another look. I also added the complementary change in elastic/semantic-code-search-mcp-server#36; you will need it to be able to test end-to-end on the MCP side. I tested locally and again everything checks out so far. To me this architecture is a solid long-term middle ground that plays really well into our next change, the alias-first architecture. Can you give this new structure a try with a clean reindex and share your thoughts/approve?
- Store file paths and line ranges in `<index>_locations` (one doc per occurrence)
- Keep `<index>` as content-deduplicated chunk docs; remove `filePaths`/`fileCount` from primary mapping
- Rewrite indexing + deletion flows to join via `chunk_id` and clean up orphan chunk docs

BREAKING CHANGE: primary chunk documents no longer store file-level metadata; clean reindex required.

Summary
This PR fixes several queue/worker correctness issues and introduces a locations-first Elasticsearch storage model.
Elasticsearch data model (new)
Given a base index name `<index>`:
- `<index>`: content-deduplicated chunk documents (semantic search + rich metadata)
- `<index>_locations`: one document per chunk occurrence (file path + line range + directory/git metadata)
- `<index>_settings`: per-index state (e.g. last indexed commit per branch)

Chunk docs are keyed by a stable `chunk_id` (the document `_id` in `<index>`). Each occurrence is keyed independently in `<index>_locations` and references `chunk_id`.

Fixes
Fixes #121 (documents with identical content overwrite each other; only one file discoverable via search):
- On `main`: bulk indexing uses `_id = chunk_hash` in `<index>`. When two files produce the same `chunk_hash`, one document overwrites the other, so only one file is discoverable.
- This PR keeps `<index>` as content-deduplicated chunk docs (one per `chunk_id`) and stores all file occurrences in `<index>_locations` (one doc per occurrence). Consumers join locations → chunks by `chunk_id`.

Fixes #133 (`SqliteError: too many SQL variables` during stale task recovery and large batch operations):
- On `main`: large `IN (...)` operations during commit/requeue/stale recovery exceed SQLite's bound-parameter limit.

Fixes #134 (IndexerWorker exits prematurely while in-flight tasks are still processing, leaving the queue stuck):
- On `main`: the worker can observe an empty queue and exit while async indexing work is still in progress, leaving rows stuck.

Fixes #136 (duplicate `chunk_hash` can leave SqliteQueue rows stuck in `processing`):
- On `main`: partial bulk failures can't be mapped reliably back to queued documents when multiple inputs share the same `chunk_hash`, and a throwing batch can leave `processing` rows stuck.

How it works (complete, code-level mental model)
0) Identifiers (what is stable and why)
Chunk document id (`chunk_id`)
`chunk_id` is a stable SHA256 and is the Elasticsearch `_id` used in `<index>`. Important invariant: identical chunk content always maps to the same `chunk_id`.
Location document id (idempotency key)
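The PR only states that both ids are stable SHA256s; the exact hash inputs are not spelled out here. A minimal sketch, assuming `chunk_id` hashes chunk content only and `location_id` hashes the occurrence coordinates plus `chunk_id` (both assumptions):

```typescript
import { createHash } from "node:crypto";

const sha256 = (s: string): string =>
  createHash("sha256").update(s).digest("hex");

// Assumed derivation: chunk_id depends on content only, so identical
// chunks collapse to one document in <index>.
const chunkId = (content: string): string => sha256(content);

// Assumed derivation: location_id depends on the occurrence coordinates,
// so a retry of the same (chunk, file, line range) overwrites the same
// _id in <index>_locations instead of creating a duplicate.
const locationId = (
  content: string,
  filePath: string,
  startLine: number,
  endLine: number
): string => sha256(`${chunkId(content)}:${filePath}:${startLine}-${endLine}`);
```

With ids built this way, retries are naturally idempotent: re-running a batch recomputes the same `_id`s and overwrites rather than duplicates.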
`<index>_locations._id` is another stable SHA256, so retries overwrite the same doc (no duplicates).

1) Index creation / lifecycle
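The ordering guarantee in this section (all three indices must exist before destructive or dequeue work) can be sketched as a pure check; names follow the PR's data model, and the actual create/ensure client calls are elided:

```typescript
// The three indices the PR derives from a base index name.
const requiredIndices = (base: string): string[] => [
  base,                // content-deduplicated chunk docs
  `${base}_settings`,  // per-index state (e.g. last indexed commit)
  `${base}_locations`, // one doc per chunk occurrence
];

// Indices that must be created before the worker may dequeue work;
// dequeuing while these are missing is what could leave queue rows
// stuck in `processing`.
const missingIndices = (base: string, existing: Set<string>): string[] =>
  requiredIndices(base).filter((name) => !existing.has(name));
```

An existing deployment that only has `<index>` and `<index>_settings` would get `<index>_locations` created on upgrade, without a clean reindex just to create the new index.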
- The indexer manages `<index>`, `<index>_settings`, and `<index>_locations`.
- It ensures `<index>_locations` exists before doing deletes/enqueues (so existing deployments can upgrade without requiring a clean reindex just to create the new index).
- It ensures `<index>` and `<index>_locations` exist before dequeuing work (so it can't leave rows stuck in `processing` due to missing indices).

2) Indexing flow (two-phase write)
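The two phases described below can be sketched as builders for the two bulk bodies. Shapes are illustrative (real chunk docs carry the `semantic_text` content and git metadata); the `Occurrence` type is a hypothetical stand-in for a queued row:

```typescript
type Occurrence = {
  chunkId: string;
  locationId: string;
  filePath: string;
  startLine: number;
  endLine: number;
};

// Phase 1: chunk docs use bulk `create` keyed by chunk_id, so the first
// writer wins and later occurrences never mutate a semantic_text doc.
// Duplicate chunk_ids within a batch get a single create op.
const chunkOps = (index: string, occ: Occurrence[]): object[] => {
  const seen = new Set<string>();
  const ops: object[] = [];
  for (const o of occ) {
    if (seen.has(o.chunkId)) continue; // one create per chunk_id per batch
    seen.add(o.chunkId);
    ops.push(
      { create: { _index: index, _id: o.chunkId } },
      { /* semantic_text content + metadata elided */ }
    );
  }
  return ops;
};

// Phase 2: location docs use bulk `index` (overwrite) keyed by the stable
// location_id, so retries are idempotent.
const locationOps = (index: string, occ: Occurrence[]): object[] =>
  occ.flatMap((o) => [
    { index: { _index: `${index}_locations`, _id: o.locationId } },
    {
      chunk_id: o.chunkId,
      filePath: o.filePath,
      startLine: o.startLine,
      endLine: o.endLine,
    },
  ]);
```

Two files with identical content produce one chunk `create` but two location `index` ops, which is exactly the dedupe-plus-discoverability split the model is after.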
2.1 Chunk docs (`<index>`) are written via bulk `create`
- Each chunk doc's `_id` is its `chunk_id` (content dedupe); the batch issues `create` operations for those ids.
- `409` conflicts are treated as success (expected when the chunk already exists).

Why `create` (not `index`/`update`): script updates are rejected on indices that contain `semantic_text` fields, and `create` makes "first writer wins" explicit; later occurrences don't mutate chunk docs.

2.2 Location docs (`<index>_locations`) are written via bulk `index`
- Each doc uses `index` (overwrite) with the stable `location_id`, so retries are idempotent.
- If a `location_id` appears multiple times within a single batch (e.g. duplicate queue rows), it is de-duplicated and the error (if any) is mapped to all affected inputs.

3) Failure semantics (what gets retried)
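The per-input mapping described in this section can be sketched for the chunk phase (the location phase works analogously on `location_id`). `ItemResult` mirrors the `_id`/`status` of an Elasticsearch bulk response item; the helper names are hypothetical:

```typescript
type ItemResult = { _id: string; status: number };

// A bulk `create` returning 409 means the chunk already exists, which this
// PR treats as success; anything else >= 300 is a real failure.
const failedChunkIds = (items: ItemResult[]): Set<string> =>
  new Set(
    items
      .filter((it) => it.status >= 300 && it.status !== 409)
      .map((it) => it._id)
  );

// Fan the per-id failures back out to every queued input that maps to a
// failed chunk_id; everything else is fully persisted.
const splitResults = <T extends { chunkId: string }>(
  inputs: T[],
  items: ItemResult[]
): { succeeded: T[]; failed: T[] } => {
  const bad = failedChunkIds(items);
  return {
    succeeded: inputs.filter((i) => !bad.has(i.chunkId)),
    failed: inputs.filter((i) => bad.has(i.chunkId)),
  };
};
```

Because the mapping is by id rather than by array position, duplicate `chunk_id`s in one batch (bug #136) all land in the same bucket instead of leaving orphaned `processing` rows.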
Indexing returns a per-input result:
- `succeeded[]`: input rows that were fully persisted (the chunk doc exists or already existed, and the location doc was indexed)
- `failed[]`: input rows that hit an Elasticsearch error on either phase

Key behaviors:
- If a `create` item fails for a given `chunk_id`, all inputs that map to that `chunk_id` are marked failed for that batch.
- If an `index` item fails for a given `location_id`, all inputs that map to that `location_id` are marked failed.

The worker uses these results to commit succeeded rows and requeue failed ones.
4) Read/query patterns (how consumers should query)
Semantic search
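A hypothetical request body for this path, assuming the chunk docs expose a `semantic_text` field named `content` (the actual field name isn't given in this description):

```typescript
// Semantic search goes straight at <index>; no join with <index>_locations
// is needed because the hit _id IS the chunk_id.
const semanticSearchBody = (queryText: string) => ({
  query: {
    semantic: {
      field: "content", // assumed semantic_text field name
      query: queryText,
    },
  },
});
```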
Search `<index>` via its `semantic_text` field (`chunk_id` is the hit `_id`).

File reconstruction / file-level filtering
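A sketch of the locations-first read path under assumed field names (`filePath`, `startLine`): filter `<index>_locations`, then fetch chunk bodies by the referenced `chunk_id` values:

```typescript
// Step 1: find all occurrences for a file, in line order.
const locationsByFileBody = (filePath: string) => ({
  query: { bool: { filter: [{ term: { filePath } }] } },
  sort: [{ startLine: "asc" }],
});

// Step 2: mget the chunk docs on <index>. Many occurrences can share one
// chunk, so the ids are de-duplicated first.
const mgetBody = (chunkIds: string[]) => ({
  ids: [...new Set(chunkIds)],
});
```

The extra round-trip is the cost of content dedupe; the trade-off is discussed under "Drawbacks" below.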
- Query `<index>_locations` by `filePath` (and optionally `directoryPath`, line ranges, branch).
- Then `mget` on `<index>` using `chunk_id`.

Directory aggregations
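A hypothetical aggregation body for directory discovery over `<index>_locations` (field names and the prefix filter are assumptions; the PR only states the aggregation runs on `directoryPath`):

```typescript
// size: 0 because only the aggregation buckets are needed, not hits.
const directoryAggBody = (prefix: string) => ({
  size: 0,
  query: { prefix: { directoryPath: prefix } },
  aggs: {
    dirs: { terms: { field: "directoryPath", size: 1000 } },
  },
});
```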
- Aggregate on `<index>_locations.directoryPath`.
- Join the resulting `chunk_id`s back to `<index>`.

5) Incremental deletion (changed/deleted files)
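The orphan cleanup described in this section can be sketched as follows: after deleting location docs for changed files, ask which candidate `chunk_id`s are still referenced (via a terms aggregation on `<index>_locations`), then delete the rest. Helper names are hypothetical:

```typescript
// Aggregation asking which of the candidate ids still have location docs.
const existingChunkIdsAggBody = (candidates: string[]) => ({
  size: 0,
  query: { terms: { chunk_id: candidates } },
  aggs: { live: { terms: { field: "chunk_id", size: candidates.length } } },
});

// Anything the aggregation did not report as live has no remaining
// occurrence and is safe to bulk-delete from <index>.
const orphanChunkIds = (candidates: string[], live: string[]): string[] => {
  const alive = new Set(live);
  return candidates.filter((id) => !alive.has(id));
};
```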
Implementation details:
- Uses `search_after` with `_shard_doc` to paginate safely and keep memory bounded.
- Checks which `chunk_id`s still exist in `<index>_locations` via a terms aggregation, then bulk deletes chunk docs for ids that are absent.

Why this design (vs alternatives)
This PR needs to solve two constraints at once: deduplicate `semantic_text` content (no re-embedding identical chunks) while keeping every file occurrence discoverable.
Alternatives we considered:
- One doc per occurrence in `<index>` (e.g. make `_id = hash(filePath + content)`): duplicates `semantic_text` / vectors for the same content (index bloat).
- Keep `filePaths[]`/`fileCount` on `<index>` and append/merge: appending requires updating docs that carry `semantic_text` fields, which Elasticsearch rejects for script updates.
- Store locations in `<index>_settings` instead of a dedicated index: `<index>_settings` is intended to be a small, low-churn state index (e.g. commit hashes), not per-occurrence data.

Why `<index>_locations` wins here: `<index>_locations` can be refreshed/queried/retained independently of settings and chunk content.

Drawbacks of this solution (and follow-up plan)
This two-index model introduces real trade-offs:
- Reads that need file-level context require a join (extra round-trip to `mget` chunks).

Follow-up in #132 (alias-first architecture) will improve these:
- Versioned indices behind aliases (`<alias> -> <index>_vN`, `<alias>_locations -> <index>_locations_vN`) with atomic alias cutover, enabling safer migrations and reindexes.

Breaking change (clean reindex required)
- `<index>` no longer stores file-level fields (`filePath`, directory fields, `startLine`/`endLine`, `git_*`); these now live in `<index>_locations`.

MCP impact
Test plan
- `npm run test:unit`
- `npm run test:integration`

Follow-up