Indexing leaks ~9 MB anon-rss per chunk in single process; OOM-kills mid-corpus on 15 GB host

**Bug**: indexing a multi-MB markdown corpus with the ONNX provider walks anon-rss to ~11.5 GB, and the kernel OOM-kills the process partway through. Memory grows with the number of chunks indexed within a single process, and restarting the process releases it normally, so the leak lives in long-lived in-process state. The arena experiment under Observed points at the `MemSearch`/`pymilvus`/`milvus-lite` interaction rather than at ONNX itself.

The naïve "split work into separate `memsearch index <file>` calls" workaround is sabotaged by `MemSearch.index()`'s implicit stale-source cleanup (`delete_by_source` for any source not in the just-passed paths) — per-file CLI calls treat each other's files as deleted and wipe them.

### Environment
- memsearch 0.4.1 (pip)
- pymilvus 2.5.x (whatever 0.4.1 pulls in), milvus-lite from same constraint
- onnxruntime 1.24.4, model `gpahal/bge-m3-onnx-int8` (the plugin default)
- Python 3.13.x in a venv
- Linux 6.12 / 4-vCPU AMD EPYC 9354P / **15 GB RAM, no swap**
- Corpus: ~518 markdown files, ~7 MB total, ~3000 expected chunks at `max_chunk_size=4000` chars

### Reproduction
```bash
memsearch index --batch-size 4 --max-chunk-size 4000 --force <full-corpus-paths>
```

Equivalent in code: a single `MemSearch.index()` call against the full corpus.

### Observed
| Run | `--batch-size` | `--max-chunk-size` | ONNX arena | Chunks before OOM | Peak anon-rss | Peak total-vm |
|---|---|---|---|---|---|---|
| 1 | 8    | 4000 | enabled  | ~80   | 11.85 GB | 17.97 GB |
| 2 | 4    | 4000 | enabled  | ~628  | 11.55 GB | 18.77 GB |
| 3 | 4    | 4000 | **disabled** (`SessionOptions.enable_cpu_mem_arena=False`) | ~628  | 11.53 GB | 12.94 GB |
| 4 | 4    | 2000 | enabled  | ~1186 | 11.55 GB | 18.77 GB |

Kernel log:
```
kernel: memsearch invoked oom-killer: gfp_mask=0x140cca, order=0, oom_score_adj=0
kernel: Out of memory: Killed process <pid> (memsearch) total-vm:18756488kB, anon-rss:11587564kB,
        file-rss:156kB, shmem-rss:0kB, UID:1001 pgtables:23552kB oom_score_adj:0
```

Per-chunk leak ≈ `(peak_anon_rss − baseline) / chunks_processed`:
- run 2: ≈ 16 MB/chunk
- run 4: ≈ 8.4 MB/chunk

Halving the chunk size roughly halves the per-chunk leak, so part of it scales with payload size (tokenizer / batch padding / activations). But disabling the ONNX CPU memory arena (run 3) does **not** materially reduce peak anon-rss (11.53 GB vs 11.55 GB); only `total-vm` drops, by ~6 GB. So the dominant leaked bytes are not in the ONNX session arena.
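
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the baseline for runs 2 and 4 is not stated here, so the 1153 MB start RSS from the per-file probe is used as a stand-in (an assumption), and the results will differ somewhat from the rounded figures in the list.

```python
def per_chunk_leak_mb(peak_anon_rss_mb: float, baseline_mb: float, chunks: int) -> float:
    """Average anon-rss growth per chunk, in MB."""
    if chunks <= 0:
        raise ValueError("chunks must be positive")
    return (peak_anon_rss_mb - baseline_mb) / chunks


# ASSUMPTION: baseline borrowed from the per-file probe's start RSS;
# the baseline actually used for runs 2 and 4 is not given.
BASELINE_MB = 1153.0

run2 = per_chunk_leak_mb(11_550, BASELINE_MB, 628)   # max_chunk_size=4000
run4 = per_chunk_leak_mb(11_550, BASELINE_MB, 1186)  # max_chunk_size=2000

print(f"run 2: {run2:.1f} MB/chunk")
print(f"run 4: {run4:.1f} MB/chunk")
print(f"run2 / run4: {run2 / run4:.2f}")
```

Because both runs hit the same peak, the ratio between them is just `1186 / 628` and does not depend on the baseline chosen.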

### Per-file probe (instrumented `index_file` loop in one process)
Indexed `Log/DECISIONS.md` (934 KB → 541 chunks) as the first file:

```
start RSS = 1153 MB
[1/518] +541 chunks  rss = 6064 MB  Log/DECISIONS.md
[2/518] + 39 chunks  rss = 6064 MB  Log/OC-RETIREMENT.md
```

A single big file alone drives RSS up by ~4.9 GB over 541 chunks, ≈ 9 MB/chunk, and that memory is still held after the file completes. The next, smaller file does **not** add memory: RSS plateaus until another large file appears, then climbs again. So peak anon-rss behaves like a high-water mark set by the largest number of chunks in flight so far, plus a per-chunk persistent residue that appears to sit in the long-lived MilvusClient state.
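
A probe of this shape can be sketched as below. It is not the script used for the output above: `index_one` is a hypothetical callable standing in for whatever wraps `index_file` and returns the number of chunks added, and RSS is read from `/proc/self/status`, so it is Linux-only.

```python
from pathlib import Path
from typing import Callable, Sequence


def read_rss_mb() -> dict[str, float]:
    """Return VmRSS and RssAnon in MB from /proc/self/status (empty off Linux)."""
    fields: dict[str, float] = {}
    try:
        text = Path("/proc/self/status").read_text()
    except OSError:
        return fields
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmRSS", "RssAnon"):
            fields[key] = int(rest.split()[0]) / 1024  # kernel reports kB
    return fields


def probe(paths: Sequence[str], index_one: Callable[[str], int]) -> list[tuple[str, int, float]]:
    """Index each path in this process and record RSS after every file."""
    rows = []
    print(f"start RSS = {read_rss_mb().get('VmRSS', float('nan')):.0f} MB")
    for i, path in enumerate(paths, 1):
        added = index_one(path)  # hypothetical wrapper around index_file
        rss = read_rss_mb().get("VmRSS", float("nan"))
        rows.append((path, added, rss))
        print(f"[{i}/{len(paths)}] +{added:3d} chunks  rss = {rss:.0f} MB  {path}")
    return rows


def fake_index_one(path: str) -> int:
    """Stand-in for real indexing; allocates nothing and reports zero chunks."""
    return 0


rows = probe(["Log/DECISIONS.md", "Log/OC-RETIREMENT.md"], fake_index_one)
```

Logging `RssAnon` alongside `VmRSS` would make the numbers directly comparable with the `anon-rss` figure in the kernel OOM report.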

### Naïve workaround that doesn't work
Calling `memsearch index <single-file>` once per file in a shell loop loses data. Each call's `MemSearch.index()` ends with:

```python
# memsearch/core.py — at end of index()
indexed_sources = self._store.indexed_sources()
for source in indexed_sources:
    if source not in active_sources:
        self._store.delete_by_source(source)
```

`active_sources` is built from the paths just passed to that invocation, so per-file calls treat the rest of the corpus as deleted. After a 6-file shell loop the DB only contains the last file's chunks (39 of `OC-RETIREMENT.md`).
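
The effect is easy to see in a toy model. The classes below are hypothetical stand-ins, not memsearch code; they only mimic the cleanup loop quoted above.

```python
class ToyStore:
    """In-memory stand-in for the vector store: source path -> chunk count."""

    def __init__(self) -> None:
        self.chunks: dict[str, int] = {}


class ToyIndexer:
    """Mimics index() with stale-source cleanup and index_file() without it."""

    def __init__(self, store: ToyStore) -> None:
        self.store = store

    def index_file(self, path: str, n_chunks: int = 1) -> None:
        self.store.chunks[path] = n_chunks

    def index(self, paths: list[str]) -> None:
        active_sources = set(paths)
        for path in paths:
            self.index_file(path)
        # Same shape as the cleanup at the end of index():
        for source in list(self.store.chunks):
            if source not in active_sources:
                del self.store.chunks[source]


corpus = ["a.md", "b.md", "c.md"]

looped = ToyStore()
for path in corpus:  # one index() call per file, like the shell loop
    ToyIndexer(looped).index([path])
print(sorted(looped.chunks))  # ['c.md']

per_file = ToyStore()
for path in corpus:  # index_file() per file, no cleanup
    ToyIndexer(per_file).index_file(path)
print(sorted(per_file.chunks))  # ['a.md', 'b.md', 'c.md']
```

Each `index()` call sees only its own path as active, so every earlier file is deleted as stale.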

Two paths to fix this from the user side:
1. Use `MemSearch.index_file(path)` directly (no stale-source cleanup) and shard the corpus into byte-budget batches with subprocess restarts.
2. Pass the full corpus path list on every call (so cleanup is a no-op) and rely on chunk-hash dedup to skip already-indexed chunks. This doesn't avoid the OOM: the first call still has to embed the whole corpus in one process.

We went with (1).

### Working workaround (open to PRing back)
- A small Python helper that calls `index_file` per path (no destructive cleanup): https://github.com/<your-fork>/blob/.../recall-index-files.py
- A bash batcher that sorts files biggest-first, packs into 1 MB / 64-file batches, and runs each batch in its own subprocess so the heap resets between batches: https://github.com/<your-fork>/blob/.../recall-reindex.sh
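
The packing logic is simple enough to sketch in Python (the real batcher is the bash script linked above; this is an illustration, not a port of it). Assumptions: "1 MB" is taken as 1,000,000 bytes, a file larger than the budget gets a batch of its own, and `run_batches` takes the path of a helper script that indexes the files given as arguments.

```python
import subprocess
import sys


def pack_batches(
    sizes: dict[str, int],
    max_bytes: int = 1_000_000,  # ASSUMPTION: "1 MB" as decimal megabyte
    max_files: int = 64,
) -> list[list[str]]:
    """Sort files biggest-first and pack them into byte- and count-limited batches."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    for path, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        over_budget = current_bytes + size > max_bytes
        if current and (over_budget or len(current) >= max_files):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


def run_batches(batches: list[list[str]], helper: str) -> None:
    """Run each batch in a fresh interpreter so the heap resets between batches."""
    for batch in batches:
        subprocess.run([sys.executable, helper, *batch], check=True)


example = pack_batches(
    {"big.md": 934_000, "a.md": 40_000, "b.md": 30_000, "c.md": 20_000}
)
print(example)  # [['big.md', 'a.md'], ['b.md', 'c.md']]
```

Sorting biggest-first means the files that drive RSS hardest each land early in their own batch, with the remaining budget filled by smaller files.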

Throughput is essentially unaffected: total wall time is about what a single in-process run would take if it didn't OOM, because each batch pays only ~5 s of model load and the rest is dominated by ONNX inference time.

### Suggested upstream fixes
1. **Document or eliminate the implicit stale-source cleanup in `MemSearch.index()`** (or split it into an explicit `prune_deleted_sources()` API). The current behavior is surprising for callers who shard their corpus across multiple invocations.
2. **Investigate the per-chunk persistent leak in long-lived `MemSearch` instances.** Best guess based on this evidence: pymilvus's `MilvusClient` retains references to inserted records (or gRPC / Arrow buffers) across `upsert` calls. Halving the chunk size roughly halves the per-chunk leak, which is consistent with payload retention rather than schema overhead. Worth instrumenting with `tracemalloc` snapshots before / after each `_embed_and_store` call on a long corpus.
3. **Consider periodic `gc.collect()` + `malloc_trim(0)` between files** as a stopgap. (Only helps if the leak is reachable garbage, not strong refs.)
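
Suggestions 2 and 3 could be prototyped roughly as follows. This is a sketch under stated assumptions: the list comprehension stands in for one `_embed_and_store` call, and `malloc_trim` is only available on glibc, so `trim_heap` falls back to a no-op elsewhere. Note that `tracemalloc` only sees allocations made through Python's allocators; buffers held by native code (gRPC, Arrow, milvus-lite) will not show up in its snapshots.

```python
import ctypes
import ctypes.util
import gc
import tracemalloc


def trim_heap() -> bool:
    """Collect Python garbage, then ask glibc to return free heap pages to the OS.

    Returns True only if malloc_trim reports that memory was released.
    """
    gc.collect()
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
        return bool(libc.malloc_trim(0))
    except (OSError, AttributeError):
        return False  # no glibc malloc_trim here (e.g. musl, macOS)


def top_growth(before: tracemalloc.Snapshot, after: tracemalloc.Snapshot, limit: int = 5):
    """Largest per-line allocation differences between two snapshots."""
    return after.compare_to(before, "lineno")[:limit]


tracemalloc.start()
before = tracemalloc.take_snapshot()
retained = [bytes(1000) for _ in range(1000)]  # stand-in for one _embed_and_store call
after = tracemalloc.take_snapshot()
tracemalloc.stop()

for stat in top_growth(before, after, limit=3):
    print(stat)

print("malloc_trim released memory:", trim_heap())
```

If the snapshots show little growth while RSS keeps climbing, that would itself suggest the retained bytes live in native memory rather than in Python objects.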

Happy to test patches against the reproduction corpus if useful.

