Bug: indexing a multi-MB markdown corpus with the ONNX provider walks anon-rss to ~11.5 GB and the kernel OOM-kills the process partway through. Memory growth is linear in chunks indexed within a single process; per-process restart releases memory normally, so the leak is in long-lived state inside MemSearch/pymilvus/milvus-lite interaction rather than ONNX itself.
The naïve "split work into separate memsearch index <file> calls" workaround is sabotaged by MemSearch.index()'s implicit stale-source cleanup (delete_by_source for any source not in the just-passed paths) — per-file CLI calls treat each other's files as deleted and wipe them.
Environment
- memsearch 0.4.1 (pip)
- pymilvus 2.5.x (whatever 0.4.1 pulls in), milvus-lite from same constraint
- onnxruntime 1.24.4, model gpahal/bge-m3-onnx-int8 (the plugin default)
- Python 3.13.x in a venv
- Linux 6.12 / 4-vCPU AMD EPYC 9354P / 15 GB RAM, no swap
- Corpus: ~518 markdown files, ~7 MB total, ~3000 expected chunks at max_chunk_size=4000 chars
Reproduction
memsearch index --batch-size 4 --max-chunk-size 4000 --force <full-corpus-paths>
Equivalent in code: a single MemSearch.index() call against the full corpus.
Observed
| Run | --batch-size | --max-chunk-size | ONNX arena | Chunks before OOM | Peak anon-rss | Peak total-vm |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 8 | 4000 | enabled | ~80 | 11.85 GB | 17.97 GB |
| 2 | 4 | 4000 | enabled | ~628 | 11.55 GB | 18.77 GB |
| 3 | 4 | 4000 | disabled (SessionOptions.enable_cpu_mem_arena=False) | ~628 | 11.53 GB | 12.94 GB |
| 4 | 4 | 2000 | enabled | ~1186 | 11.55 GB | 18.77 GB |
Kernel log:
kernel: memsearch invoked oom-killer: gfp_mask=0x140cca, order=0, oom_score_adj=0
kernel: Out of memory: Killed process <pid> (memsearch) total-vm:18756488kB, anon-rss:11587564kB,
file-rss:156kB, shmem-rss:0kB, UID:1001 pgtables:23552kB oom_score_adj:0
Per-chunk leak ≈ (peak_anon_rss − baseline) / chunks_processed:
- run 2: ≈ 16 MB/chunk
- run 4: ≈ 8.4 MB/chunk
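For reference, the estimate can be redone as a short calculation. This is only a sketch: the baseline is an assumption (the 1153 MB start RSS from the per-file probe, which came from a different run), and the kernel's `kB` figures are treated as KiB. The function name `per_chunk_mib` is made up for illustration.

```python
# Worked arithmetic for (peak_anon_rss - baseline) / chunks_processed.
# ASSUMPTION: the baseline is the 1153 MB start RSS seen in the per-file probe.
# ASSUMPTION: the kernel log's "kB" values are KiB.
KIB_PER_MIB = 1024


def per_chunk_mib(peak_anon_rss_kib: int, baseline_mib: float, chunks: int) -> float:
    """Return the memory growth per processed chunk, in MiB."""
    peak_mib = peak_anon_rss_kib / KIB_PER_MIB
    return (peak_mib - baseline_mib) / chunks


PEAK_ANON_RSS_KIB = 11_587_564  # anon-rss from the kernel log above
BASELINE_MIB = 1153             # assumed baseline, see lead-in

for run, chunks in (("run 2", 628), ("run 4", 1186)):
    value = per_chunk_mib(PEAK_ANON_RSS_KIB, BASELINE_MIB, chunks)
    print(f"{run}: {value:.1f} MiB/chunk")
```

The results land in the same range as the figures above; the exact values shift with the baseline chosen.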
Per-chunk leak halves with chunk size halved → some of it is working-set memory (tokenizer / batch pad / activations). But disabling the ONNX CPU mem arena (run 3) does not materially reduce peak anon-rss — only total-vm drops by ~6 GB. So the dominant leaked bytes are not in the ONNX session arena.
Per-file probe (instrumented index_file loop in one process)
Indexed Log/DECISIONS.md (934 KB → 541 chunks) as the first file:
start RSS = 1153 MB
[1/518] +541 chunks rss = 6064 MB Log/DECISIONS.md
[2/518] + 39 chunks rss = 6064 MB Log/OC-RETIREMENT.md
A single big file alone drives RSS up by ~5 GB / 541 chunks ≈ 9 MB/chunk, and that memory is not released once the file is done. The next, smaller file does not add memory: RSS plateaus until another large file appears, then climbs again. So peak anon-rss appears to track the largest number of chunks in flight so far, plus a per-chunk persistent residue in the long-lived MilvusClient state.
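A probe of this shape can be sketched as below. It is not the exact script used here. `index_one` is a hypothetical callable supplied by the caller (for example a thin wrapper around `MemSearch.index_file`) that indexes one path and returns the number of chunks it produced. RSS is read from `/proc/self/statm`, so the sketch is Linux-only.

```python
import os
from typing import Callable, Iterable


def current_rss_mib() -> float:
    """Resident set size of this process in MiB (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # second field = resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)


def probe(
    paths: Iterable[str],
    index_one: Callable[[str], int],  # HYPOTHETICAL: indexes a path, returns chunk count
) -> list[tuple[str, int, float]]:
    """Index each path in one process and record RSS after each file."""
    paths = list(paths)
    rows = []
    print(f"start RSS = {current_rss_mib():.0f} MB")
    for i, path in enumerate(paths, 1):
        chunks = index_one(path)
        rss = current_rss_mib()
        rows.append((path, chunks, rss))
        print(f"[{i}/{len(paths)}] +{chunks:3d} chunks rss = {rss:.0f} MB {path}")
    return rows
```

Returning the rows as well as printing them makes it easy to diff RSS between consecutive files afterwards.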
Naïve workaround that doesn't work
Calling memsearch index <single-file> per file in a shell loop. Each call's MemSearch.index() runs:
    # memsearch/core.py, at end of index()
    indexed_sources = self._store.indexed_sources()
    for source in indexed_sources:
        if source not in active_sources:
            self._store.delete_by_source(source)
active_sources is built from the paths just passed to that invocation, so per-file calls treat the rest of the corpus as deleted. After a 6-file shell loop the DB only contains the last file's chunks (39 of OC-RETIREMENT.md).
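The effect is easy to reproduce with a toy model. The sketch below is not memsearch code: `ToyStore` and `toy_index` are made-up stand-ins that mimic only the cleanup loop quoted above, with chunk counts taken from the probe.

```python
class ToyStore:
    """In-memory stand-in for the vector store: source path -> chunk count."""

    def __init__(self) -> None:
        self.chunks: dict[str, int] = {}

    def indexed_sources(self) -> list[str]:
        return list(self.chunks)  # copy, so deleting while iterating is safe

    def delete_by_source(self, source: str) -> None:
        self.chunks.pop(source, None)


def toy_index(store: ToyStore, paths: dict[str, int]) -> None:
    """Mimic index(): store the given paths, then prune everything else."""
    for path, n_chunks in paths.items():
        store.chunks[path] = n_chunks
    active_sources = set(paths)
    for source in store.indexed_sources():
        if source not in active_sources:
            store.delete_by_source(source)


corpus = {"Log/DECISIONS.md": 541, "Log/OC-RETIREMENT.md": 39}

store = ToyStore()
for path, n_chunks in corpus.items():  # one "CLI call" per file
    toy_index(store, {path: n_chunks})

print(store.chunks)  # only the last file survives
```

Passing the whole `corpus` dict to a single `toy_index` call keeps both entries, which is why the destructive behavior only shows up once the work is sharded.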
Two options from the user side:
1. Use MemSearch.index_file(path) directly (no stale-source cleanup) and shard the corpus into byte-budget batches with subprocess restarts.
2. Pass the full corpus path list on every call (so cleanup is a no-op) and rely on chunk-hash dedup to skip already-indexed chunks. This doesn't help: the first call still has to embed the whole corpus.
We went with (1).
Working workaround (open to PRing back)
- A small Python helper that calls index_file per path (no destructive cleanup): https://github.com//blob/.../recall-index-files.py
- A bash batcher that sorts files biggest-first, packs into 1 MB / 64-file batches, and runs each batch in its own subprocess so the heap resets between batches: https://github.com//blob/.../recall-reindex.sh
Throughput is essentially unaffected: total time is about the same as a single in-process run would take if it didn't OOM, because per-batch model load is ~5 s and the rest is dominated by ONNX inference time.
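The batching logic can be sketched in Python as follows. This is not the linked bash script. It makes three assumptions: packing is sequential and greedy, "1 MB" means 1,000,000 bytes, and a file larger than the budget gets a batch to itself. `pack_batches` and `run_batches` are illustrative names, and the command prefix passed to `run_batches` is whatever helper invocation the caller uses.

```python
import subprocess


def pack_batches(
    sizes: dict[str, int],          # path -> size in bytes
    max_bytes: int = 1_000_000,     # ASSUMPTION: "1 MB" byte budget
    max_files: int = 64,
) -> list[list[str]]:
    """Sort biggest-first, then fill batches up to the byte or file limit."""
    ordered = sorted(sizes, key=lambda p: sizes[p], reverse=True)
    batches: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    for path in ordered:
        size = sizes[path]
        over_budget = current_bytes + size > max_bytes
        if current and (over_budget or len(current) >= max_files):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(path)  # an oversized file still gets its own batch
        current_bytes += size
    if current:
        batches.append(current)
    return batches


def run_batches(batches: list[list[str]], cmd_prefix: list[str]) -> None:
    """Run each batch in a fresh subprocess so memory is released between batches."""
    for batch in batches:
        subprocess.run([*cmd_prefix, *batch], check=True)
```

Sorting biggest-first puts the largest files in their own early batches, so a batch's peak memory is bounded by roughly one byte budget's worth of chunks.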
Suggested upstream fixes
- Document or eliminate the implicit stale-source cleanup in MemSearch.index() (or split it into an explicit prune_deleted_sources() API). The current behavior is surprising for callers who shard their corpus across multiple invocations.
- Investigate the per-chunk persistent leak in long-lived MemSearch instances. Best guess based on this evidence: pymilvus's MilvusClient retains references to inserted records (or grpc / arrow buffers) across upsert calls; smaller chunks halve the per-chunk leak which is consistent with payload retention rather than schema overhead. Worth instrumenting with tracemalloc snapshots before / after each _embed_and_store call on a long corpus.
- Consider periodic gc.collect() + malloc_trim(0) between files as a stopgap. (Only helps if the leak is reachable garbage, not strong refs.)
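The last two suggestions could look roughly like the sketch below. `top_python_growth` and `release_free_heap` are illustrative helpers, not memsearch APIs. `work` is whatever callable wraps the step under test (for example one `_embed_and_store` call), and `malloc_trim` is loaded from glibc via `ctypes`, so the trim is a no-op on other libcs.

```python
import ctypes
import gc
import tracemalloc
from typing import Callable


def top_python_growth(work: Callable[[], object], limit: int = 5):
    """Run work() between two tracemalloc snapshots and return the biggest growths."""
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    before = tracemalloc.take_snapshot()
    work()
    after = tracemalloc.take_snapshot()
    if started_here:
        tracemalloc.stop()
    # Sorted by absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:limit]


def release_free_heap() -> bool:
    """gc.collect() then glibc malloc_trim(0); True if memory was returned to the OS."""
    gc.collect()
    try:
        libc = ctypes.CDLL("libc.so.6")  # ASSUMPTION: glibc-based Linux
        return bool(libc.malloc_trim(0))
    except (OSError, AttributeError):
        return False  # non-glibc platform: nothing to trim


if __name__ == "__main__":
    retained = []
    for stat in top_python_growth(lambda: retained.append(bytearray(1_000_000))):
        print(stat)
    print("trimmed:", release_free_heap())
```

tracemalloc only sees allocations made through Python's allocator. If RSS keeps climbing while these diffs stay flat, the growth is in native memory (onnxruntime, milvus-lite, grpc), which would narrow the search considerably.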
Happy to test patches against the reproduction corpus if useful.