Indexing leaks ~9 MB anon-rss per chunk in single process; OOM-kills mid-corpus on 15 GB host #533

@rayrage213

Bug: indexing a multi-MB markdown corpus with the ONNX provider drives anon-rss up to ~11.5 GB, and the kernel OOM-kills the process partway through. Memory growth is linear in the number of chunks indexed within a single process. Restarting the process releases the memory normally, so the leak appears to be in long-lived state in the MemSearch / pymilvus / milvus-lite interaction rather than in ONNX itself.

The naïve workaround of splitting the work into separate memsearch index <file> calls is defeated by MemSearch.index()'s implicit stale-source cleanup (delete_by_source for any source not in the just-passed paths): each per-file CLI call treats every other file as deleted and wipes its chunks.

Environment

  • memsearch 0.4.1 (pip)
  • pymilvus 2.5.x (whatever 0.4.1 pulls in), milvus-lite from same constraint
  • onnxruntime 1.24.4, model gpahal/bge-m3-onnx-int8 (the plugin default)
  • Python 3.13.x in a venv
  • Linux 6.12 / 4-vCPU AMD EPYC 9354P / 15 GB RAM, no swap
  • Corpus: ~518 markdown files, ~7 MB total, ~3000 expected chunks at max_chunk_size=4000 chars

Reproduction

memsearch index --batch-size 4 --max-chunk-size 4000 --force <full-corpus-paths>

Equivalent in code: a single MemSearch.index() call against the full corpus.

Observed

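| Run | --batch-size | --max-chunk-size | ONNX arena | Chunks before OOM | Peak anon-rss | Peak total-vm |
|-----|--------------|------------------|------------|-------------------|---------------|---------------|
| 1 | 8 | 4000 | enabled | ~80 | 11.85 GB | 17.97 GB |
| 2 | 4 | 4000 | enabled | ~628 | 11.55 GB | 18.77 GB |
| 3 | 4 | 4000 | disabled (SessionOptions.enable_cpu_mem_arena=False) | ~628 | 11.53 GB | 12.94 GB |
| 4 | 4 | 2000 | enabled | ~1186 | 11.55 GB | 18.77 GB |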

Kernel log:

kernel: memsearch invoked oom-killer: gfp_mask=0x140cca, order=0, oom_score_adj=0
kernel: Out of memory: Killed process <pid> (memsearch) total-vm:18756488kB, anon-rss:11587564kB,
        file-rss:156kB, shmem-rss:0kB, UID:1001 pgtables:23552kB oom_score_adj:0

Per-chunk leak ≈ (peak_anon_rss − baseline) / chunks_processed:

  • run 2: ≈ 16 MB/chunk
  • run 4: ≈ 8.4 MB/chunk

Halving the chunk size roughly halves the per-chunk leak, so some of it is working-set memory (tokenizer / batch padding / activations). But disabling the ONNX CPU memory arena (run 3) does not materially reduce peak anon-rss; only total-vm drops, by ~6 GB. So the dominant leaked bytes are not in the ONNX session arena.

Per-file probe (instrumented index_file loop in one process)

Indexed Log/DECISIONS.md (934 KB → 541 chunks) as the first file:

start RSS = 1153 MB
[1/518] +541 chunks  rss = 6064 MB  Log/DECISIONS.md
[2/518] + 39 chunks  rss = 6064 MB  Log/OC-RETIREMENT.md

A single large file drives RSS up by about 4.9 GB (6064 MB − 1153 MB) over 541 chunks, or ≈ 9 MB/chunk that persists after the file completes. The next, smaller file adds no memory: RSS plateaus until another large file appears, then climbs again. So peak anon-rss appears to track the largest number of chunks in flight seen so far, plus a per-chunk persistent residue in the long-lived MilvusClient state.
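The arithmetic behind the ≈ 9 MB/chunk figure, as a minimal sketch. The helper names rss_mb and per_chunk_growth_mb are hypothetical and not part of memsearch. rss_mb reads VmRSS from /proc, which is only an assumption about how such a probe could sample memory on Linux, not a description of the instrumentation actually used.

```python
from pathlib import Path


def rss_mb():
    """Resident set size of this process in MB, or None where /proc is unavailable."""
    status = Path("/proc/self/status")
    if not status.exists():
        return None
    for line in status.read_text().splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1]) / 1024  # the kernel reports this value in kB
    return None


def per_chunk_growth_mb(rss_before_mb, rss_after_mb, chunks):
    """Average RSS growth per chunk between two samples."""
    return (rss_after_mb - rss_before_mb) / chunks


# Numbers from the probe output above: 1153 MB at start, 6064 MB after 541 chunks.
growth = per_chunk_growth_mb(1153, 6064, 541)
print(f"{growth:.1f} MB/chunk")  # 9.1 MB/chunk
```

Sampling rss_mb() before and after each index_file call would give the same kind of per-file trace as the probe output.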

Naïve workaround that doesn't work

Calling memsearch index <single-file> per file in a shell loop. Each call's MemSearch.index() runs:

# memsearch/core.py — at end of index()
indexed_sources = self._store.indexed_sources()
for source in indexed_sources:
    if source not in active_sources:
        self._store.delete_by_source(source)

active_sources is built from the paths just passed to that invocation, so per-file calls treat the rest of the corpus as deleted. After a 6-file shell loop the DB only contains the last file's chunks (39 of OC-RETIREMENT.md).
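The effect can be reproduced with a toy model that needs no memsearch install. ToyStore and toy_index are hypothetical stand-ins that mirror only the cleanup loop quoted above; they are not the library's actual classes, and the chunk counts are the two from the probe.

```python
class ToyStore:
    """In-memory stand-in for the vector store: source path -> chunk count."""

    def __init__(self):
        self.chunks = {}

    def indexed_sources(self):
        return list(self.chunks)

    def delete_by_source(self, source):
        self.chunks.pop(source, None)


def toy_index(store, paths, chunk_counts):
    """Store the given paths, then prune every source not passed to this call."""
    active_sources = set(paths)
    for path in paths:
        store.chunks[path] = chunk_counts[path]
    for source in store.indexed_sources():
        if source not in active_sources:
            store.delete_by_source(source)


chunk_counts = {"Log/DECISIONS.md": 541, "Log/OC-RETIREMENT.md": 39}

# One call per file, like the shell loop: each call prunes the previous file.
per_file = ToyStore()
for path in chunk_counts:
    toy_index(per_file, [path], chunk_counts)
print(per_file.chunks)  # {'Log/OC-RETIREMENT.md': 39}

# One call with every path: nothing is pruned.
full = ToyStore()
toy_index(full, list(chunk_counts), chunk_counts)
print(full.chunks)  # {'Log/DECISIONS.md': 541, 'Log/OC-RETIREMENT.md': 39}
```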

Two paths to fix this from the user side:

  1. Use MemSearch.index_file(path) directly (no stale-source cleanup) and shard the corpus into byte-budget batches with subprocess restarts.
  2. Pass the full corpus path list on every call (so cleanup is a no-op) and rely on chunk-hash dedup to skip already-indexed chunks. This doesn't help: the first call still has to embed the whole corpus in one process.

We went with (1).

Working workaround (open to PRing back)

  • A small Python helper that calls index_file per path (no destructive cleanup): https://github.com//blob/.../recall-index-files.py
  • A bash batcher that sorts files biggest-first, packs into 1 MB / 64-file batches, and runs each batch in its own subprocess so the heap resets between batches: https://github.com//blob/.../recall-reindex.sh
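For readers who would rather not follow the links, the batching idea can be sketched in Python. This is not the linked script. pack_batches and run_batches are hypothetical names, treating 1 MB as 1,000,000 bytes is an assumption, and so is the helper accepting file paths as command-line arguments.

```python
import subprocess
import sys


def pack_batches(sizes, max_bytes=1_000_000, max_files=64):
    """Greedy packing, biggest file first.

    sizes maps path -> size in bytes. A file larger than max_bytes
    still gets a batch of its own rather than being dropped.
    """
    batches, current, current_bytes = [], [], 0
    by_size = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    for path, size in by_size:
        over_budget = current_bytes + size > max_bytes or len(current) >= max_files
        if current and over_budget:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


def run_batches(batches, helper="recall-index-files.py"):
    """Run each batch in a fresh interpreter so the heap resets between batches.

    Assumes the helper takes the file paths as argv (not confirmed by the source).
    """
    for batch in batches:
        subprocess.run([sys.executable, helper, *batch], check=True)


example = {"a.md": 934_000, "b.md": 60_000, "c.md": 50_000, "d.md": 10_000}
print(pack_batches(example))  # [['a.md', 'b.md'], ['c.md', 'd.md']]
```

Sorting biggest-first means the files that allocate the most each start near the front of a batch, when the subprocess heap is still small.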

Throughput is essentially unaffected: total wall time is about what a single in-process run would take if it didn't OOM, because per-batch model load is ~5 s and the rest is dominated by ONNX inference time.

Suggested upstream fixes

  1. Document or eliminate the implicit stale-source cleanup in MemSearch.index() (or split it into an explicit prune_deleted_sources() API). The current behavior is surprising for callers who shard their corpus across multiple invocations.
  2. Investigate the per-chunk persistent leak in long-lived MemSearch instances. Best guess based on this evidence: pymilvus's MilvusClient retains references to inserted records (or grpc / arrow buffers) across upsert calls; smaller chunks halve the per-chunk leak which is consistent with payload retention rather than schema overhead. Worth instrumenting with tracemalloc snapshots before / after each _embed_and_store call on a long corpus.
  3. Consider periodic gc.collect() + malloc_trim(0) between files as a stopgap. (Only helps if the leak is reachable garbage, not strong refs.)
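A rough sketch of what items 2 and 3 could look like in practice. top_growth and release_memory are hypothetical helpers, and fake_embed_and_store merely stands in for the real _embed_and_store call. One caveat: tracemalloc only sees allocations made through Python's allocators, so buffers held by native code in grpc, arrow, or milvus-lite would likely not show up. malloc_trim is glibc-specific, so the helper returns False where it is missing.

```python
import ctypes
import ctypes.util
import gc
import tracemalloc


def top_growth(work, limit=5):
    """Run work() and return the source lines whose Python allocations grew most."""
    if not tracemalloc.is_tracing():
        tracemalloc.start()
    before = tracemalloc.take_snapshot()
    work()
    after = tracemalloc.take_snapshot()
    return after.compare_to(before, "lineno")[:limit]


def release_memory():
    """Collect garbage, then ask glibc to return free heap pages to the OS.

    Returns True if malloc_trim was called, False if it is unavailable.
    """
    gc.collect()
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    try:
        ctypes.CDLL(libc_name).malloc_trim(0)
    except (OSError, AttributeError):
        return False
    return True


retained = []  # simulates long-lived state that keeps payloads alive


def fake_embed_and_store():
    retained.append(bytearray(1_000_000))


for stat in top_growth(fake_embed_and_store):
    print(stat)
release_memory()
```

If the top entries stay flat while RSS keeps climbing, that would point at native allocations rather than retained Python objects.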

Happy to test patches against the reproduction corpus if useful.
