
Memory leak in consolidation worker during large batch processing #418

@PakAbhishek

Description

Consolidation worker leaks ~300 MiB/min when processing a large backlog of pending consolidations, eventually causing OOM on the host.

Environment

  • Image: ghcr.io/vectorize-io/hindsight:0.4.13-slim
  • VM: 8 GiB RAM (n2-standard-2, GCP)
  • Database: Cloud SQL PostgreSQL 18
  • LLM: Claude Opus 4.5 via LiteLLM proxy -> AWS Bedrock
  • Embeddings: litellm provider with bedrock/amazon.titan-embed-text-v2:0

Steps to Reproduce

  1. Have a bank with ~7,800 memory_units and ~6,600 pending consolidations
  2. Start Hindsight with consolidation workers active
  3. Monitor container memory usage over time
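For step 3, one way to record container memory over time is to poll `docker stats` and parse its `MemUsage` field. This is a minimal sketch, not the method used in this report: the container name `hindsight` and the sampling interval are assumptions, and it assumes the Docker CLI is on `PATH`.

```python
import re
import subprocess
import time

# Conversion factors to MiB for the binary units printed by `docker stats`.
_TO_MIB = {"B": 1 / 2**20, "KiB": 1 / 2**10, "MiB": 1.0, "GiB": 2.0**10, "TiB": 2.0**20}


def parse_mem_usage(mem_usage: str) -> float:
    """Return the 'used' half of a MemUsage field (e.g. '5.6GiB / 8GiB') in MiB."""
    used = mem_usage.split("/")[0].strip()
    match = re.fullmatch(r"([\d.]+)\s*(B|KiB|MiB|GiB|TiB)", used)
    if match is None:
        raise ValueError(f"unrecognized memory value: {used!r}")
    return float(match.group(1)) * _TO_MIB[match.group(2)]


def sample_container_mem(container: str) -> float:
    """Take one memory sample (MiB) for the given container."""
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", container],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_mem_usage(result.stdout.strip())


def monitor(container: str, interval_s: float = 60.0, samples: int = 5) -> list[float]:
    """Collect `samples` readings, `interval_s` seconds apart, printing each one."""
    readings = []
    for i in range(samples):
        mib = sample_container_mem(container)
        readings.append(mib)
        print(f"sample {i}: {mib:.0f} MiB")
        if i < samples - 1:
            time.sleep(interval_s)
    return readings

# Example (hypothetical container name): monitor("hindsight", interval_s=120, samples=3)
```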

Observed Behavior

Memory grows linearly at ~300 MiB/min regardless of concurrency settings:

Time       Container RAM   System Available
T+0        5.0 GiB         2.1 GiB
T+2 min    5.6 GiB         1.4 GiB
T+4 min    7.7 GiB         33 MiB (OOM)
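The growth rate per interval can be derived directly from the three samples in the table. The sketch below only restates that arithmetic; the function name is illustrative. With these numbers, the first interval works out to roughly 307 MiB/min and the second to roughly 1075 MiB/min.

```python
# (minutes since start, container RAM in GiB), copied from the table above.
SAMPLES = [(0, 5.0), (2, 5.6), (4, 7.7)]


def growth_rates_mib_per_min(samples):
    """Return the memory growth rate (MiB/min) between each pair of consecutive samples."""
    rates = []
    for (t0, gib0), (t1, gib1) in zip(samples, samples[1:]):
        rates.append((gib1 - gib0) * 1024 / (t1 - t0))
    return rates


for rate in growth_rates_mib_per_min(SAMPLES):
    print(f"{rate:.0f} MiB/min")
```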

Worker stats during the leak:

Tuning Attempted (No Effect)

Reduced all concurrency settings - leak rate unchanged:

Setting              Original   Tuned
LLM_MAX_CONCURRENT   16         4
DB_POOL_MAX_SIZE     50         10
DB_POOL_MIN_SIZE     10         2

Workaround

Set Docker memory limit so the container auto-restarts before crashing the host:

The container is killed at 6 GiB, Docker restarts it, and consolidation resumes from where it left off. This allows the backlog to be processed across multiple restart cycles while keeping the host responsive.
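The exact command is not shown above. As a rough illustration only, a 6 GiB cap with automatic restart can be expressed with the `--memory` and `--restart` flags of `docker run`. In this sketch the container name `hindsight` and the `always` restart policy are assumptions, and the environment variables, ports, and volumes a real deployment needs are omitted.

```python
import shlex


def docker_run_args(image: str, mem_limit: str = "6g",
                    restart: str = "always", name: str = "hindsight") -> list[str]:
    """Build a `docker run` argv that caps memory and restarts the container when it is killed."""
    return [
        "docker", "run", "-d",
        "--name", name,          # hypothetical container name
        "--memory", mem_limit,   # hard memory limit; the kernel OOM-kills the container above it
        "--restart", restart,    # assumed policy; Docker restarts the container after it exits
        image,
    ]


# Print the command for review rather than executing it.
print(shlex.join(docker_run_args("ghcr.io/vectorize-io/hindsight:0.4.13-slim")))
```

Building the argument list separately from running it keeps the limit and policy easy to review or adjust before anything is started.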

Expected Behavior

Memory should plateau during consolidation, not grow linearly. The consolidation batch should release memory after processing each chunk of memories.
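To check whether memory is retained between chunks, one option is to compare `tracemalloc` snapshots taken before and after a run of batches. This is a generic diagnostic sketch, not Hindsight code: `process_batch` is a hypothetical stand-in for whatever handles one chunk, and `tracemalloc` only sees allocations made through Python's allocators, so growth inside native libraries would not appear here.

```python
import tracemalloc


def top_growth(process_batch, batches, limit=5):
    """Run `process_batch` over `batches` and return the source lines whose
    retained allocations changed the most, largest change first."""
    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()
    for batch in batches:
        process_batch(batch)  # hypothetical per-chunk handler
    current = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to returns StatisticDiff entries sorted by absolute size difference.
    return current.compare_to(baseline, "lineno")[:limit]


def report(diffs):
    """Print one line per entry: net bytes retained and the allocating source line."""
    for diff in diffs:
        frame = diff.traceback[0]
        print(f"{diff.size_diff:+d} B  {frame.filename}:{frame.lineno}")
```

If memory is released after each chunk, the reported differences should stay small no matter how many batches run. A line whose `size_diff` scales with the batch count is a candidate for the leak.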

Additional Context

  • The large backlog was created after a one-time embedding dimension migration (384-dim to 1024-dim)
  • Normal operation (small retains/recalls) does not exhibit the leak - only sustained batch consolidation
  • The full (non-slim) image has a separate memory issue: it loads PyTorch CUDA (~5-6 GiB) even on CPU-only VMs, but that is a different problem from this leak
