
[Proposal] Multi-Store Vector DB Architecture, Per-KB Index Isolation & OpenSearch Support #913

@ochanism


Background

Scalability Limitations of Current Architecture

WeKnora's retrieval engine architecture works well for small-scale, single-backend environments, but structural limitations emerge when scaling to production.

Limitation 1: Registry is a Singleton per EngineType

The current RetrieveEngineRegistry is implemented as map[RetrieverEngineType]RetrieveEngineService. Attempting to register more than one engine of the same type results in a "repository type %s already registered" error.

// registry.go — current structure
type RetrieveEngineRegistry struct {
    repositories map[types.RetrieverEngineType]interfaces.RetrieveEngineService
}

func (r *RetrieveEngineRegistry) Register(repo interfaces.RetrieveEngineService) error {
    if _, exists := r.repositories[repo.EngineType()]; exists {
        return fmt.Errorf("repository type %s already registered", repo.EngineType())
    }
    // ...
}

This makes it impossible to run multiple instances of the same DB type. For example, you cannot operate two ES clusters for hot/warm tiers or regional separation. (Different DB types — e.g., one ES + one Qdrant — can coexist since they have different EngineTypes, but two instances of the same type cannot.)

Limitation 2: RETRIEVE_DRIVER is a Global Environment Variable

initRetrieveEngineRegistry() in container.go parses os.Getenv("RETRIEVE_DRIVER") once at startup, creating a single client. All Knowledge Bases are forced to share the same vector DB instance.

// container.go — current initialization flow
func initRetrieveEngineRegistry(db *gorm.DB, cfg *config.Config) {
    retrieveDriver := strings.Split(os.Getenv("RETRIEVE_DRIVER"), ",")
    // "elasticsearch_v8" → creates exactly one ES client
    // all KBs share this instance
}

Limitation 3: All KBs Share a Single Index/Collection

Even within the same vector DB instance, all KB data is stored in a single index (or collection). KB isolation relies solely on filtering by the knowledge_base_id field within documents.

// elasticsearch/v8/repository.go — all KB data in a single index
indexName := os.Getenv("ELASTICSEARCH_INDEX")  // default: "xwrag_default"
res := &elasticsearchRepository{client: client, index: indexName}
// → all KB vectors go into this single index

// filtering by knowledge_base_id at query time
must = append(must, types.Query{Terms: &types.TermsQuery{
    TermsQuery: map[string]types.TermsQueryField{
        "knowledge_base_id.keyword": params.KnowledgeBaseIDs,
    },
}})

Milvus similarly uses a single MILVUS_COLLECTION env var for the entire collection name (weknora_embeddings_{dimension}).

Problems with this approach:

  • No performance isolation: if KB-A holds 10M documents, a search over KB-B's 100 documents is still affected by the total index size (especially with brute-force search)
  • No operational flexibility: Cannot tune shard count, replicas, or HNSW parameters per KB
  • No security isolation: Index-level access control is impossible when all KB data resides in a single index
  • Index management complexity: A bloated single index means reindexing, snapshots, and recovery operations affect all KBs

Limitation 4: No Vector Store Binding on Knowledge Base

Retrieval engine configuration exists only at the Tenant level (Tenant.RetrieverEngines). There is no way to choose which vector DB instance or which index to use when creating a KB. Limitations 2 (single instance) and 3 (single index) combine to make flexible per-KB placement fundamentally impossible.

Elasticsearch Vector Search Performance Limitations

The current ES driver performs vector search using script_score + cosineSimilarity:

// elasticsearch/v8/repository.go
scoreSource := "cosineSimilarity(params.query_vector, 'embedding')"

This is a brute-force linear scan (O(N)). Search latency increases linearly with document count:

Documents    script_score (brute-force)    ANN (HNSW)
100K         hundreds of ms                < 10 ms
1M           seconds (SLA risk)            ~10 ms
10M          timeout                       20-50 ms

Additionally, index creation does not set explicit dense_vector mappings or HNSW parameters, relying on ES auto-mapping:

// elasticsearch/v8/repository.go — index created without mapping
_, err = e.client.Indices.Create(e.index).Do(ctx)
// → no dense_vector type or similarity settings
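
Even before switching engines, the driver could create the index with an explicit mapping instead of relying on auto-mapping. The sketch below shows what such a mapping body might look like in ES 8.x syntax; the field name "embedding", the dimension, and the HNSW parameters (m, ef_construction) are illustrative assumptions, not values taken from the current code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildMapping returns an explicit index mapping with a dense_vector
// field indexed for ANN search (ES 8.x index_options syntax). The
// field name, dimension, and HNSW parameters are illustrative.
func buildMapping(dims int) map[string]any {
	return map[string]any{
		"mappings": map[string]any{
			"properties": map[string]any{
				"embedding": map[string]any{
					"type":       "dense_vector",
					"dims":       dims,
					"index":      true,
					"similarity": "cosine",
					"index_options": map[string]any{
						"type":            "hnsw",
						"m":               16,
						"ef_construction": 100,
					},
				},
				// keyword field used for the knowledge_base_id filter
				"knowledge_base_id": map[string]any{"type": "keyword"},
			},
		},
	}
}

func main() {
	b, _ := json.MarshalIndent(buildMapping(1024), "", "  ")
	fmt.Println(string(b))
}
```

With an explicit dense_vector mapping in place, queries can use ES's native knn search path instead of script_score, which is where the latency gap in the table above comes from.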

Proposal Overview

Core Principle: 100% Backward-Compatible, Opt-in Extension

This proposal does not change any existing behavior.
The current approach — single DB, single index, all KBs sharing — continues to work exactly as-is.
New features (multi-store, index isolation, OpenSearch) are activated only when the user explicitly opts in.

This extends the architecture so that, when desired, multiple vector DB instances can be bound at the KB level, and OpenSearch k-NN native vector search is officially supported.

Key Values

  1. Multi-store scalability: Register multiple vector DB instances — same type or different types — and select per KB. Or keep using a single instance shared by all KBs, just like today.
  2. Index isolation: Use independent indices/collections per KB, even within the same DB instance. Or keep using a single shared index, just like today.
  3. Performance: OpenSearch k-NN (HNSW) native vector search improves complexity from O(N) → O(log N)
  4. Gradual transition: Non-breaking soft handover — connect only new KBs to a new store while existing KBs remain untouched

Multi-store and index isolation are independent yet complementary capabilities.

  • Multi-store: place KBs on different DB instances (cluster-level separation)
  • Index isolation: separate KB indices within the same DB instance (index-level separation)
  • Both are needed for flexible production operations.
  • Both are opt-in. If not configured, behavior is 100% identical to today.

Implementation Plan (5 Phases)

Phase Dependencies

Phase 1 (VectorStore + Registry) ─→ Phase 2 (KB binding) ─→ Phase 4 (Cross-store migration)
                                                           ↘ Phase 5 (Backward compat + docs)
Phase 3 (OpenSearch driver) — can proceed independently of Phase 1
  • Phase 1 is the core prerequisite. Phases 2, 4, 5 depend on Phase 1.
  • Phase 3 (OpenSearch driver) can be added to the existing Registry structure, so it can proceed in parallel with Phase 1.

Phase 1: VectorStore Entity + Registry Refactoring

Goal: Manage vector DB instances as first-class entities and extend the Registry to per-instance management.

Changes:

  • New vector_stores table

    vector_stores
    ├── id (PK)
    ├── name            — human-readable name (e.g., "es-hot", "opensearch-prod")
    ├── engine_type     — RetrieverEngineType (elasticsearch, opensearch, qdrant, ...)
    ├── connection_config (JSON) — connection info (addr, username, password, ...)
    ├── index_config (JSON)      — index settings (index_prefix, shards, HNSW params, ...)
    ├── index_strategy  — index isolation strategy: "shared" (current behavior) | "per_kb" (per-KB index)
    ├── is_default      — whether this is the default store
    ├── tenant_id (FK)
    └── created_at / updated_at
    
  • Index isolation strategy (index_strategy)

    • "shared" (default): Same as current behavior. All KB data in a single index, filtered by knowledge_base_id field
    • "per_kb": Automatically creates a separate index per KB (e.g., {index_prefix}_{kb_id})
      • Dedicated index created on KB creation, deleted on KB deletion
      • Independent mapping/HNSW parameters per KB
      • Smaller index size improves both brute-force search and ANN index build times
    • Existing deployments default to "shared", so no behavior change
  • Change Registry key from EngineType → StoreID

    // Current: map[RetrieverEngineType]RetrieveEngineService   — one per type
    // Proposed: map[StoreID]RetrieveEngineService              — one per instance
  • RETRIEVE_DRIVER environment variable backward compatibility

    • If the env var is set, automatically creates a "default VectorStore" record to preserve existing behavior
    • VectorStore table records take precedence over env vars when present

Affected files: registry.go, container.go, new types/vectorstore.go, migration
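
To make the key change concrete, here is a minimal sketch of a registry keyed by store ID. StoreID, MultiStoreRegistry, and the trimmed-down service interface are hypothetical names for illustration, not the final API:

```go
package main

import "fmt"

// StoreID identifies one vector DB instance; two stores of the same
// engine type get distinct IDs. Names here are illustrative.
type StoreID string

type RetrieveEngineService interface {
	EngineType() string
}

type esService struct{ addr string }

func (esService) EngineType() string { return "elasticsearch_v8" }

type MultiStoreRegistry struct {
	stores map[StoreID]RetrieveEngineService
}

func NewMultiStoreRegistry() *MultiStoreRegistry {
	return &MultiStoreRegistry{stores: make(map[StoreID]RetrieveEngineService)}
}

// Register accepts multiple instances of the same engine type, as
// long as their store IDs differ — the duplicate check moves from
// engine type to store ID.
func (r *MultiStoreRegistry) Register(id StoreID, svc RetrieveEngineService) error {
	if _, exists := r.stores[id]; exists {
		return fmt.Errorf("vector store %q already registered", id)
	}
	r.stores[id] = svc
	return nil
}

func (r *MultiStoreRegistry) Get(id StoreID) (RetrieveEngineService, bool) {
	svc, ok := r.stores[id]
	return svc, ok
}

func main() {
	reg := NewMultiStoreRegistry()
	// Two ES clusters of the same engine type can now coexist.
	fmt.Println(reg.Register("es-hot", esService{addr: "http://es-hot:9200"}))
	fmt.Println(reg.Register("es-warm", esService{addr: "http://es-warm:9200"}))
}
```

This directly lifts Limitation 1: the hot/warm scenario that the current per-EngineType map rejects becomes two ordinary registrations.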

Phase 2: KB ↔ VectorStore Binding

Goal: Enable selecting which vector DB to use when creating a KB.

Changes:

  • Add vector_store_id (FK, nullable) to the KnowledgeBase model
  • Add vector_store_id parameter to KB create/update APIs
  • When vector_store_id is not specified:
    1. Use the Tenant's default VectorStore
    2. Fall back to the RETRIEVE_DRIVER-based global default
  • Modify CompositeRetrieveEngine creation to look up the engine from the KB's bound store
  • Index name resolution logic:
    • If store's index_strategy is "per_kb": use {index_prefix}_{kb_id} format for a dedicated KB index
    • If store's index_strategy is "shared": use {index_name} single index + knowledge_base_id filter (current behavior)

Affected files: knowledgebase.go, composite.go, KB CRUD handler/service, index resolution logic in each repository, migration
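
The index name resolution described above can be sketched as a single pure function. The function name and signature are illustrative; strategy values mirror the proposal ("shared" | "per_kb"):

```go
package main

import "fmt"

// resolveIndexName sketches the proposed per-KB index resolution.
// An unset strategy falls through to "shared", preserving current
// behavior.
func resolveIndexName(strategy, indexPrefix, sharedIndex, kbID string) string {
	if strategy == "per_kb" {
		// Dedicated index per KB: {index_prefix}_{kb_id}
		return fmt.Sprintf("%s_%s", indexPrefix, kbID)
	}
	// "shared" (or unset): single index, KB isolation via the
	// knowledge_base_id filter as today.
	return sharedIndex
}

func main() {
	fmt.Println(resolveIndexName("per_kb", "weknora", "xwrag_default", "kb_123")) // weknora_kb_123
	fmt.Println(resolveIndexName("shared", "weknora", "xwrag_default", "kb_123")) // xwrag_default
}
```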

Phase 3: OpenSearch Driver Implementation

Goal: Native vector search driver using the OpenSearch k-NN plugin.

Changes:

  • Add "opensearch" to RetrieverEngineType
  • Add OpenSearch entry to retrieverEngineMapping
  • New internal/application/repository/retriever/opensearch/ package:
    • Explicit knn_vector mapping + HNSW parameters on index creation
    • Engine selection: Lucene (< 10M docs), Faiss HNSW (≥ 10M docs)
    • k-NN native query (knn DSL) — ANN, not brute-force
    • Hybrid search: k-NN vector + BM25 keyword combination
  • Implements the existing RetrieveEngineRepository interface — same contract as other drivers
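
As a sketch of the explicit mapping plus engine selection the driver would perform, the function below builds an OpenSearch index body with index.knn enabled and a knn_vector field. The 10M-document engine cutoff follows the bullet above; the field name, space type, and HNSW parameters are illustrative assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// knnIndexBody builds an OpenSearch k-NN index body: index.knn
// enabled plus a knn_vector field with HNSW method settings.
// Parameters here are illustrative, not tuned recommendations.
func knnIndexBody(dims int, expectedDocs int64) map[string]any {
	engine := "lucene" // smaller corpora
	if expectedDocs >= 10_000_000 {
		engine = "faiss" // large corpora
	}
	return map[string]any{
		"settings": map[string]any{
			"index": map[string]any{"knn": true},
		},
		"mappings": map[string]any{
			"properties": map[string]any{
				"embedding": map[string]any{
					"type":      "knn_vector",
					"dimension": dims,
					"method": map[string]any{
						"name":       "hnsw",
						"space_type": "l2",
						"engine":     engine,
						"parameters": map[string]any{
							"m":               16,
							"ef_construction": 128,
						},
					},
				},
			},
		},
	}
}

func main() {
	b, _ := json.Marshal(knnIndexBody(1024, 500_000))
	fmt.Println(string(b))
}
```

At query time the driver would then issue the k-NN plugin's native knn query DSL against this field rather than a script_score scan.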

Expected performance:

  • 100K docs: hundreds of ms → < 10 ms (10x+)
  • 1M docs: seconds → ~10 ms (100x+)

Phase 4: Cross-store Migration API

Goal: Zero-downtime migration of existing KBs to a different vector DB.

Changes:

  • Extend existing CopyIndices to support cross-store operations
    • Read vector data from source store → write to target store
    • Direct vector copy without re-computing embeddings (cost savings)
  • Migration progress tracking API
  • Soft handover workflow:
    1. Create new VectorStore + new KB (bound to the new store)
    2. Migrate existing KB data to the new KB
    3. Deactivate the old KB after verification
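
The copy step can be sketched as a batch loop over two stores. VectorRecord, VectorReader, and VectorWriter are hypothetical interfaces standing in for whatever the extended CopyIndices would use; the point illustrated is that embeddings are copied verbatim, never re-computed:

```go
package main

import "fmt"

// Illustrative types; not the real CopyIndices API.
type VectorRecord struct {
	ID        string
	KBID      string
	Embedding []float32
}

type VectorReader interface {
	ReadBatch(offset, limit int) ([]VectorRecord, error)
}

type VectorWriter interface {
	WriteBatch(records []VectorRecord) error
}

// migrateKB streams records from src to dst in batches and returns
// the number of records copied. Vectors are moved as-is: no
// re-embedding cost.
func migrateKB(src VectorReader, dst VectorWriter, batchSize int) (int, error) {
	total := 0
	for offset := 0; ; offset += batchSize {
		batch, err := src.ReadBatch(offset, batchSize)
		if err != nil {
			return total, err
		}
		if len(batch) == 0 {
			return total, nil // source exhausted
		}
		if err := dst.WriteBatch(batch); err != nil {
			return total, err
		}
		total += len(batch)
	}
}

// In-memory stand-ins so the sketch runs end to end.
type memStore struct{ records []VectorRecord }

func (m *memStore) ReadBatch(offset, limit int) ([]VectorRecord, error) {
	if offset >= len(m.records) {
		return nil, nil
	}
	end := offset + limit
	if end > len(m.records) {
		end = len(m.records)
	}
	return m.records[offset:end], nil
}

func (m *memStore) WriteBatch(records []VectorRecord) error {
	m.records = append(m.records, records...)
	return nil
}

func main() {
	src := &memStore{}
	for i := 0; i < 250; i++ {
		src.records = append(src.records, VectorRecord{KBID: "kb_1"})
	}
	dst := &memStore{}
	n, err := migrateKB(src, dst, 100)
	fmt.Println(n, err) // 250 <nil>
}
```

The returned count doubles as the data source for the progress-tracking API mentioned above.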

Phase 5: Backward Compatibility + Documentation

Goal: Guarantee zero-downtime upgrades for existing deployments. Zero impact on existing users who change nothing.

Changes:

  • Automatic fallback chain on upgrade:
    1. If VectorStore table is empty → operate identically using RETRIEVE_DRIVER env var
    2. If VectorStore records exist but KB has vector_store_id = NULL → fall back to Tenant default store
    3. If index_strategy is not set → "shared" (single index, current behavior)
  • All existing env vars (RETRIEVE_DRIVER, ELASTICSEARCH_INDEX, MILVUS_COLLECTION, etc.) fully preserved
  • No breaking changes: env-var-only deployments work as before. All new tables/fields are nullable or have defaults
  • Migration guide: transitioning from current → multi-store, current → per_kb index
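
The fallback chain reduces to a few ordered checks. In this sketch the three arguments stand in for the KB's vector_store_id, the tenant's default store, and the RETRIEVE_DRIVER-derived global default; the function name is illustrative:

```go
package main

import "fmt"

// resolveStore sketches the upgrade fallback chain: KB binding wins,
// then the tenant default, then the env-var-era global default.
func resolveStore(kbStoreID, tenantDefault, globalDefault string) string {
	if kbStoreID != "" {
		return kbStoreID // KB explicitly bound to a store
	}
	if tenantDefault != "" {
		return tenantDefault // tenant-level default VectorStore
	}
	return globalDefault // RETRIEVE_DRIVER behavior, unchanged
}

func main() {
	// An untouched deployment resolves exactly as today.
	fmt.Println(resolveStore("", "", "retrieve-driver-default"))
}
```

Because every check falls through to the next, a deployment that configures nothing new resolves to the same single store it uses today.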

Backward Compatibility Strategy

The core principle of this proposal is not to break any existing deployment. All new features are opt-in. If nothing is configured, behavior is 100% identical to today.

DB Instance Level

Scenario                                               Behavior
Only RETRIEVE_DRIVER set, VectorStore table empty      100% identical to current: env-var-based single client
VectorStore records exist, KB has no vector_store_id   Falls back to Tenant default store → global default store
KB has vector_store_id set                             Uses that specific store for search/indexing

Index/Collection Level

Scenario                             Behavior
index_strategy not set or "shared"   100% identical to current: single index for all KBs, knowledge_base_id field filtering
index_strategy = "per_kb"            Independent index auto-created per KB ({prefix}_{kb_id}), linked to KB create/delete

Deployment Scenarios

Scenario 1: Existing user (changes nothing)
  RETRIEVE_DRIVER=elasticsearch_v8
  → VectorStore table is empty
  → Same single ES client, single index (xwrag_default)
  → Changes: none. Just upgrade the code and everything works identically.

Scenario 2: Same DB, only want index isolation
  RETRIEVE_DRIVER=elasticsearch_v8
  → Create 1 VectorStore (index_strategy="per_kb")
  → Dedicated index auto-created per KB
  → DB instance stays the same, only indices are separated per KB

Scenario 3-a: Multiple instances of the same DB type
  → VectorStores: ES-hot (recent docs), ES-warm (archive)
  → Frequently searched KBs → ES-hot, archive KBs → ES-warm
  → Same ES type but separate clusters, separate hardware

Scenario 3-b: Mixed DB types
  → VectorStores: ES-legacy (existing), OpenSearch-prod (new)
  → Existing KBs stay on ES as-is, new KBs go to OpenSearch
  → Each store can have its own independent index_strategy

Scenario 4: Gradual transition
  → Existing KBs remain on existing store + shared index
  → Only new KBs created on new store + per_kb index
  → Migrate existing KBs using the migration API when ready

No breaking changes in any scenario. Unless the user explicitly creates a VectorStore or changes index_strategy, everything works identically to today.


Why This Is Valuable for Upstream

  1. Real production needs: Our team hit these limitations in a production environment managing millions of documents. Other production users are likely facing the same issues.

  2. Architecture improvement: This is not just a feature addition — it naturally extends the existing design. It respects the existing interfaces (RetrieveEngineService, RetrieveEngineRegistry) while extending them.

  3. OpenSearch ecosystem: Immediately valuable for users of managed services like AWS OpenSearch Service. k-NN native vector search offers dramatically better performance compared to ES script_score.

  4. Incremental adoption: Even merging just Phase 1 provides the foundation for multi-store architecture, and the OpenSearch driver can be added independently. There is no need to merge all phases at once.
