
[Proposal] Multi-Store Vector DB Architecture, Per-KB Index Isolation & OpenSearch Support #913

@ochanism


Background

Scalability Limitations of Current Architecture

WeKnora's retrieval engine architecture works well for small-scale, single-backend environments, but structural limitations emerge when scaling to production.

Limitation 1: Registry is a Singleton per EngineType

The current RetrieveEngineRegistry is implemented as map[RetrieverEngineType]RetrieveEngineService. Attempting to register more than one engine of the same type results in a "repository type %s already registered" error.

// registry.go — current structure
type RetrieveEngineRegistry struct {
    repositories map[types.RetrieverEngineType]interfaces.RetrieveEngineService
}

func (r *RetrieveEngineRegistry) Register(repo interfaces.RetrieveEngineService) error {
    if _, exists := r.repositories[repo.EngineType()]; exists {
        return fmt.Errorf("repository type %s already registered", repo.EngineType())
    }
    // ...
}

This makes it impossible to run multiple instances of the same DB type. For example, you cannot operate two ES clusters for hot/warm tiers or regional separation. (Different DB types — e.g., one ES + one Qdrant — can coexist since they have different EngineTypes, but two instances of the same type cannot.)

Limitation 2: RETRIEVE_DRIVER is a Global Environment Variable

initRetrieveEngineRegistry() in container.go parses os.Getenv("RETRIEVE_DRIVER") once at startup, creating a single client. All Knowledge Bases are forced to share the same vector DB instance.

// container.go — current initialization flow
func initRetrieveEngineRegistry(db *gorm.DB, cfg *config.Config) {
    retrieveDriver := strings.Split(os.Getenv("RETRIEVE_DRIVER"), ",")
    // "elasticsearch_v8" → creates exactly one ES client
    // all KBs share this instance
}

Limitation 3: All KBs Share a Single Index/Collection

Even within the same vector DB instance, all KB data is stored in a single index (or collection). KB isolation relies solely on filtering by the knowledge_base_id field within documents.

// elasticsearch/v8/repository.go — all KB data in a single index
indexName := os.Getenv("ELASTICSEARCH_INDEX")  // default: "xwrag_default"
res := &elasticsearchRepository{client: client, index: indexName}
// → all KB vectors go into this single index

// filtering by knowledge_base_id at query time
must = append(must, types.Query{Terms: &types.TermsQuery{
    TermsQuery: map[string]types.TermsQueryField{
        "knowledge_base_id.keyword": params.KnowledgeBaseIDs,
    },
}})

Milvus similarly uses a single MILVUS_COLLECTION env var for the entire collection name (weknora_embeddings_{dimension}).

Problems with this approach:

  • No performance isolation: if KB-A holds 10M documents, a search over KB-B's 100 documents is still affected by the total index size (especially with brute-force search)
  • No operational flexibility: Cannot tune shard count, replicas, or HNSW parameters per KB
  • No security isolation: Index-level access control is impossible when all KB data resides in a single index
  • Index management complexity: A bloated single index means reindexing, snapshots, and recovery operations affect all KBs

Limitation 4: No Vector Store Binding on Knowledge Base

Retrieval engine configuration exists only at the Tenant level (Tenant.RetrieverEngines). There is no way to choose which vector DB instance or which index to use when creating a KB. Limitations 2 (single instance) and 3 (single index) combine to make flexible per-KB placement fundamentally impossible.

Elasticsearch Vector Search Performance Limitations

The current ES driver performs vector search using script_score + cosineSimilarity:

// elasticsearch/v8/repository.go
scoreSource := "cosineSimilarity(params.query_vector, 'embedding')"

This is a brute-force linear scan (O(N)). Search latency increases linearly with document count:

Documents    script_score (brute-force)    ANN (HNSW)
100K         hundreds of ms                < 10 ms
1M           seconds (SLA risk)            ~10 ms
10M          timeout                       20-50 ms

Additionally, index creation does not set explicit dense_vector mappings or HNSW parameters, relying on ES auto-mapping:

// elasticsearch/v8/repository.go — index created without mapping
_, err = e.client.Indices.Create(e.index).Do(ctx)
// → no dense_vector type or similarity settings
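
Even before switching engines, the driver could create the index with an explicit mapping instead of relying on auto-mapping. The sketch below shows what such a mapping body might look like in ES 8.x syntax; the field name "embedding", the dimension, and the HNSW parameters (m, ef_construction) are illustrative assumptions, not values taken from the current code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildMapping returns an explicit index mapping with a dense_vector
// field indexed for ANN search (ES 8.x index_options syntax). The
// field name, dimension, and HNSW parameters are illustrative.
func buildMapping(dims int) map[string]any {
	return map[string]any{
		"mappings": map[string]any{
			"properties": map[string]any{
				"embedding": map[string]any{
					"type":       "dense_vector",
					"dims":       dims,
					"index":      true,
					"similarity": "cosine",
					"index_options": map[string]any{
						"type":            "hnsw",
						"m":               16,
						"ef_construction": 100,
					},
				},
				// keyword field used for the knowledge_base_id filter
				"knowledge_base_id": map[string]any{"type": "keyword"},
			},
		},
	}
}

func main() {
	b, _ := json.MarshalIndent(buildMapping(1024), "", "  ")
	fmt.Println(string(b))
}
```

With an explicit dense_vector mapping in place, queries can use ES's native knn search path instead of script_score, which is where the latency gap in the table above comes from.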

Proposal Overview

Core Principle: 100% Backward-Compatible, Opt-in Extension

This proposal does not change any existing behavior.
The current approach — single DB, single index, all KBs sharing — continues to work exactly as-is.
New features (multi-store, index isolation, OpenSearch) are activated only when the user explicitly opts in.

This extends the architecture so that, when desired, multiple vector DB instances can be bound at the KB level, and OpenSearch k-NN native vector search is officially supported.

Key Values

  1. Multi-store scalability: Register multiple vector DB instances — same type or different types — and select per KB. Or keep using a single instance shared by all KBs, just like today.
  2. Index isolation: Use independent indices/collections per KB, even within the same DB instance. Or keep using a single shared index, just like today.
  3. Performance: OpenSearch k-NN (HNSW) native vector search improves complexity from O(N) → O(log N)
  4. Gradual transition: Non-breaking soft handover — connect only new KBs to a new store while existing KBs remain untouched

Multi-store and index isolation are independent yet complementary capabilities.

  • Multi-store: place KBs on different DB instances (cluster-level separation)
  • Index isolation: separate KB indices within the same DB instance (index-level separation)
  • Both are needed for flexible production operations.
  • Both are opt-in. If not configured, behavior is 100% identical to today.

Implementation Plan (5 Phases)

Phase Dependencies

Phase 1 (VectorStore + Registry) ─→ Phase 2 (KB binding) ─→ Phase 4 (Cross-store migration)
                                                           ↘ Phase 5 (Backward compat + docs)
Phase 3 (OpenSearch driver) — can proceed independently of Phase 1
  • Phase 1 is the core prerequisite. Phases 2, 4, 5 depend on Phase 1.
  • Phase 3 (OpenSearch driver) can be added to the existing Registry structure, so it can proceed in parallel with Phase 1.

Phase 1: VectorStore Entity + Registry Refactoring

Goal: Manage vector DB instances as first-class entities and extend the Registry to per-instance management.

Changes:

  • New vector_stores table

    vector_stores
    ├── id (PK)
    ├── name            — human-readable name (e.g., "es-hot", "opensearch-prod")
    ├── engine_type     — RetrieverEngineType (elasticsearch, opensearch, qdrant, ...)
    ├── connection_config (JSON) — connection info (addr, username, password, ...)
    ├── index_config (JSON)      — index settings (index_prefix, shards, HNSW params, ...)
    ├── index_strategy  — index isolation strategy: "shared" (current behavior) | "per_kb" (per-KB index)
    ├── is_default      — whether this is the default store
    ├── tenant_id (FK)
    └── created_at / updated_at
    
  • Index isolation strategy (index_strategy)

    • "shared" (default): Same as current behavior. All KB data in a single index, filtered by knowledge_base_id field
    • "per_kb": Automatically creates a separate index per KB (e.g., {index_prefix}_{kb_id})
      • Dedicated index created on KB creation, deleted on KB deletion
      • Independent mapping/HNSW parameters per KB
      • Smaller index size improves both brute-force search and ANN index build times
    • Existing deployments default to "shared", so no behavior change
  • Change Registry key from EngineType → StoreID

    // Current: map[RetrieverEngineType]RetrieveEngineService   — one per type
    // Proposed: map[StoreID]RetrieveEngineService              — one per instance
  • RETRIEVE_DRIVER environment variable backward compatibility

    • If the env var is set, automatically creates a "default VectorStore" record to preserve existing behavior
    • VectorStore table records take precedence over env vars when present

Affected files: registry.go, container.go, new types/vectorstore.go, migration
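
To make the key change concrete, here is a minimal sketch of a registry keyed by store ID. StoreID, MultiStoreRegistry, and the trimmed-down service interface are hypothetical names for illustration, not the final API:

```go
package main

import "fmt"

// StoreID identifies one vector DB instance; two stores of the same
// engine type get distinct IDs. Names here are illustrative.
type StoreID string

type RetrieveEngineService interface {
	EngineType() string
}

type esService struct{ addr string }

func (esService) EngineType() string { return "elasticsearch_v8" }

type MultiStoreRegistry struct {
	stores map[StoreID]RetrieveEngineService
}

func NewMultiStoreRegistry() *MultiStoreRegistry {
	return &MultiStoreRegistry{stores: make(map[StoreID]RetrieveEngineService)}
}

// Register accepts multiple instances of the same engine type, as
// long as their store IDs differ — the duplicate check moves from
// engine type to store ID.
func (r *MultiStoreRegistry) Register(id StoreID, svc RetrieveEngineService) error {
	if _, exists := r.stores[id]; exists {
		return fmt.Errorf("vector store %q already registered", id)
	}
	r.stores[id] = svc
	return nil
}

func (r *MultiStoreRegistry) Get(id StoreID) (RetrieveEngineService, bool) {
	svc, ok := r.stores[id]
	return svc, ok
}

func main() {
	reg := NewMultiStoreRegistry()
	// Two ES clusters of the same engine type can now coexist.
	fmt.Println(reg.Register("es-hot", esService{addr: "http://es-hot:9200"}))
	fmt.Println(reg.Register("es-warm", esService{addr: "http://es-warm:9200"}))
}
```

This directly lifts Limitation 1: the hot/warm scenario that the current per-EngineType map rejects becomes two ordinary registrations.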

Phase 2: KB ↔ VectorStore Binding

Goal: Enable selecting which vector DB to use when creating a KB.

Changes:

  • Add vector_store_id (FK, nullable) to the KnowledgeBase model
  • Add vector_store_id parameter to KB create/update APIs
  • When vector_store_id is not specified:
    1. Use the Tenant's default VectorStore
    2. Fall back to the RETRIEVE_DRIVER-based global default
  • Modify CompositeRetrieveEngine creation to look up the engine from the KB's bound store
  • Index name resolution logic:
    • If store's index_strategy is "per_kb": use {index_prefix}_{kb_id} format for a dedicated KB index
    • If store's index_strategy is "shared": use {index_name} single index + knowledge_base_id filter (current behavior)

Affected files: knowledgebase.go, composite.go, KB CRUD handler/service, index resolution logic in each repository, migration
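
The index name resolution described above can be sketched as a single pure function. The function name and signature are illustrative; strategy values mirror the proposal ("shared" | "per_kb"):

```go
package main

import "fmt"

// resolveIndexName sketches the proposed per-KB index resolution.
// An unset strategy falls through to "shared", preserving current
// behavior.
func resolveIndexName(strategy, indexPrefix, sharedIndex, kbID string) string {
	if strategy == "per_kb" {
		// Dedicated index per KB: {index_prefix}_{kb_id}
		return fmt.Sprintf("%s_%s", indexPrefix, kbID)
	}
	// "shared" (or unset): single index, KB isolation via the
	// knowledge_base_id filter as today.
	return sharedIndex
}

func main() {
	fmt.Println(resolveIndexName("per_kb", "weknora", "xwrag_default", "kb_123")) // weknora_kb_123
	fmt.Println(resolveIndexName("shared", "weknora", "xwrag_default", "kb_123")) // xwrag_default
}
```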

Phase 3: OpenSearch Driver Implementation

Goal: Native vector search driver using the OpenSearch k-NN plugin.

Changes:

  • Add "opensearch" to RetrieverEngineType
  • Add OpenSearch entry to retrieverEngineMapping
  • New internal/application/repository/retriever/opensearch/ package:
    • Explicit knn_vector mapping + HNSW parameters on index creation
    • Engine selection: Lucene (< 10M docs), Faiss HNSW (≥ 10M docs)
    • k-NN native query (knn DSL) — ANN, not brute-force
    • Hybrid search: k-NN vector + BM25 keyword combination
  • Implements the existing RetrieveEngineRepository interface — same contract as other drivers
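
As a sketch of the explicit mapping plus engine selection the driver would perform, the function below builds an OpenSearch index body with index.knn enabled and a knn_vector field. The 10M-document engine cutoff follows the bullet above; the field name, space type, and HNSW parameters are illustrative assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// knnIndexBody builds an OpenSearch k-NN index body: index.knn
// enabled plus a knn_vector field with HNSW method settings.
// Parameters here are illustrative, not tuned recommendations.
func knnIndexBody(dims int, expectedDocs int64) map[string]any {
	engine := "lucene" // smaller corpora
	if expectedDocs >= 10_000_000 {
		engine = "faiss" // large corpora
	}
	return map[string]any{
		"settings": map[string]any{
			"index": map[string]any{"knn": true},
		},
		"mappings": map[string]any{
			"properties": map[string]any{
				"embedding": map[string]any{
					"type":      "knn_vector",
					"dimension": dims,
					"method": map[string]any{
						"name":       "hnsw",
						"space_type": "l2",
						"engine":     engine,
						"parameters": map[string]any{
							"m":               16,
							"ef_construction": 128,
						},
					},
				},
			},
		},
	}
}

func main() {
	b, _ := json.Marshal(knnIndexBody(1024, 500_000))
	fmt.Println(string(b))
}
```

At query time the driver would then issue the k-NN plugin's native knn query DSL against this field rather than a script_score scan.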

Expected performance:

  • 100K docs: hundreds of ms → < 10 ms (10x+)
  • 1M docs: seconds → ~10 ms (100x+)

Phase 4: Cross-store Migration API

Goal: Zero-downtime migration of existing KBs to a different vector DB.

Changes:

  • Extend existing CopyIndices to support cross-store operations
    • Read vector data from source store → write to target store
    • Direct vector copy without re-computing embeddings (cost savings)
  • Migration progress tracking API
  • Soft handover workflow:
    1. Create new VectorStore + new KB (bound to the new store)
    2. Migrate existing KB data to the new KB
    3. Deactivate the old KB after verification
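
The copy step can be sketched as a batch loop over two stores. VectorRecord, VectorReader, and VectorWriter are hypothetical interfaces standing in for whatever the extended CopyIndices would use; the point illustrated is that embeddings are copied verbatim, never re-computed:

```go
package main

import "fmt"

// Illustrative types; not the real CopyIndices API.
type VectorRecord struct {
	ID        string
	KBID      string
	Embedding []float32
}

type VectorReader interface {
	ReadBatch(offset, limit int) ([]VectorRecord, error)
}

type VectorWriter interface {
	WriteBatch(records []VectorRecord) error
}

// migrateKB streams records from src to dst in batches and returns
// the number of records copied. Vectors are moved as-is: no
// re-embedding cost.
func migrateKB(src VectorReader, dst VectorWriter, batchSize int) (int, error) {
	total := 0
	for offset := 0; ; offset += batchSize {
		batch, err := src.ReadBatch(offset, batchSize)
		if err != nil {
			return total, err
		}
		if len(batch) == 0 {
			return total, nil // source exhausted
		}
		if err := dst.WriteBatch(batch); err != nil {
			return total, err
		}
		total += len(batch)
	}
}

// In-memory stand-ins so the sketch runs end to end.
type memStore struct{ records []VectorRecord }

func (m *memStore) ReadBatch(offset, limit int) ([]VectorRecord, error) {
	if offset >= len(m.records) {
		return nil, nil
	}
	end := offset + limit
	if end > len(m.records) {
		end = len(m.records)
	}
	return m.records[offset:end], nil
}

func (m *memStore) WriteBatch(records []VectorRecord) error {
	m.records = append(m.records, records...)
	return nil
}

func main() {
	src := &memStore{}
	for i := 0; i < 250; i++ {
		src.records = append(src.records, VectorRecord{KBID: "kb_1"})
	}
	dst := &memStore{}
	n, err := migrateKB(src, dst, 100)
	fmt.Println(n, err) // 250 <nil>
}
```

The returned count doubles as the data source for the progress-tracking API mentioned above.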

Phase 5: Backward Compatibility + Documentation

Goal: Guarantee zero-downtime upgrades for existing deployments. Zero impact on existing users who change nothing.

Changes:

  • Automatic fallback chain on upgrade:
    1. If VectorStore table is empty → operate identically using RETRIEVE_DRIVER env var
    2. If VectorStore records exist but KB has vector_store_id = NULL → fall back to Tenant default store
    3. If index_strategy is not set → "shared" (single index, current behavior)
  • All existing env vars (RETRIEVE_DRIVER, ELASTICSEARCH_INDEX, MILVUS_COLLECTION, etc.) fully preserved
  • No breaking changes: env-var-only deployments work as before. All new tables/fields are nullable or have defaults
  • Migration guide: transitioning from current → multi-store, current → per_kb index
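
The fallback chain reduces to a few ordered checks. In this sketch the three arguments stand in for the KB's vector_store_id, the tenant's default store, and the RETRIEVE_DRIVER-derived global default; the function name is illustrative:

```go
package main

import "fmt"

// resolveStore sketches the upgrade fallback chain: KB binding wins,
// then the tenant default, then the env-var-era global default.
func resolveStore(kbStoreID, tenantDefault, globalDefault string) string {
	if kbStoreID != "" {
		return kbStoreID // KB explicitly bound to a store
	}
	if tenantDefault != "" {
		return tenantDefault // tenant-level default VectorStore
	}
	return globalDefault // RETRIEVE_DRIVER behavior, unchanged
}

func main() {
	// An untouched deployment resolves exactly as today.
	fmt.Println(resolveStore("", "", "retrieve-driver-default"))
}
```

Because every check falls through to the next, a deployment that configures nothing new resolves to the same single store it uses today.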

Backward Compatibility Strategy

The core principle of this proposal is not to break any existing deployment. All new features are opt-in. If nothing is configured, behavior is 100% identical to today.

DB Instance Level

Scenario                                               Behavior
Only RETRIEVE_DRIVER set, VectorStore table empty      100% identical to current: env-var-based single client
VectorStore records exist, KB has no vector_store_id   Falls back to Tenant default store → global default store
KB has vector_store_id set                             Uses that specific store for search/indexing

Index/Collection Level

Scenario                             Behavior
index_strategy not set or "shared"   100% identical to current: single index for all KBs, knowledge_base_id field filtering
index_strategy = "per_kb"            Independent index auto-created per KB ({prefix}_{kb_id}), linked to KB create/delete

Deployment Scenarios

Scenario 1: Existing user (changes nothing)
  RETRIEVE_DRIVER=elasticsearch_v8
  → VectorStore table is empty
  → Same single ES client, single index (xwrag_default)
  → Changes: none. Just upgrade the code and everything works identically.

Scenario 2: Same DB, only want index isolation
  RETRIEVE_DRIVER=elasticsearch_v8
  → Create 1 VectorStore (index_strategy="per_kb")
  → Dedicated index auto-created per KB
  → DB instance stays the same, only indices are separated per KB

Scenario 3-a: Multiple instances of the same DB type
  → VectorStores: ES-hot (recent docs), ES-warm (archive)
  → Frequently searched KBs → ES-hot, archive KBs → ES-warm
  → Same ES type but separate clusters, separate hardware

Scenario 3-b: Mixed DB types
  → VectorStores: ES-legacy (existing), OpenSearch-prod (new)
  → Existing KBs stay on ES as-is, new KBs go to OpenSearch
  → Each store can have its own independent index_strategy

Scenario 4: Gradual transition
  → Existing KBs remain on existing store + shared index
  → Only new KBs created on new store + per_kb index
  → Migrate existing KBs using the migration API when ready

No breaking changes in any scenario. Unless the user explicitly creates a VectorStore or changes index_strategy, everything works identically to today.


Why This Is Valuable for Upstream

  1. Real production needs: Our team hit these limitations in a production environment managing millions of documents. Other production users are likely facing the same issues.

  2. Architecture improvement: This is not just a feature addition — it naturally extends the existing design. It respects the existing interfaces (RetrieveEngineService, RetrieveEngineRegistry) while extending them.

  3. OpenSearch ecosystem: Immediately valuable for users of managed services like AWS OpenSearch Service. k-NN native vector search offers dramatically better performance compared to ES script_score.

  4. Incremental adoption: Even merging just Phase 1 provides the foundation for multi-store architecture, and the OpenSearch driver can be added independently. There is no need to merge all phases at once.
