Skip to content

feat(retriever): add OpenSearch driver skeleton with gated interface stubs#1481

Open
ochanism wants to merge 1 commit into
Tencent:mainfrom
ochanism:enhancement/1208-2a
Open

feat(retriever): add OpenSearch driver skeleton with gated interface stubs#1481
ochanism wants to merge 1 commit into
Tencent:mainfrom
ochanism:enhancement/1208-2a

Conversation

@ochanism
Copy link
Copy Markdown
Contributor

Summary

Lay down the OpenSearch k-NN driver package at internal/application/repository/retriever/opensearch/ as a hollow, interface-compliant shell, with a strict feature gate carried over from PR 1: every behavioural method (Save / BatchSave / Retrieve / DeleteBy*, plus the previously-stubbed CopyIndices / BatchUpdate* / swapToVersion) returns ErrFeatureNotEnabled, and EstimateStorageSize returns a conservative HNSW lower-bound. No registry / factory / env path mentions this driver — attempting to register an engine_type=opensearch VectorStore continues to fail with the existing "not a valid engine type" error inherited from PR 1's validEngineTypes gate, and RETRIEVE_DRIVER=opensearch remains a silent no-op.

The skeleton ships the connect-and-survey pieces of the driver (NewRepository constructor, TLS-aware HTTP transport, k-NN HNSW + content-analyzer index mapping, lazy per-dimension index creation, sentinel error taxonomy) and the structural test coverage that pins those pieces. PR 2b lands the real read/write implementations (query.go + retrieve.go + crud.go) by replacing the ErrFeatureNotEnabled stubs in this PR's stubs.go with production code; PR 3 ships the activation switch + async paths + i18n + dev profile.

Two test fixes are folded in to keep the run-everything CI signal green on top of PR 1-merged main — see the Test fixes folded in section below for the cover letter on each.

Context

Part of #1440 — Phase 3 (OpenSearch k-NN Native Vector Search Driver) of the multi-store roadmap.

Phase 3 PR sequence (this is PR 2a of 3, after the original PR 2 was split into 2a + 2b for reviewer-load balancing):

# PR Status
1 type prep (gated, no activation) MERGED (#1445)
2a this PR — driver skeleton + interface stubs (gated dead code) here
2b driver read/write implementations (gated dead code) planned, depends on this PR
3 activation switch + async ops + frontend renderer + polish planned

Depends on Phase 1 (#921, MERGED), Phase 2 (#993, all 5 PRs MERGED: #994, #1310, #1372, #1386, #1427), and Phase 3 PR 1 (#1445, MERGED). Every prerequisite is on main. PR 2b and PR 3 land on top of this branch in sequence; the OpenSearch gate flips to "on" only when PR 3 merges.

Changes at a glance

New package internal/application/repository/retriever/opensearch/ — 6 production files + 2 test files + 1 dependency. ~1170 LoC production + ~1115 LoC tests, all gated dead code on merge.

  • repository.goRepository struct + NewRepository(ctx, client, storeID, indexCfg) (repo, error) constructor that validates cluster reachability + OS version (2.4+ / 3.x; primary tested 3.3.2) + opensearch-knn plugin presence on every cluster node before returning. sync.Once-guarded ensureReady(ctx, dim) lazily creates the per-dimension index on first use; transient errors (ErrTransport, ErrCircuitBreaker) are explicitly not cached so a momentary cluster blip does not permanently poison a dim. sanitizeIndexName enforces a strict OS-compatible name spec (^[a-z0-9][a-z0-9_-]{0,254}$) and rejects wildcards / control chars / leading punctuation. probeVersion uses strings.Split + strconv.Atoi for robust semver parsing across pre-release suffixes (3.0.0-rc1) and missing-patch forms (2.10); supported-version matrix is OS 2.4+ accept (with a warn-log for 2.42.10 below LTS), OS 1.x / 2.02.3 / Elasticsearch reject. EngineType() returns the constant added in PR 1; Support() returns [keywords, vector].

  • transport.gonewOpenSearchClient(cfg *ConnectionConfig) (*osapi.Client, error) ships the TLS posture (MinVersion: TLS 1.2, opt-in InsecureSkipVerify reading from the field added in PR 1, forward-secrecy-only cipher list excluding TLS_RSA_*) and the transport tuning (per-host idle pool, idle / response timeouts) for the driver. The only callers will be PR 3's container.go + engine_factory.go; PR 2a remains gated dead code.

  • mapping.gobuildIndexMapping(cfg, dim) produces the full knn_vector + HNSW + content-analyzer mapping with every *_id field as an explicit keyword (avoiding the ES v8 .keyword suffix auto-detection dance) and source_type as integer (stable across SourceType enum extension). buildKeywordsMapping ships the dim-less keyword-only index mapping for the no-embedding save path. createIndexAndAlias creates <alias>_v1 and aliases <alias> to it, with best-effort orphan cleanup on aliasPut failure and mapping-drift detection on resource_already_exists_exception.

  • config.gointernalCfg (value type, not pointer) applying OpenSearch defaults that match the entry shipped by PR 3 (hnsw_m=16, ef_construction=100, ef_search=100, shards=4, replicas=1, engine=lucene).

  • errors.go — nine sentinels: ErrIndexNotFound, ErrDimensionMismatch, ErrAuth, ErrTransport, ErrVersionUnsupported, ErrConfigInvalid, ErrFeatureNotEnabled, ErrBatchTooLarge, ErrCircuitBreaker. The repository never imports internal/errors; PR 3's engine factory wraps these sentinels to typed AppError 2200 / 2201.

  • stubs.go — every behavioural method returns ErrFeatureNotEnabled. EstimateStorageSize returns a conservative HNSW lower-bound estimate (N * (1024 + 4*768 + 128) bytes per chunk, the 768 being a default-conservative dim guess) so the Phase 2 KB-delete guard fails closed for non-empty KBs rather than treating them as empty.

  • repository_test.go (~970 LoC) — 50 cases covering interface satisfaction, sentinel mapping, sanitizeIndexName positive/negative matrix, semver parsing (pre-release / missing-patch), buildIndexMapping JSON shape pin (Lucene + Faiss + Keywords mapping), probeVersion matrix (OS 1.x / 2.2 / 2.5 / 2.11 / 3.x / 3.0.0-rc1 / ES rejection), probeKNNPlugin multi-node coverage, ensureReady concurrency + per-dim isolation + transient retry, NewRepository storeID validation, all 11 stubs (CopyIndices / BatchUpdate* / EstimateStorageSize / swapToVersion + Save / BatchSave / Retrieve / DeleteBy*), wrapTransport sentinel mapping + cluster-side-Reason leak guard, isNotFound / isAlreadyExistsError, drainAndClose / limitedDecode helpers.

  • transport_test.go (~137 LoC) — TLS defaults (verify on, no InsecureSkipVerify), opt-in InsecureSkipVerify (only via the explicit config field), TLS 1.2 minimum pinning, cipher list excludes TLS_RSA_*, transport tuning fields.

  • go.mod / go.sum — single dependency addition: github.com/opensearch-project/opensearch-go/v4 v4.6.0. The SDK speaks to both OS 2.x and 3.x clusters with the same wire protocol for the 10 APIs this driver uses (see #1440's cross-version compatibility table). Transitive deps pull in only go-querystring; no AWS SDK (we don't use AWS Sigv4 transport).

Test fixes folded in

Two deterministic regressions surfaced on a full go test ./... against PR 1-merged main. Both are unrelated to the driver code above but block a clean run-everything signal, so they ride along here.

(1) Follow-up to #1445 — fanout test missed the new normalizer policy (internal/application/service/knowledgebase_search_fanout_test.go, +46 / −6).

PR 1 changed EngineAwareNormalizer for Elasticsearch / ElasticFaiss / OpenSearch / Weaviate / Postgres / SQLite / Qdrant / TencentVectorDB / Doris from (score + 1) / 2 (cosine-shift) to clamp01(score) (passthrough). The grounds were that those engines surface non-negative cosine to the normalizer (Lucene script_score non-negative invariant for ES; k-NN plugin SpaceType.COSINESIMIL.scoreTranslation for OpenSearch; engine-internal conversions or IR-normalized embeddings for the rest). Milvus is now the only engine that still surfaces raw signed cosine in [-1, 1].

TestRetrieveFromStores_MixedEngine_Normalizes was authored before #1445 and still asserted the old cosine-shift behaviour for ES (raw -0.4 → expected 0.3). Under the new passthrough policy that input now maps to clamp01(-0.4) = 0, breaking the test. PR 1's normalizer_test.go was updated at amend time, but this fan-out integration test in the service/ package was missed.

The fix: rewrite the godoc preamble to spell out which engines passthrough vs cosine-shift; restate sub-case 2 as ES passthrough on a production-possible mid-range cosine (0.3 → 0.3, PG out-ranks ES — preserving the original ranking property); add sub-case 3 pinning the cosine-shift branch via Milvus -0.4 → 0.3 so the [-1, 1] arm stays under integration-test coverage.

(2) Pre-existing — SSRF whitelist singleton race surfaced by this run (internal/utils/security.go + internal/utils/security_test.go + internal/infrastructure/web_search/searxng_test.go, +33 / −9 across 3 files).

loadSSRFWhitelist in internal/utils/security.go caches its parse of SSRF_WHITELIST via sync.Once on first call. Tests inside internal/utils/ reset the singleton between cases via resetSSRFWhitelistForTest, but that helper was unexported, so tests in other packages could not reset and ended up seeing whatever whitelist was cached by the first sync.Once.Do() within the same test binary.

The concrete failure: in internal/infrastructure/web_search/, the test binary contains both proxy_test.go and searxng_test.go. By alphabetical order TestValidateProxyURL runs before TestValidateSearxngBaseURL. TestValidateProxyURL exercises ValidateURLForSSRF with no SSRF_WHITELIST set, which causes loadSSRFWhitelist to cache an empty whitelist for the rest of the process. The later TestValidateSearxngBaseURL/loopback_ok_via_whitelist sub-case then sets SSRF_WHITELIST=127.0.0.1,localhost and expects 127.0.0.1 to be whitelisted — but the singleton is already cached empty, so ValidateURLForSSRF still rejects 127.0.0.1 with \"hostname 127.0.0.1 is restricted\" and the test fails. Running the searxng tests in isolation (-run TestValidateSearxngBaseURL) hides the bug because proxy_test never runs to seed the singleton.

This is pre-existing on main; surfaced now because this branch is the first to do a full go test ./... run on top of #1445.

The fix: capitalize the helper to ResetSSRFWhitelistForTest (the ForTest suffix is the test-only contract); update the in-package callers in utils/security_test.go to the new name; in web_search/searxng_test.go, import internal/utils and call utils.ResetSSRFWhitelistForTest() around the env mutation in both TestValidateSearxngBaseURL and TestSearxngProvider_Search (reset on both entry and exit so the singleton state and SSRF_WHITELIST env stay in sync for subsequent tests in the same binary). No production code path changes — the reset helper is still test-only by contract, and no caller outside *_test.go invokes it.

SDK quirks observed against opensearch-go v4.6.0

Two of the three quirks discovered during full implementation surface in PR 2a's code (the third, Refresh: *bool, only affects the delete path that ships in PR 2b):

  • client.Indices.Alias.Exists returns plain *errors.errorString on non-2xx. The SDK passes dataPointer=nil to its internal do(), so 404 responses come back as \"status: 404 Not Found\" rather than as the typed *opensearch.StructError that the rest of this package's error classification expects. aliasExists inspects resp.StatusCode directly (resp is returned even when err is non-nil) and only falls back to wrapTransport for the "no response at all" case. This is the one intentional exception to the package's "typed errors only, no string matching" policy — the status code is read off *http.Response, not the error string.

  • sync.OnceReset is not in the standard library. Honouring the transient-vs-permanent error policy is straightforward for the per-dimension path because sync.Once instances live in a map[int]*sync.Once we can mutate (delete the entry to reset). The keyword-only fallback path (<base>_keywords, no per-dim variant) has only one such instance, so it uses a sync.Mutex + keywordsReady bool + keywordsErr error flag pair instead of a bare sync.Once. The transient retry semantics are identical.

Backward compatibility

  • New package — additive only. No existing file modified except go.mod / go.sum, the two test files in the Test fixes folded in section, and the test-only export rename in utils/security.go.
  • Driver is unreachable: no registry / factory / container / env path activates it. Attempting engine_type=opensearch VectorStore creation continues to fail with the "not a valid engine type" error from PR 1's validEngineTypes gate. RETRIEVE_DRIVER=opensearch boot remains a silent no-op (the env-store builder has no opensearch arm). PR 1's TestOpenSearchRetrieverEngineType_NotInValidEngineTypes is the standing regression guard for the gated invariant.
  • No migrations. The k-NN index is created lazily on first KB use after PR 3 ships, not on this PR's merge.
  • The PR 1 normalizer case for OpenSearch remains unreachable in this PR (no driver instance produces a result yet).
  • Single dependency added (opensearch-go/v4 v4.6.0); govulncheck ./... recommended as a pre-merge gate.

Test plan

  • go build ./... — clean
  • go vet ./... — clean
  • gofmt -l on touched files — clean
  • go test -race -count=1 ./internal/application/repository/retriever/opensearch/... — passes (50 cases, ~1115 LoC of test code)
  • go test ./... on top of PR 1-merged main — the only remaining failure is TestOssEnsureBucket_CreateFails in internal/application/service/file/, which is pre-existing on main and fails identically against upstream/main before this PR (depends on Aliyun OSS endpoint reachability + invalid-credentials response shape; this PR touches zero files in that package). The two test fixes folded in restore green on service/ (fanout) and web_search/ (SSRF singleton).
  • Gated invariant (static): validEngineTypes map in internal/types/vectorstore.go has no OpenSearchRetrieverEngineType entry, so (*VectorStore).Validate() returns \"unsupported engine type: opensearch\" for any registration attempt.
  • Gated invariant (static): grep -r \"case types.OpenSearchRetrieverEngineType\" internal/ shows only PR 1's normalizer case + this driver's own EngineType() and tests — no activation path.
  • Gated invariant (static): grep -r \"case \\\"opensearch\\\"\" internal/ shows no hits in container.go / engine_factory.go / BuildEnvVectorStores.

… 2a of 3)

First half of the gated OpenSearch k-NN driver introduced in PR 1
(Tencent#1445) by way of Tencent#1440. PR 2a ships a hollow, interface-compliant
shell of the `internal/application/repository/retriever/opensearch/`
package — every behavioural method (Save / BatchSave / DeleteBy* /
Retrieve, plus the previously-stubbed CopyIndices / BatchUpdate* /
EstimateStorageSize / swapToVersion) returns `ErrFeatureNotEnabled`
or a conservative sentinel value. PR 2b lands the real read/write
implementations in dedicated files (`query.go` + `retrieve.go` +
`crud.go`) and replaces the stubs accordingly.

Strict feature-gate (unchanged from PR 1): no entry is added to
validEngineTypes / GetVectorStoreTypes / retrieverEngineMapping /
BuildEnvVectorStores / container env path / engine factory switch,
so the driver remains unreachable. Attempting to register an
`engine_type=opensearch` VectorStore continues to fail with the
existing "not a valid engine type" error.

What lands in PR 2a
-------------------

Driver skeleton (6 production files + 2 test files, ~1170 + ~1115 LoC):

- `repository.go` — Repository struct + NewRepository constructor
  that validates cluster reachability + OS version (2.4+ / 3.x;
  primary tested 3.3.2) + k-NN plugin presence on every cluster
  node. sync.Once-guarded ensureReady(ctx, dim) for lazy per-
  dimension index creation, with transient errors not cached so a
  momentary cluster blip does not permanently poison a dim.
  sanitizeIndexName enforces a strict OS-compatible name spec.
  probeVersion uses robust strings.Split/Atoi parsing for
  pre-release suffixes and missing-patch versions. EngineType
  returns the PR 1 constant; Support returns [keywords, vector].
- `transport.go` — newOpenSearchClient ships TLS posture
  (MinVersion TLS 1.2, opt-in InsecureSkipVerify, forward-secrecy-
  only cipher list) and transport tuning for the driver. Caller
  exists only in PR 3 (container.go + engine_factory.go); PR 2a
  remains gated dead code.
- `mapping.go` — buildIndexMapping(cfg, dim) produces the full
  knn_vector + HNSW + content-analyzer mapping with every *_id
  field as an explicit keyword and source_type as integer.
  buildKeywordsMapping ships the dim-less keyword-only index
  mapping used by the no-embedding save path. createIndexAndAlias
  creates <alias>_v1 and aliases <alias> to it, with best-effort
  orphan cleanup and mapping-drift detection.
- `config.go` — internalCfg (value type) applying OpenSearch
  defaults (hnsw_m=16, ef_construction=100, ef_search=100,
  shards=4, replicas=1, engine=lucene).
- `errors.go` — nine sentinels (ErrIndexNotFound,
  ErrDimensionMismatch, ErrAuth, ErrTransport,
  ErrVersionUnsupported, ErrConfigInvalid, ErrFeatureNotEnabled,
  ErrBatchTooLarge, ErrCircuitBreaker). Repository never imports
  apperrors; PR 3's engine factory wraps these to typed AppError
  2200/2201.
- `stubs.go` — every behavioural method returns
  ErrFeatureNotEnabled. EstimateStorageSize returns a conservative
  HNSW lower-bound estimate (not 0) so the Phase 2 KB-delete guard
  fails-closed for non-empty KBs.

Tests (~1115 LoC, 50 cases):

- `repository_test.go` — interface satisfaction, sentinel mapping,
  sanitizeIndexName positive/negative matrix, semver parsing
  (pre-release / missing-patch), buildIndexMapping JSON shape pin
  (Lucene + Faiss + Keywords), probeVersion matrix (OS 1.x / 2.2 /
  2.5 / 2.11 / 3.x / 3.0.0-rc1 / ES rejection), probeKNNPlugin
  multi-node coverage, ensureReady concurrency + per-dim isolation
  + transient retry, NewRepository storeID validation, all 11
  stubs (CopyIndices, BatchUpdate*, EstimateStorageSize,
  SwapToVersion + Save / BatchSave / Retrieve / DeleteBy*),
  wrapTransport sentinel mapping + leak guard, isNotFound /
  isAlreadyExistsError, drainAndClose / limitedDecode helpers.
- `transport_test.go` — TLS defaults / opt-in InsecureSkipVerify /
  TLS 1.2 pinning / cipher list / transport tuning.

Single dependency addition: github.com/opensearch-project/
opensearch-go/v4 v4.6.0 in go.mod/go.sum.

SDK quirks discovered (opensearch-go v4.6.0)
--------------------------------------------

PR 2a includes the workarounds for two of three SDK limitations
that landed during full implementation (the third, Refresh:*bool,
only affects the delete path that ships in PR 2b):

- AliasExists method passes dataPointer=nil to its internal do(),
  which means non-2xx responses come back as a plain
  *errors.errorString ("status: 404 Not Found") rather than as
  *opensearch.StructError. aliasExists therefore inspects
  resp.StatusCode directly (resp is returned even when err is
  non-nil) and only falls back to wrapTransport for the "no
  response at all" case.
- sync.OnceReset is not in the standard library; the keyword-only
  index uses a mutex + ready/err flag pattern so transient failures
  can be retried by the next caller. The per-dimension path uses
  the `once map[int]*sync.Once` delete-and-recreate trick.

Test fixes folded in
--------------------

While doing a full `go test ./...` against PR 1-merged main, two
deterministic regressions surfaced that block a clean run-everything
signal. Both are unrelated to the driver and are folded into PR 2a
so the PR's own CI run is green:

(1) Follow-up to Tencent#1445 — fanout test missed the new normalizer policy
    (internal/application/service/knowledgebase_search_fanout_test.go,
    +46 / -6). Tencent#1445 changed EngineAwareNormalizer for ES /
    ElasticFaiss / OpenSearch / Weaviate / Postgres / SQLite /
    Qdrant / TencentVectorDB / Doris from (score+1)/2 to clamp01
    passthrough (those engines surface non-negative cosine to the
    normalizer per Lucene script_score non-negative invariant for
    ES, k-NN plugin SpaceType.COSINESIMIL.scoreTranslation for
    OpenSearch, engine-internal or IR-normalized conversions for
    the rest). Milvus is now the only engine that still surfaces
    raw signed cosine in [-1, 1].

    TestRetrieveFromStores_MixedEngine_Normalizes still asserted
    the old cosine-shift behaviour for ES (raw -0.4 → expected 0.3)
    which under passthrough now becomes clamp01(-0.4) = 0. The
    normalizer's own _test.go was updated at Tencent#1445 time, but this
    fan-out integration test was not.

    Fix: rewrite the godoc to spell out the two engine groups;
    restate sub-case 2 as ES passthrough on a production-possible
    mid-range cosine (0.3 → 0.3, PG out-ranks ES); add sub-case 3
    pinning the cosine-shift branch via Milvus -0.4 → 0.3.

(2) Pre-existing — SSRF whitelist singleton race surfaced by this run
    (internal/utils/security.go + internal/utils/security_test.go +
    internal/infrastructure/web_search/searxng_test.go,
    +33 / -9). loadSSRFWhitelist in internal/utils/security.go is
    cached via sync.Once on first call. The internal reset helper
    resetSSRFWhitelistForTest was unexported, so tests in other
    packages could not reset and saw whatever whitelist was cached
    by the first sync.Once.Do() in the same test binary. In
    internal/infrastructure/web_search/, TestValidateProxyURL runs
    before TestValidateSearxngBaseURL alphabetically and exercises
    ValidateURLForSSRF with no SSRF_WHITELIST set, caching an empty
    whitelist; the later setenv in searxng_test then has no effect
    and 127.0.0.1 is rejected with "hostname 127.0.0.1 is restricted".
    Pre-existing on main; surfaced now because this PR was the
    first to do a full `go test ./...` run on top of Tencent#1445.

    Fix: capitalize the helper to ResetSSRFWhitelistForTest (the
    ForTest suffix is the test-only contract); update in-package
    callers; in web_search/searxng_test.go import internal/utils
    and call ResetSSRFWhitelistForTest around the env mutation in
    both TestValidateSearxngBaseURL and TestSearxngProvider_Search.
    No production code path changes.

Roadmap
-------

- PR 2b (next, depends on this PR) — read/write implementations:
  query.go + retrieve.go + crud.go land their real bodies; stubs
  for Save / BatchSave / DeleteBy* / Retrieve in stubs.go are
  removed; corresponding CRUD/retrieve/filter test cases (~430
  LoC) join repository_test.go.
- PR 3 — activation switch + async paths (CopyIndices,
  BatchUpdate*, large-batch async deletes) + i18n + docker-compose
  dev profile. After PR 3 merges, the OpenSearch driver becomes
  reachable via either `engine_type=opensearch` VectorStore or
  `RETRIEVE_DRIVER=opensearch` env.

Backward compatibility
----------------------

- New package — additive only. No existing file modified except
  go.mod / go.sum, the two test files in (1)/(2), and the
  test-only export rename in utils/security.go.
- Driver is unreachable: no registry path activates it.
- No SQL migration.
- The PR 1 normalizer case for OpenSearch remains unreachable
  here (no driver instance produces a result yet).

Test plan
---------

- [x] go build ./... clean
- [x] go vet ./... clean
- [x] go test -race -count=1 ./internal/application/repository/retriever/opensearch/... passes
- [x] grep -r "case types.OpenSearchRetrieverEngineType" internal/
      shows only PR 1's normalizer case + this driver's EngineType()
      and tests — no activation path.
- [x] grep -r "case \"opensearch\"" internal/ shows no hits.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant