feat(retriever): add OpenSearch driver skeleton with gated interface stubs#1481
Open
ochanism wants to merge 1 commit into
Open
feat(retriever): add OpenSearch driver skeleton with gated interface stubs#1481ochanism wants to merge 1 commit into
ochanism wants to merge 1 commit into
Conversation
… 2a of 3) First half of the gated OpenSearch k-NN driver introduced in PR 1 (Tencent#1445) by way of Tencent#1440. PR 2a ships a hollow, interface-compliant shell of the `internal/application/repository/retriever/opensearch/` package — every behavioural method (Save / BatchSave / DeleteBy* / Retrieve, plus the previously-stubbed CopyIndices / BatchUpdate* / EstimateStorageSize / swapToVersion) returns `ErrFeatureNotEnabled` or a conservative sentinel value. PR 2b lands the real read/write implementations in dedicated files (`query.go` + `retrieve.go` + `crud.go`) and replaces the stubs accordingly. Strict feature-gate (unchanged from PR 1): no entry is added to validEngineTypes / GetVectorStoreTypes / retrieverEngineMapping / BuildEnvVectorStores / container env path / engine factory switch, so the driver remains unreachable. Attempting to register an `engine_type=opensearch` VectorStore continues to fail with the existing "not a valid engine type" error. What lands in PR 2a ------------------- Driver skeleton (6 production files + 2 test files, ~1170 + ~1115 LoC): - `repository.go` — Repository struct + NewRepository constructor that validates cluster reachability + OS version (2.4+ / 3.x; primary tested 3.3.2) + k-NN plugin presence on every cluster node. sync.Once-guarded ensureReady(ctx, dim) for lazy per- dimension index creation, with transient errors not cached so a momentary cluster blip does not permanently poison a dim. sanitizeIndexName enforces a strict OS-compatible name spec. probeVersion uses robust strings.Split/Atoi parsing for pre-release suffixes and missing-patch versions. EngineType returns the PR 1 constant; Support returns [keywords, vector]. - `transport.go` — newOpenSearchClient ships TLS posture (MinVersion TLS 1.2, opt-in InsecureSkipVerify, forward-secrecy- only cipher list) and transport tuning for the driver. Caller exists only in PR 3 (container.go + engine_factory.go); PR 2a remains gated dead code. - `mapping.go` — buildIndexMapping(cfg, dim) produces the full knn_vector + HNSW + content-analyzer mapping with every *_id field as an explicit keyword and source_type as integer. buildKeywordsMapping ships the dim-less keyword-only index mapping used by the no-embedding save path. createIndexAndAlias creates <alias>_v1 and aliases <alias> to it, with best-effort orphan cleanup and mapping-drift detection. - `config.go` — internalCfg (value type) applying OpenSearch defaults (hnsw_m=16, ef_construction=100, ef_search=100, shards=4, replicas=1, engine=lucene). - `errors.go` — nine sentinels (ErrIndexNotFound, ErrDimensionMismatch, ErrAuth, ErrTransport, ErrVersionUnsupported, ErrConfigInvalid, ErrFeatureNotEnabled, ErrBatchTooLarge, ErrCircuitBreaker). Repository never imports apperrors; PR 3's engine factory wraps these to typed AppError 2200/2201. - `stubs.go` — every behavioural method returns ErrFeatureNotEnabled. EstimateStorageSize returns a conservative HNSW lower-bound estimate (not 0) so the Phase 2 KB-delete guard fails-closed for non-empty KBs. Tests (~1115 LoC, 50 cases): - `repository_test.go` — interface satisfaction, sentinel mapping, sanitizeIndexName positive/negative matrix, semver parsing (pre-release / missing-patch), buildIndexMapping JSON shape pin (Lucene + Faiss + Keywords), probeVersion matrix (OS 1.x / 2.2 / 2.5 / 2.11 / 3.x / 3.0.0-rc1 / ES rejection), probeKNNPlugin multi-node coverage, ensureReady concurrency + per-dim isolation + transient retry, NewRepository storeID validation, all 11 stubs (CopyIndices, BatchUpdate*, EstimateStorageSize, SwapToVersion + Save / BatchSave / Retrieve / DeleteBy*), wrapTransport sentinel mapping + leak guard, isNotFound / isAlreadyExistsError, drainAndClose / limitedDecode helpers. - `transport_test.go` — TLS defaults / opt-in InsecureSkipVerify / TLS 1.2 pinning / cipher list / transport tuning. Single dependency addition: github.com/opensearch-project/ opensearch-go/v4 v4.6.0 in go.mod/go.sum. SDK quirks discovered (opensearch-go v4.6.0) -------------------------------------------- PR 2a includes the workarounds for two of three SDK limitations that landed during full implementation (the third, Refresh:*bool, only affects the delete path that ships in PR 2b): - AliasExists method passes dataPointer=nil to its internal do(), which means non-2xx responses come back as a plain *errors.errorString ("status: 404 Not Found") rather than as *opensearch.StructError. aliasExists therefore inspects resp.StatusCode directly (resp is returned even when err is non-nil) and only falls back to wrapTransport for the "no response at all" case. - sync.OnceReset is not in the standard library; the keyword-only index uses a mutex + ready/err flag pattern so transient failures can be retried by the next caller. The per-dimension path uses the `once map[int]*sync.Once` delete-and-recreate trick. Test fixes folded in -------------------- While doing a full `go test ./...` against PR 1-merged main, two deterministic regressions surfaced that block a clean run-everything signal. Both are unrelated to the driver and are folded into PR 2a so the PR's own CI run is green: (1) Follow-up to Tencent#1445 — fanout test missed the new normalizer policy (internal/application/service/knowledgebase_search_fanout_test.go, +46 / -6). Tencent#1445 changed EngineAwareNormalizer for ES / ElasticFaiss / OpenSearch / Weaviate / Postgres / SQLite / Qdrant / TencentVectorDB / Doris from (score+1)/2 to clamp01 passthrough (those engines surface non-negative cosine to the normalizer per Lucene script_score non-negative invariant for ES, k-NN plugin SpaceType.COSINESIMIL.scoreTranslation for OpenSearch, engine-internal or IR-normalized conversions for the rest). Milvus is now the only engine that still surfaces raw signed cosine in [-1, 1]. TestRetrieveFromStores_MixedEngine_Normalizes still asserted the old cosine-shift behaviour for ES (raw -0.4 → expected 0.3) which under passthrough now becomes clamp01(-0.4) = 0. The normalizer's own _test.go was updated at Tencent#1445 time, but this fan-out integration test was not. Fix: rewrite the godoc to spell out the two engine groups; restate sub-case 2 as ES passthrough on a production-possible mid-range cosine (0.3 → 0.3, PG out-ranks ES); add sub-case 3 pinning the cosine-shift branch via Milvus -0.4 → 0.3. (2) Pre-existing — SSRF whitelist singleton race surfaced by this run (internal/utils/security.go + internal/utils/security_test.go + internal/infrastructure/web_search/searxng_test.go, +33 / -9). loadSSRFWhitelist in internal/utils/security.go is cached via sync.Once on first call. The internal reset helper resetSSRFWhitelistForTest was unexported, so tests in other packages could not reset and saw whatever whitelist was cached by the first sync.Once.Do() in the same test binary. In internal/infrastructure/web_search/, TestValidateProxyURL runs before TestValidateSearxngBaseURL alphabetically and exercises ValidateURLForSSRF with no SSRF_WHITELIST set, caching an empty whitelist; the later setenv in searxng_test then has no effect and 127.0.0.1 is rejected with "hostname 127.0.0.1 is restricted". Pre-existing on main; surfaced now because this PR was the first to do a full `go test ./...` run on top of Tencent#1445. Fix: capitalize the helper to ResetSSRFWhitelistForTest (the ForTest suffix is the test-only contract); update in-package callers; in web_search/searxng_test.go import internal/utils and call ResetSSRFWhitelistForTest around the env mutation in both TestValidateSearxngBaseURL and TestSearxngProvider_Search. No production code path changes. Roadmap ------- - PR 2b (next, depends on this PR) — read/write implementations: query.go + retrieve.go + crud.go land their real bodies; stubs for Save / BatchSave / DeleteBy* / Retrieve in stubs.go are removed; corresponding CRUD/retrieve/filter test cases (~430 LoC) join repository_test.go. - PR 3 — activation switch + async paths (CopyIndices, BatchUpdate*, large-batch async deletes) + i18n + docker-compose dev profile. After PR 3 merges, the OpenSearch driver becomes reachable via either `engine_type=opensearch` VectorStore or `RETRIEVE_DRIVER=opensearch` env. Backward compatibility ---------------------- - New package — additive only. No existing file modified except go.mod / go.sum, the two test files in (1)/(2), and the test-only export rename in utils/security.go. - Driver is unreachable: no registry path activates it. - No SQL migration. - The PR 1 normalizer case for OpenSearch remains unreachable here (no driver instance produces a result yet). Test plan --------- - [x] go build ./... clean - [x] go vet ./... clean - [x] go test -race -count=1 ./internal/application/repository/retriever/opensearch/... passes - [x] grep -r "case types.OpenSearchRetrieverEngineType" internal/ shows only PR 1's normalizer case + this driver's EngineType() and tests — no activation path. - [x] grep -r "case \"opensearch\"" internal/ shows no hits.
7225233 to
2075500
Compare
8 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Lay down the OpenSearch k-NN driver package at
internal/application/repository/retriever/opensearch/as a hollow, interface-compliant shell, with a strict feature gate carried over from PR 1: every behavioural method (Save/BatchSave/Retrieve/DeleteBy*, plus the previously-stubbedCopyIndices/BatchUpdate*/swapToVersion) returnsErrFeatureNotEnabled, andEstimateStorageSizereturns a conservative HNSW lower-bound. No registry / factory / env path mentions this driver — attempting to register anengine_type=opensearchVectorStore continues to fail with the existing "not a valid engine type" error inherited from PR 1'svalidEngineTypesgate, andRETRIEVE_DRIVER=opensearchremains a silent no-op.The skeleton ships the connect-and-survey pieces of the driver (
NewRepositoryconstructor, TLS-aware HTTP transport, k-NN HNSW + content-analyzer index mapping, lazy per-dimension index creation, sentinel error taxonomy) and the structural test coverage that pins those pieces. PR 2b lands the real read/write implementations (query.go+retrieve.go+crud.go) by replacing theErrFeatureNotEnabledstubs in this PR'sstubs.gowith production code; PR 3 ships the activation switch + async paths + i18n + dev profile.Two test fixes are folded in to keep the run-everything CI signal green on top of PR 1-merged main — see the Test fixes folded in section below for the cover letter on each.
Context
Part of #1440 — Phase 3 (OpenSearch k-NN Native Vector Search Driver) of the multi-store roadmap.
Phase 3 PR sequence (this is PR 2a of 3, after the original PR 2 was split into 2a + 2b for reviewer-load balancing):
Depends on Phase 1 (#921, MERGED), Phase 2 (#993, all 5 PRs MERGED: #994, #1310, #1372, #1386, #1427), and Phase 3 PR 1 (#1445, MERGED). Every prerequisite is on
main. PR 2b and PR 3 land on top of this branch in sequence; the OpenSearch gate flips to "on" only when PR 3 merges.Changes at a glance
New package
internal/application/repository/retriever/opensearch/— 6 production files + 2 test files + 1 dependency. ~1170 LoC production + ~1115 LoC tests, all gated dead code on merge.repository.go—Repositorystruct +NewRepository(ctx, client, storeID, indexCfg) (repo, error)constructor that validates cluster reachability + OS version (2.4+ / 3.x; primary tested 3.3.2) + opensearch-knn plugin presence on every cluster node before returning.sync.Once-guardedensureReady(ctx, dim)lazily creates the per-dimension index on first use; transient errors (ErrTransport,ErrCircuitBreaker) are explicitly not cached so a momentary cluster blip does not permanently poison a dim.sanitizeIndexNameenforces a strict OS-compatible name spec (^[a-z0-9][a-z0-9_-]{0,254}$) and rejects wildcards / control chars / leading punctuation.probeVersionusesstrings.Split+strconv.Atoifor robust semver parsing across pre-release suffixes (3.0.0-rc1) and missing-patch forms (2.10); supported-version matrix is OS 2.4+ accept (with a warn-log for 2.42.10 below LTS), OS 1.x / 2.02.3 / Elasticsearch reject.EngineType()returns the constant added in PR 1;Support()returns[keywords, vector].transport.go—newOpenSearchClient(cfg *ConnectionConfig) (*osapi.Client, error)ships the TLS posture (MinVersion: TLS 1.2, opt-inInsecureSkipVerifyreading from the field added in PR 1, forward-secrecy-only cipher list excludingTLS_RSA_*) and the transport tuning (per-host idle pool, idle / response timeouts) for the driver. The only callers will be PR 3'scontainer.go+engine_factory.go; PR 2a remains gated dead code.mapping.go—buildIndexMapping(cfg, dim)produces the fullknn_vector+ HNSW + content-analyzer mapping with every*_idfield as an explicitkeyword(avoiding the ES v8.keywordsuffix auto-detection dance) andsource_typeasinteger(stable acrossSourceTypeenum extension).buildKeywordsMappingships the dim-less keyword-only index mapping for the no-embedding save path.createIndexAndAliascreates<alias>_v1and aliases<alias>to it, with best-effort orphan cleanup onaliasPutfailure and mapping-drift detection onresource_already_exists_exception.config.go—internalCfg(value type, not pointer) applying OpenSearch defaults that match the entry shipped by PR 3 (hnsw_m=16,ef_construction=100,ef_search=100,shards=4,replicas=1,engine=lucene).errors.go— nine sentinels:ErrIndexNotFound,ErrDimensionMismatch,ErrAuth,ErrTransport,ErrVersionUnsupported,ErrConfigInvalid,ErrFeatureNotEnabled,ErrBatchTooLarge,ErrCircuitBreaker. The repository never importsinternal/errors; PR 3's engine factory wraps these sentinels to typedAppError 2200 / 2201.stubs.go— every behavioural method returnsErrFeatureNotEnabled.EstimateStorageSizereturns a conservative HNSW lower-bound estimate (N * (1024 + 4*768 + 128)bytes per chunk, the 768 being a default-conservative dim guess) so the Phase 2 KB-delete guard fails closed for non-empty KBs rather than treating them as empty.repository_test.go(~970 LoC) — 50 cases covering interface satisfaction, sentinel mapping,sanitizeIndexNamepositive/negative matrix, semver parsing (pre-release / missing-patch),buildIndexMappingJSON shape pin (Lucene + Faiss + Keywords mapping),probeVersionmatrix (OS 1.x / 2.2 / 2.5 / 2.11 / 3.x / 3.0.0-rc1 / ES rejection),probeKNNPluginmulti-node coverage,ensureReadyconcurrency + per-dim isolation + transient retry,NewRepositorystoreID validation, all 11 stubs (CopyIndices/BatchUpdate*/EstimateStorageSize/swapToVersion+Save/BatchSave/Retrieve/DeleteBy*),wrapTransportsentinel mapping + cluster-side-Reason leak guard,isNotFound/isAlreadyExistsError,drainAndClose/limitedDecodehelpers.transport_test.go(~137 LoC) — TLS defaults (verify on, noInsecureSkipVerify), opt-inInsecureSkipVerify(only via the explicit config field), TLS 1.2 minimum pinning, cipher list excludesTLS_RSA_*, transport tuning fields.go.mod/go.sum— single dependency addition:github.com/opensearch-project/opensearch-go/v4 v4.6.0. The SDK speaks to both OS 2.x and 3.x clusters with the same wire protocol for the 10 APIs this driver uses (see #1440's cross-version compatibility table). Transitive deps pull in onlygo-querystring; no AWS SDK (we don't use AWS Sigv4 transport).Test fixes folded in
Two deterministic regressions surfaced on a full
go test ./...against PR 1-merged main. Both are unrelated to the driver code above but block a clean run-everything signal, so they ride along here.(1) Follow-up to #1445 — fanout test missed the new normalizer policy (
internal/application/service/knowledgebase_search_fanout_test.go, +46 / −6).PR 1 changed
EngineAwareNormalizerfor Elasticsearch / ElasticFaiss / OpenSearch / Weaviate / Postgres / SQLite / Qdrant / TencentVectorDB / Doris from(score + 1) / 2(cosine-shift) toclamp01(score)(passthrough). The grounds were that those engines surface non-negative cosine to the normalizer (Lucenescript_scorenon-negative invariant for ES; k-NN pluginSpaceType.COSINESIMIL.scoreTranslationfor OpenSearch; engine-internal conversions or IR-normalized embeddings for the rest). Milvus is now the only engine that still surfaces raw signed cosine in[-1, 1].TestRetrieveFromStores_MixedEngine_Normalizeswas authored before #1445 and still asserted the old cosine-shift behaviour for ES (raw-0.4→ expected0.3). Under the new passthrough policy that input now maps toclamp01(-0.4) = 0, breaking the test. PR 1'snormalizer_test.gowas updated at amend time, but this fan-out integration test in theservice/package was missed.The fix: rewrite the godoc preamble to spell out which engines passthrough vs cosine-shift; restate sub-case 2 as ES passthrough on a production-possible mid-range cosine (
0.3 → 0.3, PG out-ranks ES — preserving the original ranking property); add sub-case 3 pinning the cosine-shift branch via Milvus-0.4 → 0.3so the[-1, 1]arm stays under integration-test coverage.(2) Pre-existing — SSRF whitelist singleton race surfaced by this run (
internal/utils/security.go+internal/utils/security_test.go+internal/infrastructure/web_search/searxng_test.go, +33 / −9 across 3 files).loadSSRFWhitelistininternal/utils/security.gocaches its parse ofSSRF_WHITELISTviasync.Onceon first call. Tests insideinternal/utils/reset the singleton between cases viaresetSSRFWhitelistForTest, but that helper was unexported, so tests in other packages could not reset and ended up seeing whatever whitelist was cached by the firstsync.Once.Do()within the same test binary.The concrete failure: in
internal/infrastructure/web_search/, the test binary contains bothproxy_test.goandsearxng_test.go. By alphabetical orderTestValidateProxyURLruns beforeTestValidateSearxngBaseURL.TestValidateProxyURLexercisesValidateURLForSSRFwith noSSRF_WHITELISTset, which causesloadSSRFWhitelistto cache an empty whitelist for the rest of the process. The laterTestValidateSearxngBaseURL/loopback_ok_via_whitelistsub-case then setsSSRF_WHITELIST=127.0.0.1,localhostand expects127.0.0.1to be whitelisted — but the singleton is already cached empty, soValidateURLForSSRFstill rejects127.0.0.1with\"hostname 127.0.0.1 is restricted\"and the test fails. Running the searxng tests in isolation (-run TestValidateSearxngBaseURL) hides the bug becauseproxy_testnever runs to seed the singleton.This is pre-existing on
main; surfaced now because this branch is the first to do a fullgo test ./...run on top of #1445.The fix: capitalize the helper to
ResetSSRFWhitelistForTest(theForTestsuffix is the test-only contract); update the in-package callers inutils/security_test.goto the new name; inweb_search/searxng_test.go, importinternal/utilsand callutils.ResetSSRFWhitelistForTest()around the env mutation in bothTestValidateSearxngBaseURLandTestSearxngProvider_Search(reset on both entry and exit so the singleton state andSSRF_WHITELISTenv stay in sync for subsequent tests in the same binary). No production code path changes — the reset helper is still test-only by contract, and no caller outside*_test.goinvokes it.SDK quirks observed against opensearch-go v4.6.0
Two of the three quirks discovered during full implementation surface in PR 2a's code (the third,
Refresh: *bool, only affects the delete path that ships in PR 2b):client.Indices.Alias.Existsreturns plain*errors.errorStringon non-2xx. The SDK passesdataPointer=nilto its internaldo(), so 404 responses come back as\"status: 404 Not Found\"rather than as the typed*opensearch.StructErrorthat the rest of this package's error classification expects.aliasExistsinspectsresp.StatusCodedirectly (resp is returned even when err is non-nil) and only falls back towrapTransportfor the "no response at all" case. This is the one intentional exception to the package's "typed errors only, no string matching" policy — the status code is read off*http.Response, not the error string.sync.OnceResetis not in the standard library. Honouring the transient-vs-permanent error policy is straightforward for the per-dimension path becausesync.Onceinstances live in amap[int]*sync.Oncewe can mutate (delete the entry to reset). The keyword-only fallback path (<base>_keywords, no per-dim variant) has only one such instance, so it uses async.Mutex+keywordsReady bool+keywordsErr errorflag pair instead of a baresync.Once. The transient retry semantics are identical.Backward compatibility
go.mod/go.sum, the two test files in the Test fixes folded in section, and the test-only export rename inutils/security.go.engine_type=opensearchVectorStore creation continues to fail with the "not a valid engine type" error from PR 1'svalidEngineTypesgate.RETRIEVE_DRIVER=opensearchboot remains a silent no-op (the env-store builder has noopensearcharm). PR 1'sTestOpenSearchRetrieverEngineType_NotInValidEngineTypesis the standing regression guard for the gated invariant.opensearch-go/v4 v4.6.0);govulncheck ./...recommended as a pre-merge gate.Test plan
go build ./...— cleango vet ./...— cleangofmt -lon touched files — cleango test -race -count=1 ./internal/application/repository/retriever/opensearch/...— passes (50 cases, ~1115 LoC of test code)go test ./...on top of PR 1-mergedmain— the only remaining failure isTestOssEnsureBucket_CreateFailsininternal/application/service/file/, which is pre-existing onmainand fails identically againstupstream/mainbefore this PR (depends on Aliyun OSS endpoint reachability + invalid-credentials response shape; this PR touches zero files in that package). The two test fixes folded in restore green onservice/(fanout) andweb_search/(SSRF singleton).validEngineTypesmap ininternal/types/vectorstore.gohas noOpenSearchRetrieverEngineTypeentry, so(*VectorStore).Validate()returns\"unsupported engine type: opensearch\"for any registration attempt.grep -r \"case types.OpenSearchRetrieverEngineType\" internal/shows only PR 1's normalizer case + this driver's ownEngineType()and tests — no activation path.grep -r \"case \\\"opensearch\\\"\" internal/shows no hits incontainer.go/engine_factory.go/BuildEnvVectorStores.