feat(search): enable V3 entity index as dual search target alongside V2 #17013

Open
loustler wants to merge 1 commit into datahub-project:master from loustler:feat/search-enable-v3-queries

@loustler
Contributor

Summary

DataHub's V3 entity index is currently write-only: data is indexed into V3 during ingestion and on metadata changes, but search queries target only the V2 per-entity indices. As a result, V3-only features (unified scoring, tiered search fields, custom analyzers) are not available to end users.

This PR enables the V3 shared entity index as a search target alongside existing V2 indices, making V3 data discoverable while maintaining full backward compatibility with V2. Every search query now targets both <entity>index_v2 and *index_v3 patterns simultaneously.

Depends on: PR #17009 (V3 entity type aggregation and filter compatibility layer) which handles result merging, deduplication, and aggregation normalization.

Changes

Existing Behavior — Modified

Search queries previously targeted only V2 per-entity indices (e.g., datasetindex_v2, chartindex_v2).

→ Queries now target both V2 and V3 indices via Stream.concat(v2Patterns, indexConvention.getV3EntityIndexPatterns().stream()). This applies to all five search paths:

| Method | Purpose |
| --- | --- |
| buildSearchRequest() | Main search (search bar, API) |
| filter() | Filter-based entity listing (admin pages, glossary) |
| buildAutocompleteRequest() | Typeahead autocomplete suggestions |
| buildAggregateByValue() | Facet aggregation for sidebar filters |
| buildScrollRequest() | Paginated scroll through large result sets |
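
As a rough illustration of the pattern concatenation described above, the sketch below joins a list of V2 index names with a list of V3 patterns into the array a search request would target. The class and method names (DualIndexTargets, searchTargets) are hypothetical, and the V3 pattern list is passed in as a plain list rather than read from IndexConvention; this is not the code in ESSearchDAO.

```java
import java.util.List;
import java.util.stream.Stream;

// Hypothetical sketch; names here are illustrative, not taken from the PR.
public class DualIndexTargets {

    // Concatenates V2 per-entity index names with V3 shared index patterns,
    // preserving order: all V2 entries first, then all V3 entries.
    static String[] searchTargets(List<String> v2Patterns, List<String> v3Patterns) {
        return Stream.concat(v2Patterns.stream(), v3Patterns.stream())
            .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] targets = searchTargets(
            List.of("datasetindex_v2", "chartindex_v2"),
            List.of("*index_v3")); // assumed shape of the V3 pattern
        System.out.println(String.join(",", targets));
        // prints "datasetindex_v2,chartindex_v2,*index_v3"
    }
}
```

If the V3 pattern list is empty, the result is simply the V2 targets, so the same helper could serve all five search paths.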

New Behavior — Added

Entity type filter (applyEntityTypeFilter()): Since V3 is a shared multi-entity index, queries must be wrapped with an entity type filter to prevent cross-entity contamination. The filter uses a bool query with two should branches:

  • must_not exists _entityType — lets V2 documents (which lack this field) pass through
  • terms _entityType [entityNames] — restricts V3 documents to requested types

Entity names must use camelCase from EntitySpec.getName() (e.g., glossaryNode, corpUser) to match V3's stored _entityType values. Lowercase names produce 0 matches on V3.
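
The two should branches can be sketched as a query DSL structure built from plain Java maps. This is only an approximation under stated assumptions: the class and method names (EntityTypeFilterSketch, entityTypeFilter, wrap) are hypothetical, the real applyEntityTypeFilter() presumably uses the search client's query builders, and attaching the filter through the outer bool's filter clause is an assumption about how the wrapping is done.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the PR's applyEntityTypeFilter() implementation.
public class EntityTypeFilterSketch {
    static final String ENTITY_TYPE_FIELD = "_entityType";

    // Builds: bool { should: [ must_not exists _entityType, terms _entityType [...] ] }
    static Map<String, Object> entityTypeFilter(List<String> entityNames) {
        // Branch 1: documents without _entityType (V2 documents) pass through.
        Map<String, Object> v2PassThrough = Map.of(
            "bool", Map.of("must_not", List.of(
                Map.of("exists", Map.of("field", ENTITY_TYPE_FIELD)))));
        // Branch 2: V3 documents must carry one of the requested camelCase names.
        Map<String, Object> v3Restriction = Map.of(
            "terms", Map.of(ENTITY_TYPE_FIELD, entityNames));
        return Map.of("bool", Map.of(
            "should", List.of(v2PassThrough, v3Restriction),
            // Stated explicitly so at least one branch must match.
            "minimum_should_match", 1));
    }

    // Assumption: the original query is kept as a must clause and the
    // entity type restriction is added as a non-scoring filter clause.
    static Map<String, Object> wrap(Map<String, Object> originalQuery, List<String> entityNames) {
        return Map.of("bool", Map.of(
            "must", List.of(originalQuery),
            "filter", List.of(entityTypeFilter(entityNames))));
    }

    public static void main(String[] args) {
        Map<String, Object> query = Map.of("match_all", Map.of());
        System.out.println(wrap(query, List.of("glossaryNode", "corpUser")));
    }
}
```

Passing lowercase names such as glossarynode into the terms branch would leave only the V2 pass-through branch able to match, which is consistent with the zero-match behavior on V3 noted above.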

V3 keyword subfield (MultiEntityMappingsBuilder): Added .keyword subfield for keyword-type fields in V3 mappings so aggregation queries like owners.keyword work consistently across V2 and V3 indices during the transition period.
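
For orientation, a keyword field with a .keyword multi-field might be shaped as in the sketch below. The exact mapping emitted by MultiEntityMappingsBuilder is not shown in this PR description, so the structure and the class and method names here (KeywordSubfieldSketch, keywordFieldWithSubfield) are assumptions.

```java
import java.util.Map;

// Hypothetical sketch of a mapping fragment; not the builder's actual output.
public class KeywordSubfieldSketch {

    // A keyword-typed field that also exposes itself under "<field>.keyword",
    // so an aggregation on e.g. owners.keyword resolves on this mapping too.
    static Map<String, Object> keywordFieldWithSubfield() {
        return Map.of(
            "type", "keyword",
            "fields", Map.of("keyword", Map.of("type", "keyword")));
    }

    public static void main(String[] args) {
        Map<String, Object> properties = Map.of("owners", keywordFieldWithSubfield());
        System.out.println(properties.get("owners"));
    }
}
```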

Filesystem config loading (BaseConfigurationLoader): Added filesystem path fallback before classpath resource lookup. This enables loading external analyzer configurations (e.g., nori, kuromoji) from Kubernetes ConfigMap mounts at paths like /etc/datahub/analyzer-config.yaml without requiring them to be on the Java classpath.
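
A filesystem-first lookup with a classpath fallback could look roughly like the following. The class and method names (FilesystemFirstLoader, open) are hypothetical and the fallback details are assumptions; BaseConfigurationLoader may resolve paths and report errors differently.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch; not the PR's BaseConfigurationLoader.
public class FilesystemFirstLoader {

    // Tries the location as a filesystem path first (e.g. a ConfigMap mount),
    // then falls back to a classpath resource of the same name.
    static InputStream open(String location) throws IOException {
        Path path = Paths.get(location);
        if (Files.isRegularFile(path)) {
            return Files.newInputStream(path);
        }
        // ClassLoader resource names must not start with a slash.
        String resource = location.startsWith("/") ? location.substring(1) : location;
        InputStream in = FilesystemFirstLoader.class.getClassLoader().getResourceAsStream(resource);
        if (in == null) {
            throw new FileNotFoundException("Not found on filesystem or classpath: " + location);
        }
        return in;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder content standing in for an external analyzer config.
        Path tmp = Files.createTempFile("analyzer-config", ".yaml");
        Files.write(tmp, "placeholder: true\n".getBytes(StandardCharsets.UTF_8));
        try (InputStream in = open(tmp.toString())) {
            System.out.print(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Checking the filesystem first means a mounted file takes precedence over a bundled resource with the same name.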

Key Files Modified

| File | Change |
| --- | --- |
| metadata-io/.../ESSearchDAO.java | V3 index pattern concatenation in all 5 search methods + applyEntityTypeFilter() |
| metadata-io/.../MultiEntityMappingsBuilder.java | .keyword subfield for V3 keyword fields |
| metadata-io/.../BaseConfigurationLoader.java | Filesystem-first config loading fallback |
| metadata-io/.../BaseConfigurationLoaderTest.java | Test for filesystem path loading |

Configuration

No new configuration is required. V3 index patterns are derived from the existing IndexConvention.getV3EntityIndexPatterns(), which is controlled by the ELASTICSEARCH_ENTITY_INDEX_V3_ENABLED environment variable. If V3 is not enabled, the V3 pattern matches no indices and queries behave exactly as before.

Migration Notes

  • Backward compatible: Deployments without V3 indices see no behavior change (wildcard pattern matches nothing).
  • No reindexing required: V3 indices are populated by existing write paths; this PR only adds them to read paths.
  • Gradual rollout: V3 can be enabled/disabled without data loss since V2 indices remain the primary source of truth.

Checklist

  • The PR conforms to DataHub's Contributing Guideline
  • Links to related issue: N/A
  • Tests added/updated: BaseConfigurationLoaderTest for filesystem path loading

🤖 Generated with Claude Code

@github-actions (bot) added the product label (PR or Issue related to the DataHub UI/UX) on Apr 14, 2026
@github-actions
Contributor

Linear: PFP-3345

Thanks for your contribution! We have created an internal ticket to track this PR. A member of the core DataHub team will be assigned to review it within the next few business days - you will get a follow-up comment once a reviewer is assigned.

@github-actions (bot) added the community-contribution label (PR or Issue raised by member(s) of DataHub Community) on Apr 14, 2026
@codecov

codecov bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@codecov

codecov bot commented Apr 14, 2026

Bundle Report

Bundle size has no change ✅

@loustler force-pushed the feat/search-enable-v3-queries branch from 1947dd4 to 421718f on April 14, 2026 07:53
@maggiehays added the needs-review label (Label for PRs that need review from a maintainer.) on Apr 14, 2026
@loustler force-pushed the feat/search-enable-v3-queries branch from 421718f to ffabf60 on April 14, 2026 10:11
Add V3 shared entity index patterns to all five search query paths
(search, filter, autocomplete, aggregate, scroll) so V3-indexed data
is discoverable. Apply _entityType filter to restrict V3 results to
requested entity types. Add .keyword subfield to V3 keyword mappings
for aggregation compatibility. Support filesystem-first config loading
for external analyzer configurations (e.g., Kubernetes ConfigMap mounts).

Labels

  • community-contribution: PR or Issue raised by member(s) of DataHub Community
  • needs-review: Label for PRs that need review from a maintainer.
  • product: PR or Issue related to the DataHub UI/UX

2 participants