
[RFC] Discriminator-based lookups #5074

@Swiddis

Description

Problem Statement

The current lookup command is fine for simple use cases where there's a small number of distinct datasets to reference. A more flexible interface for selecting subsets of data could build on this and better support lookups on large and varied datasets. This would give lookup an identity distinct from regular joins.

Current State

The current lookup command is based directly on indices and internally maps to a right join, leaving it with almost no difference from a join beyond syntax.

A nontrivial side effect of this is that you can put two completely distinct datasets in one index. Consider the case of small host and client datasets:

// hosts
{"host_key":"47ea938f-71d3-46e6-8624-67b4c199062f","host_name":"payment-service-prod-ilkr","region":"eu-central-1"}
{"host_key":"fb28dbb8-7848-417f-908f-917d9f562704","host_name":"inventory-service-prod-fqju","region":"us-west-2"}
{"host_key":"9acdd317-60b0-4fe8-8874-736619fe5097","host_name":"payment-service-prod-sltj","region":"eu-west-1"}
{"host_key":"507e80aa-9d97-46cd-892a-6ee584bc6fd2","host_name":"order-service-prod-bmxf","region":"us-west-2"}
{"host_key":"f7f32960-ecff-4421-b4c3-9e76359a64a7","host_name":"notification-service-dev-udfq","region":"eu-west-1"}
{"host_key":"4b30a0c6-750c-4d74-a629-a76f7c6e4c7c","host_name":"order-service-prod-kaxu","region":"us-east-1"}
{"host_key":"6bda1ffc-ffde-433d-9b55-4a30a9a0c2d7","host_name":"user-service-dev-idco","region":"ap-southeast-1"}
{"host_key":"fde7b948-fa49-4f93-852f-e76ac39a9c7b","host_name":"order-service-prod-yask","region":"us-east-1"}
{"host_key":"633126a4-d67a-4539-af8d-045aafd9046f","host_name":"notification-service-prod-zkrz","region":"us-east-1"}
{"host_key":"45df36bf-512f-49a1-a815-46d5842f6a66","host_name":"order-service-prod-ahlo","region":"us-west-2"}

// clients
{"client_key":"422031b6-5e79-44aa-ac00-ea5377316c94","client_region":"Wisconsin","device_type":"Mobile"}
{"client_key":"b89950f7-e4e8-498d-8d66-b7d8187759f6","client_region":"New York","device_type":"Tablet"}
{"client_key":"e58c063a-25cb-4014-b2ab-bb47ed5cf781","client_region":"Oklahoma","device_type":"Server"}
{"client_key":"dc0966a2-3989-43ae-9398-0c84c038eb2f","client_region":"Georgia","device_type":"Mobile"}
{"client_key":"baec27ff-e176-4d8c-8629-f2d71f300d52","client_region":"Wisconsin","device_type":"Desktop"}
{"client_key":"0fa0d263-e5de-4f59-a4fc-00eb22d150ce","client_region":"Rhode Island","device_type":"Mobile"}
{"client_key":"b0b8882d-e0eb-4471-a870-f3ae47a78f91","client_region":"Alaska","device_type":"Server"}
{"client_key":"cc459730-0661-4ae9-8645-ad83a8cbb31c","client_region":"Oklahoma","device_type":"Desktop"}
{"client_key":"184b19c9-dcfe-4afe-be09-defa54d3c0b5","client_region":"Tennessee","device_type":"Server"}
{"client_key":"263caf2b-c8fa-425e-b95e-450730208cfc","client_region":"Oklahoma","device_type":"Desktop"}

When both datasets are ingested into the same index, lookups can still distinctly select either one, as they have no matching keys.

// host lookup
source = data | lookup dim_data host_key append region | ...

// client lookup
source = data | lookup dim_data client_key append device_type | ...

This can more or less be extended to overlapping key columns, as long as the UUIDs come from distinct sources: after s/client_key/host_key/ on the above, lookup dim_data host_key append device_type would still be unambiguous and return equivalent results, since the source data wouldn't contain any of the colliding IDs from the other lookup.

This is a decently flexible system as-is. Further constraints to distinguish lookups can be applied after the lookup if there's some sort of distinguishing value. Consider if the dataset were combined to:

{"_lookup":"host","key":"1","host_name":"payment-service-prod-ilkr","region":"eu-central-1"}
{"_lookup":"host","key":"2","host_name":"inventory-service-prod-fqju","region":"us-west-2"}
{"_lookup":"host","key":"3","host_name":"payment-service-prod-sltj","region":"eu-west-1"}
{"_lookup":"client","key":"1","client_region":"Wisconsin","device_type":"Mobile"}
{"_lookup":"client","key":"2","client_region":"New York","device_type":"Tablet"}
{"_lookup":"client","key":"3","client_region":"Oklahoma","device_type":"Server"}

Then an unambiguous lookup is still possible:

source = data
| lookup dim_data key append host_name
| where _lookup = 'host'

But this requires that users manually maintain their own lookup discriminant field, and resolve any naming conflicts that may arise.

Long-Term Goals

  • For users: Make it easier to enrich queries with join-like semantics, without the overhead of dedicated indices for each category of data.
  • For admins: Simplify operational overhead when running clusters where many users want to use their own datasets for result enrichment.

Out of Scope

  • Not implicitly solving the 10k fields problem as part of lookups by default. Once dynamic fields are more mature, this can be done by using a wide field and doing something like lookup | spath.
  • Not providing a detailed plugin-defined model for access control. Lookup indices can be maintained by admins. In particular, a field-level security policy can be introduced that limits certain discriminant patterns to specific users.

Proposal

This RFC suggests two enhancements to PPL lookups:

  1. Provide support for lookup discriminants as first-class citizens integrated with the lookup command.
  2. Introduce inputlookup and outputlookup commands which simplify querying and managing lookup-enabled indices.

From the user's perspective, this means addressing lookups as sub-resources of indices via dot notation, similar to how SQL already models Spark databases and tables.

// old
source = data
| lookup dim_data key append host_name
| where _lookup = 'host'

// new
source = data
| lookup dim_data.host key append host_name

These lookups then support easy read and write operations with inputlookup and outputlookup.

inputlookup dim_data.host
| where like(host_name, 'payment-%')

source = hosts
| where like(host_name, 'payment-%')
| outputlookup dim_data.host

Internally, this relies on a lookup field for discrimination, such as _lookup (type keyword). This is implicitly used in these queries, and can be configured with a cluster setting such as plugins.query.lookup.default_field_name. At read time, the lookup field may be fetched from index metadata.
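
For illustration, the discriminant conceptually just pre-filters the lookup side; a sketch assuming the default _lookup field name:

// lookup dim_data.host key append host_name
// behaves as if the lookup relation were
source = dim_data
| where _lookup = 'host'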

This introduces lookups as a property that can apply to an index. We can then add lookup support to the frontend data importer. Lookup references without a discriminant (plain dim_data) remain permitted for indices that don't have a lookup field.

When outputlookup faces a naming collision, it supports both an overwrite mode (default) and append mode (via outputlookup append=true).
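
For example, appending to the earlier write rather than overwriting it:

source = hosts
| where like(host_name, 'payment-%')
| outputlookup dim_data.host append=true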

Some other notes:

  • lookup commands should probably exclude the lookup discriminant field from results by default (I'm inclined to exclude it, though it's simpler not to). Regular searches on the lookup indices will still treat the discriminants as normal fields.
  • In case the index/cluster config isn't correct, or a dataset already has a useful discriminator field, it might be more flexible to let users specify lookup fields directly in the lookup command. We could introduce a using syntax such as lookup using _lookup dim_data.host (see the sketch after this list).
  • If an index field happens to collide with the lookup field name (unlikely), there's no special handling: we just treat it as the lookup field. The user can fix this by setting a new field name in the index config. As the discriminant is only applied for lookup commands (i.e. not any other searches), this should have no impact until the user tries to use lookups.
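
A fuller sketch of that proposed using syntax (hypothetical; the exact keyword placement is open):

source = data
| lookup using _lookup dim_data.host key append host_name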

Approach

The implementation will proceed loosely in four stages:

  1. Extend the grammar and codegen for Lookup to support a configurable discriminant field by default, which can be filtered in the query as described above. Verify the performance is acceptable for larger lookup indices.
  2. Introduce an inputlookup command that integrates with this system.
  3. Introduce an outputlookup command that integrates with this system.
  4. Go through feedback rounds with the team, improve any clear gaps with configuration flexibility/performance. Write docs for lookup config.

Alternatives

Some alternatives have historically been discussed for this:

  1. Introduce a system index for lookups and lookup data, fully automating the management. This gives us maximum flexibility and safety to define complete lookup safety semantics, and to create a dynamic lookup system that could scale arbitrarily. This was rejected due to implementation complexity; a full implementation would likely take multiple months.
  2. Maintain lookups as their own resource in a sort-of registry index, which still points to other indices. This is also quite flexible in terms of access control and can simplify the operational load for users. It might also integrate with the Resource Sharing APIs introduced in 3.3. But it again introduces more complexity to define the resource index and keep it consistent with the user content, whereas this proposal treats the data itself as the source of truth.
  3. Don't do anything. Punt all of this to become a UI problem. There might be a good way to mock this sort of setup with existing systems, if one is willing to accept lookup | where-style filters. I've mostly described this in the "Current State."

Implementation Discussion

1. Grammar Layer Changes

Create a new lookupTableSource rule to avoid affecting other commands:

lookupTableSource
   : tableQualifiedName (DOT lookupDiscriminant)?
   ;

lookupDiscriminant
   : ident
   ;

lookupCommand
   : LOOKUP lookupTableSource lookupMappingList ((APPEND | REPLACE | OUTPUT) outputCandidateList)?
   ;
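
For reference, both of these forms parse under lookupTableSource (discriminant optional):

lookup dim_data host_key append region
lookup dim_data.host host_key append region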

2. AST Layer Changes

Extend the Lookup AST node (core/src/main/java/org/opensearch/sql/ast/tree/Lookup.java):

public class Lookup extends UnresolvedPlan {
  private UnresolvedPlan child;
  private final UnresolvedPlan lookupRelation;
  private final Map<String, String> mappingAliasMap;
  private final OutputStrategy outputStrategy;
  private final Map<String, String> outputAliasMap;

  // New field for discriminant
  @Nullable
  private final String discriminant;  // e.g., "host" from "dim_data.host"

  // Constructor updated to accept discriminant parameter
}

Update AstBuilder.visitLookupCommand (ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java:981):

@Override
public UnresolvedPlan visitLookupCommand(OpenSearchPPLParser.LookupCommandContext ctx) {
  // Extract table name and discriminant from tableSource
  String discriminant = null;
  UnresolvedExpression tableExpr = this.internalVisitExpression(ctx.lookupTableSource().tableQualifiedName());

  if (ctx.lookupTableSource().lookupDiscriminant() != null) {
    discriminant = ctx.lookupTableSource().lookupDiscriminant().getText();
  }

  Relation lookupRelation = new Relation(tableExpr);
  Lookup.OutputStrategy strategy = /* ... existing logic ... */;
  Map<String, String> mappingAliasMap = /* ... existing logic ... */;
  Map<String, String> outputAliasMap = /* ... existing logic ... */;

  return new Lookup(lookupRelation, mappingAliasMap, strategy, outputAliasMap, discriminant);
}

3. Calcite Translation Layer Changes

Extend CalciteRelNodeVisitor.visitLookup (core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java:1492):

After resolving the lookup relation and before the join, inject a filter for the discriminant:

@Override
public RelNode visitLookup(Lookup node, CalcitePlanContext context) {
  // 1. resolve source side
  visitChildren(node, context);
  List<String> sourceFieldsNames = context.relBuilder.peek().getRowType().getFieldNames();

  // 2. resolve lookup table
  analyze(node.getLookupRelation(), context);

  // 2a. NEW: Add discriminant filter if present
  if (node.getDiscriminant() != null) {
    String lookupFieldName = getLookupFieldName(context);  // From cluster setting or index metadata
    RexNode discriminantFilter = context.rexBuilder.equals(
      context.relBuilder.field(lookupFieldName),
      context.rexBuilder.literal(node.getDiscriminant())
    );
    context.relBuilder.filter(discriminantFilter);
  }

  // 3. Add projection for lookup table if needed
  JoinAndLookupUtils.addProjectionIfNecessary(node, context);

  // ... rest of existing logic ...
}

Add helper method to retrieve lookup field name:

private String getLookupFieldName(CalcitePlanContext context) {
  // Priority order:
  // 1. Check index metadata for configured lookup field
  // 2. Fall back to cluster setting plugins.query.lookup.default_field_name
  // 3. Error: if the config doesn't exist, lookups either aren't supported or were disabled

  String lookupField = context.indexMetadata.getLookupFieldName();
  if (lookupField == null) {
    lookupField = context.settings.getSettingValue(Settings.Key.LOOKUP_DEFAULT_FIELD_NAME);
  }
  if (lookupField == null) {
    throw new IllegalStateException(
        "No lookup field configured; lookups either aren't supported or were disabled");
  }
  return lookupField;
}

The discriminant becomes a filter predicate (WHERE _lookup = 'host') that's pushed down to OpenSearch as part of the query optimization.
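
For instance, with the default field name, the pushed-down portion is an ordinary term query on a keyword field (sketch):

{"query": {"term": {"_lookup": "host"}}}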

4. New Commands: inputlookup and outputlookup

Grammar additions:

// Add to commands list
pplCommands
   : /* ... existing commands ... */
   | inputlookupCommand
   | outputlookupCommand
   ;

// inputlookup: read from lookup table
inputlookupCommand
   : INPUTLOOKUP lookupTableSource (whereClause)?
   ;

// outputlookup: write to lookup table
outputlookupCommand
   : OUTPUTLOOKUP lookupTableSource (APPEND EQUAL booleanLiteral)?
   ;

AST nodes:

Create InputLookup and OutputLookup AST nodes:

// InputLookup.java
public class InputLookup extends UnresolvedPlan {
  private final UnresolvedPlan lookupRelation;
  @Nullable private final String discriminant;
  @Nullable private final UnresolvedExpression whereCondition;
}

// OutputLookup.java
public class OutputLookup extends UnresolvedPlan {
  private UnresolvedPlan child;
  private final UnresolvedPlan lookupRelation;
  @Nullable private final String discriminant;
  private final boolean appendMode;  // true = append, false = overwrite
}
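
A minimal AstBuilder hook for inputlookup, mirroring visitLookupCommand above (the context accessor names are assumptions based on the grammar sketch):

@Override
public UnresolvedPlan visitInputlookupCommand(OpenSearchPPLParser.InputlookupCommandContext ctx) {
  UnresolvedExpression tableExpr =
      this.internalVisitExpression(ctx.lookupTableSource().tableQualifiedName());
  String discriminant =
      ctx.lookupTableSource().lookupDiscriminant() != null
          ? ctx.lookupTableSource().lookupDiscriminant().getText()
          : null;
  // whereClause accessor mirrors the optional (whereClause)? in the grammar sketch
  UnresolvedExpression whereCondition =
      ctx.whereClause() != null ? this.internalVisitExpression(ctx.whereClause()) : null;
  return new InputLookup(new Relation(tableExpr), discriminant, whereCondition);
}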

Calcite translation:

InputLookup translates to:

  1. LogicalTableScan on the lookup index
  2. LogicalFilter with discriminant condition (if present)
  3. LogicalFilter with WHERE condition (if present)
  4. Optional LogicalProject to exclude the discriminant field from output
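
A sketch of that plan construction with the RelBuilder, reusing getLookupFieldName from above (analyzeCondition is a hypothetical stand-in for the existing expression analysis):

@Override
public RelNode visitInputLookup(InputLookup node, CalcitePlanContext context) {
  // 1. scan the lookup index
  analyze(node.getLookupRelation(), context);
  // 2. discriminant filter, if present
  if (node.getDiscriminant() != null) {
    context.relBuilder.filter(context.rexBuilder.equals(
        context.relBuilder.field(getLookupFieldName(context)),
        context.rexBuilder.literal(node.getDiscriminant())));
  }
  // 3. user-supplied WHERE condition, if present (analyzeCondition is hypothetical)
  if (node.getWhereCondition() != null) {
    context.relBuilder.filter(analyzeCondition(node.getWhereCondition(), context));
  }
  // 4. a LogicalProject dropping the discriminant field could be added here
  return context.relBuilder.peek();
}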

OutputLookup translates to:

  1. Process child plan (the data to write)
  2. Add discriminant field with literal value via LogicalProject
  3. Convert to a write operation (See note below)

Note: OutputLookup requires additional infrastructure beyond Calcite planning. This will likely be some sort of special operation for converting to a _bulk request that has appropriate size bound checks and collision handling. There will be a future RFC dedicated to this, feel free to call out specific concerns that should be addressed as part of that RFC.
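
As a rough sketch of what the write path might emit for outputlookup dim_data.host, assuming the default _lookup field (exact mechanics deferred to that RFC):

// hypothetical _bulk payload, one action/document pair per row
{"index":{"_index":"dim_data"}}
{"_lookup":"host","key":"1","host_name":"payment-service-prod-ilkr","region":"eu-central-1"}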

5. Metadata and Configuration

Add new setting to Settings.java:

public enum Key {
  // ... existing settings ...
  LOOKUP_DEFAULT_FIELD_NAME("plugins.query.lookup.default_field_name", "_lookup");
}
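
Assuming the setting is registered as dynamic, it could then be adjusted through the standard cluster settings API:

PUT _cluster/settings
{"transient": {"plugins.query.lookup.default_field_name": "_lookup"}}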

Index Metadata Extension (if supporting per-index configuration -- open question):

Extend index metadata to store lookup field configuration:

  • Storage location: Index settings or custom metadata
  • Retrieval: During schema resolution in CalcitePlanContext
  • Fallback: Use cluster setting if not configured per-index

Per-index metadata provides flexibility but adds complexity. Initial implementation will use a cluster-wide setting only.

6. Validation and Error Handling

Grammar validation:

  • Ensure discriminant is identifier-safe (no special characters via ident rule)

Semantic validation (in Analyzer or CalciteRelNodeVisitor):

  • Verify lookup field exists in index schema
  • Verify lookup field is of type keyword (required for efficient filtering)
  • Error if discriminant specified but lookup field not found
  • Warning if lookup field exists but isn't indexed
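
A sketch of the field-existence check, assuming it runs wherever the lookup relation's row type is available (validateLookupField is a hypothetical helper; SemanticCheckException is the plugin's existing error type):

private void validateLookupField(String lookupField, RelDataType lookupRowType) {
  // error if the configured lookup field is missing from the lookup index schema
  if (lookupRowType.getField(lookupField, false, false) == null) {
    throw new SemanticCheckException(
        String.format("Lookup field '%s' not found in lookup index", lookupField));
  }
  // the keyword-type and is-indexed checks would consult OpenSearch field
  // mappings rather than the Calcite row type; omitted here
}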

Runtime considerations:

  • Discriminant filter should be pushed down to OpenSearch for efficiency
