
[RFC] Discriminator-based lookups #5074

@Swiddis

Description

Problem Statement

The current lookup command is fine for simple use cases where there's a small number of distinct datasets to reference. A more flexible interface for selecting subsets of data could build on this and better support lookups on large and varied datasets. This would give lookup an identity distinct from regular joins.

Current State

The current lookup command is based directly on indices and internally maps to a right join, leaving it with almost no difference from a join beyond syntax.

A nontrivial side effect of this is that you can put two completely distinct datasets in one index. Consider the case of small host and client datasets:

// hosts
{"host_key":"47ea938f-71d3-46e6-8624-67b4c199062f","host_name":"payment-service-prod-ilkr","region":"eu-central-1"}
{"host_key":"fb28dbb8-7848-417f-908f-917d9f562704","host_name":"inventory-service-prod-fqju","region":"us-west-2"}
{"host_key":"9acdd317-60b0-4fe8-8874-736619fe5097","host_name":"payment-service-prod-sltj","region":"eu-west-1"}
{"host_key":"507e80aa-9d97-46cd-892a-6ee584bc6fd2","host_name":"order-service-prod-bmxf","region":"us-west-2"}
{"host_key":"f7f32960-ecff-4421-b4c3-9e76359a64a7","host_name":"notification-service-dev-udfq","region":"eu-west-1"}
{"host_key":"4b30a0c6-750c-4d74-a629-a76f7c6e4c7c","host_name":"order-service-prod-kaxu","region":"us-east-1"}
{"host_key":"6bda1ffc-ffde-433d-9b55-4a30a9a0c2d7","host_name":"user-service-dev-idco","region":"ap-southeast-1"}
{"host_key":"fde7b948-fa49-4f93-852f-e76ac39a9c7b","host_name":"order-service-prod-yask","region":"us-east-1"}
{"host_key":"633126a4-d67a-4539-af8d-045aafd9046f","host_name":"notification-service-prod-zkrz","region":"us-east-1"}
{"host_key":"45df36bf-512f-49a1-a815-46d5842f6a66","host_name":"order-service-prod-ahlo","region":"us-west-2"}

// clients
{"client_key":"422031b6-5e79-44aa-ac00-ea5377316c94","client_region":"Wisconsin","device_type":"Mobile"}
{"client_key":"b89950f7-e4e8-498d-8d66-b7d8187759f6","client_region":"New York","device_type":"Tablet"}
{"client_key":"e58c063a-25cb-4014-b2ab-bb47ed5cf781","client_region":"Oklahoma","device_type":"Server"}
{"client_key":"dc0966a2-3989-43ae-9398-0c84c038eb2f","client_region":"Georgia","device_type":"Mobile"}
{"client_key":"baec27ff-e176-4d8c-8629-f2d71f300d52","client_region":"Wisconsin","device_type":"Desktop"}
{"client_key":"0fa0d263-e5de-4f59-a4fc-00eb22d150ce","client_region":"Rhode Island","device_type":"Mobile"}
{"client_key":"b0b8882d-e0eb-4471-a870-f3ae47a78f91","client_region":"Alaska","device_type":"Server"}
{"client_key":"cc459730-0661-4ae9-8645-ad83a8cbb31c","client_region":"Oklahoma","device_type":"Desktop"}
{"client_key":"184b19c9-dcfe-4afe-be09-defa54d3c0b5","client_region":"Tennessee","device_type":"Server"}
{"client_key":"263caf2b-c8fa-425e-b95e-450730208cfc","client_region":"Oklahoma","device_type":"Desktop"}

When both datasets are ingested into the same index, lookups can still distinctly select either one, as they have no matching keys.

// host lookup
source = data | lookup dim_data host_key append region | ...

// client lookup
source = data | lookup dim_data client_key append device_type | ...

This can more or less be extended to overlapping key columns, as long as the UUIDs come from distinct sources: after s/client_key/host_key/ on the above, lookup dim_data host_key append device_type would still be unambiguous and return equivalent results, since the source data wouldn't contain any of the colliding IDs from the other lookup.

This is a decently flexible system as-is. Further constraints to distinguish lookups can be applied after the lookup if there's some sort of distinguishing value. Consider if the dataset were combined to:

{"_lookup":"host","key":"1","host_name":"payment-service-prod-ilkr","region":"eu-central-1"}
{"_lookup":"host","key":"2","host_name":"inventory-service-prod-fqju","region":"us-west-2"}
{"_lookup":"host","key":"3","host_name":"payment-service-prod-sltj","region":"eu-west-1"}
{"_lookup":"client","key":"1","client_region":"Wisconsin","device_type":"Mobile"}
{"_lookup":"client","key":"2","client_region":"New York","device_type":"Tablet"}
{"_lookup":"client","key":"3","client_region":"Oklahoma","device_type":"Server"}

Then an unambiguous lookup is still possible:

source = data
| lookup dim_data key append host_name
| where _lookup = 'host'

But this requires that users manually maintain their own lookup discriminant field, and resolve any naming conflicts that may arise.

Long-Term Goals

  • For users: Make it easier to enrich queries with join-like semantics, without the overhead of dedicated indices for each category of data.
  • For admins: Simplify operational overhead when running clusters where many users want to use their own datasets for result enrichment.

Out of Scope

  • Not implicitly solving the 10k fields problem as part of lookups by default. Once dynamic fields are more mature, this can be done by using a wide field and doing something like lookup | spath.
  • Not providing a detailed plugin-defined model for access control. Lookup indices can be maintained by admins. In particular, a field-level security policy can be introduced that limits certain discriminant patterns to specific users.

Proposal

This RFC suggests two enhancements to PPL lookups:

  1. Provide support for lookup discriminants as first-class citizens integrated with the lookup command.
  2. Introduce inputlookup and outputlookup commands which simplify querying and managing lookup-enabled indices.

From the user's perspective, this means addressing lookups as sub-resources of indices via dot notation, similar to how SQL already models Spark databases and tables.

// old
source = data
| lookup dim_data key append host_name
| where _lookup = 'host'

// new
source = data
| lookup dim_data.host key append host_name

These lookups then support easy read and write operations with inputlookup and outputlookup.

inputlookup dim_data.host
| where like(host_name, 'payment-%')

source = hosts
| where like(host_name, 'payment-%')
| outputlookup dim_data.host

Internally, this relies on a lookup field for discrimination, such as _lookup (type keyword). This is implicitly used in these queries, and can be configured with a cluster setting such as plugins.query.lookup.default_field_name. At read time, the lookup field may be fetched from index metadata.
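
For illustration, the discriminant conceptually just pre-filters the lookup side; a sketch assuming the default _lookup field name:

// lookup dim_data.host key append host_name
// behaves as if the lookup relation were
source = dim_data
| where _lookup = 'host'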

This introduces lookups as a property that can apply to an index. We can then add lookup support to the frontend data importer. Lookup references without a discriminant (plain dim_data) remain permitted for indices that don't have a lookup field.

When outputlookup faces a naming collision, it supports both an overwrite mode (default) and append mode (via outputlookup append=true).
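
For example, appending to the earlier write rather than overwriting it:

source = hosts
| where like(host_name, 'payment-%')
| outputlookup dim_data.host append=true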

Some other notes:

  • lookup commands should probably exclude the lookup discriminant field from results by default (I'm inclined to exclude it, though it's simpler not to). Regular searches on the lookup indices will still treat the discriminants as normal fields.
  • In case the index/cluster config isn't correct, or a dataset already has a useful discriminator field, it might be more flexible to let users specify lookup fields directly in the lookup command. We could introduce a using syntax such as lookup using _lookup dim_data.host (see the sketch after this list).
  • If an index field happens to collide with the lookup field name (unlikely), there's no special handling: we just treat it as the lookup field. The user can fix this by setting a new field name in the index config. As the discriminant is only applied for lookup commands (i.e. not any other searches), this should have no impact until the user tries to use lookups.
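
A fuller sketch of that proposed using syntax (hypothetical; the exact keyword placement is open):

source = data
| lookup using _lookup dim_data.host key append host_name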

Approach

The implementation will proceed loosely in four stages:

  1. Extend the grammar and codegen for Lookup to support a configurable discriminant field by default, which can be filtered in the query as described above. Verify the performance is acceptable for larger lookup indices.
  2. Introduce an inputlookup command that integrates with this system.
  3. Introduce an outputlookup command that integrates with this system.
  4. Go through feedback rounds with the team, improve any clear gaps with configuration flexibility/performance. Write docs for lookup config.

Alternatives

Some alternatives have historically been discussed for this:

  1. Introduce a system index for lookups and lookup data, fully automating the management. This gives us maximum flexibility and safety to define complete lookup safety semantics, and to create a dynamic lookup system that could scale arbitrarily. This was rejected due to implementation complexity; a full implementation would likely take multiple months.
  2. Maintain lookups as their own resource in a sort-of registry index, which still points to other indices. This is also quite flexible in terms of access control and can simplify the operational load for users. It might also integrate with the Resource Sharing APIs introduced in 3.3. But it again introduces more complexity to define the resource index and keep it consistent with the user content, whereas this proposal treats the data itself as the source of truth.
  3. Don't do anything. Punt all of this to become a UI problem. There might be a good way to mock this sort of setup with existing systems, if one is willing to accept lookup | where-style filters. I've mostly described this in the "Current State."

Implementation Discussion

1. Grammar Layer Changes

Create a new lookupTableSource rule to avoid affecting other commands:

lookupTableSource
   : tableQualifiedName (DOT lookupDiscriminant)?
   ;

lookupDiscriminant
   : ident
   ;

lookupCommand
   : LOOKUP lookupTableSource lookupMappingList ((APPEND | REPLACE | OUTPUT) outputCandidateList)?
   ;
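
For reference, both of these forms parse under lookupTableSource (discriminant optional):

lookup dim_data host_key append region
lookup dim_data.host host_key append region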

2. AST Layer Changes

Extend the Lookup AST node (core/src/main/java/org/opensearch/sql/ast/tree/Lookup.java):

public class Lookup extends UnresolvedPlan {
  private UnresolvedPlan child;
  private final UnresolvedPlan lookupRelation;
  private final Map<String, String> mappingAliasMap;
  private final OutputStrategy outputStrategy;
  private final Map<String, String> outputAliasMap;

  // New field for discriminant
  @Nullable
  private final String discriminant;  // e.g., "host" from "dim_data.host"

  // Constructor updated to accept discriminant parameter
}

Update AstBuilder.visitLookupCommand (ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java:981):

@Override
public UnresolvedPlan visitLookupCommand(OpenSearchPPLParser.LookupCommandContext ctx) {
  // Extract table name and discriminant from tableSource
  String discriminant = null;
  UnresolvedExpression tableExpr = this.internalVisitExpression(ctx.lookupTableSource().tableQualifiedName());

  if (ctx.lookupTableSource().lookupDiscriminant() != null) {
    discriminant = ctx.lookupTableSource().lookupDiscriminant().getText();
  }

  Relation lookupRelation = new Relation(tableExpr);
  Lookup.OutputStrategy strategy = /* ... existing logic ... */;
  Map<String, String> mappingAliasMap = /* ... existing logic ... */;
  Map<String, String> outputAliasMap = /* ... existing logic ... */;

  return new Lookup(lookupRelation, mappingAliasMap, strategy, outputAliasMap, discriminant);
}

3. Calcite Translation Layer Changes

Extend CalciteRelNodeVisitor.visitLookup (core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java:1492):

After resolving the lookup relation and before the join, inject a filter for the discriminant:

@Override
public RelNode visitLookup(Lookup node, CalcitePlanContext context) {
  // 1. resolve source side
  visitChildren(node, context);
  List<String> sourceFieldsNames = context.relBuilder.peek().getRowType().getFieldNames();

  // 2. resolve lookup table
  analyze(node.getLookupRelation(), context);

  // 2a. NEW: Add discriminant filter if present
  if (node.getDiscriminant() != null) {
    String lookupFieldName = getLookupFieldName(context);  // From cluster setting or index metadata
    RexNode discriminantFilter = context.rexBuilder.equals(
      context.relBuilder.field(lookupFieldName),
      context.rexBuilder.literal(node.getDiscriminant())
    );
    context.relBuilder.filter(discriminantFilter);
  }

  // 3. Add projection for lookup table if needed
  JoinAndLookupUtils.addProjectionIfNecessary(node, context);

  // ... rest of existing logic ...
}

Add helper method to retrieve lookup field name:

private String getLookupFieldName(CalcitePlanContext context) {
  // Priority order:
  // 1. Check index metadata for configured lookup field
  // 2. Fall back to cluster setting plugins.query.lookup.default_field_name
  // 3. Error: if the config doesn't exist, lookups either aren't supported or were disabled

  String lookupField = context.indexMetadata.getLookupFieldName();
  if (lookupField == null) {
    lookupField = context.settings.getSettingValue(Settings.Key.LOOKUP_DEFAULT_FIELD_NAME);
  }
  if (lookupField == null) {
    throw new IllegalStateException(
        "No lookup field configured; lookups either aren't supported or were disabled");
  }
  return lookupField;
}

The discriminant becomes a filter predicate (WHERE _lookup = 'host') that's pushed down to OpenSearch as part of the query optimization.
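
For instance, with the default field name, the pushed-down portion is an ordinary term query on a keyword field (sketch):

{"query": {"term": {"_lookup": "host"}}}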

4. New Commands: inputlookup and outputlookup

Grammar additions:

// Add to commands list
pplCommands
   : /* ... existing commands ... */
   | inputlookupCommand
   | outputlookupCommand
   ;

// inputlookup: read from lookup table
inputlookupCommand
   : INPUTLOOKUP lookupTableSource (whereClause)?
   ;

// outputlookup: write to lookup table
outputlookupCommand
   : OUTPUTLOOKUP lookupTableSource (APPEND EQUAL booleanLiteral)?
   ;

AST nodes:

Create InputLookup and OutputLookup AST nodes:

// InputLookup.java
public class InputLookup extends UnresolvedPlan {
  private final UnresolvedPlan lookupRelation;
  @Nullable private final String discriminant;
  @Nullable private final UnresolvedExpression whereCondition;
}

// OutputLookup.java
public class OutputLookup extends UnresolvedPlan {
  private UnresolvedPlan child;
  private final UnresolvedPlan lookupRelation;
  @Nullable private final String discriminant;
  private final boolean appendMode;  // true = append, false = overwrite
}
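
A minimal AstBuilder hook for inputlookup, mirroring visitLookupCommand above (the context accessor names are assumptions based on the grammar sketch):

@Override
public UnresolvedPlan visitInputlookupCommand(OpenSearchPPLParser.InputlookupCommandContext ctx) {
  UnresolvedExpression tableExpr =
      this.internalVisitExpression(ctx.lookupTableSource().tableQualifiedName());
  String discriminant =
      ctx.lookupTableSource().lookupDiscriminant() != null
          ? ctx.lookupTableSource().lookupDiscriminant().getText()
          : null;
  // whereClause accessor mirrors the optional (whereClause)? in the grammar sketch
  UnresolvedExpression whereCondition =
      ctx.whereClause() != null ? this.internalVisitExpression(ctx.whereClause()) : null;
  return new InputLookup(new Relation(tableExpr), discriminant, whereCondition);
}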

Calcite translation:

InputLookup translates to:

  1. LogicalTableScan on the lookup index
  2. LogicalFilter with discriminant condition (if present)
  3. LogicalFilter with WHERE condition (if present)
  4. Optional LogicalProject to exclude the discriminant field from output
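
A sketch of that plan construction with the RelBuilder, reusing getLookupFieldName from above (analyzeCondition is a hypothetical stand-in for the existing expression analysis):

@Override
public RelNode visitInputLookup(InputLookup node, CalcitePlanContext context) {
  // 1. scan the lookup index
  analyze(node.getLookupRelation(), context);
  // 2. discriminant filter, if present
  if (node.getDiscriminant() != null) {
    context.relBuilder.filter(context.rexBuilder.equals(
        context.relBuilder.field(getLookupFieldName(context)),
        context.rexBuilder.literal(node.getDiscriminant())));
  }
  // 3. user-supplied WHERE condition, if present (analyzeCondition is hypothetical)
  if (node.getWhereCondition() != null) {
    context.relBuilder.filter(analyzeCondition(node.getWhereCondition(), context));
  }
  // 4. a LogicalProject dropping the discriminant field could be added here
  return context.relBuilder.peek();
}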

OutputLookup translates to:

  1. Process child plan (the data to write)
  2. Add discriminant field with literal value via LogicalProject
  3. Convert to a write operation (See note below)

Note: OutputLookup requires additional infrastructure beyond Calcite planning. This will likely be some sort of special operation for converting to a _bulk request that has appropriate size bound checks and collision handling. There will be a future RFC dedicated to this, feel free to call out specific concerns that should be addressed as part of that RFC.
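
As a rough sketch of what the write path might emit for outputlookup dim_data.host, assuming the default _lookup field (exact mechanics deferred to that RFC):

// hypothetical _bulk payload, one action/document pair per row
{"index":{"_index":"dim_data"}}
{"_lookup":"host","key":"1","host_name":"payment-service-prod-ilkr","region":"eu-central-1"}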

5. Metadata and Configuration

Add new setting to Settings.java:

public enum Key {
  // ... existing settings ...
  LOOKUP_DEFAULT_FIELD_NAME("plugins.query.lookup.default_field_name", "_lookup");
}
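
Assuming the setting is registered as dynamic, it could then be adjusted through the standard cluster settings API:

PUT _cluster/settings
{"transient": {"plugins.query.lookup.default_field_name": "_lookup"}}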

Index Metadata Extension (if supporting per-index configuration -- open question):

Extend index metadata to store lookup field configuration:

  • Storage location: Index settings or custom metadata
  • Retrieval: During schema resolution in CalcitePlanContext
  • Fallback: Use cluster setting if not configured per-index

Per-index metadata provides flexibility but adds complexity. Initial implementation will use a cluster-wide setting only.

6. Validation and Error Handling

Grammar validation:

  • Ensure discriminant is identifier-safe (no special characters via ident rule)

Semantic validation (in Analyzer or CalciteRelNodeVisitor):

  • Verify lookup field exists in index schema
  • Verify lookup field is of type keyword (required for efficient filtering)
  • Error if discriminant specified but lookup field not found
  • Warning if lookup field exists but isn't indexed
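
A sketch of the field-existence check, assuming it runs wherever the lookup relation's row type is available (validateLookupField is a hypothetical helper; SemanticCheckException is the plugin's existing error type):

private void validateLookupField(String lookupField, RelDataType lookupRowType) {
  // error if the configured lookup field is missing from the lookup index schema
  if (lookupRowType.getField(lookupField, false, false) == null) {
    throw new SemanticCheckException(
        String.format("Lookup field '%s' not found in lookup index", lookupField));
  }
  // the keyword-type and is-indexed checks would consult OpenSearch field
  // mappings rather than the Calcite row type; omitted here
}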

Runtime considerations:

  • Discriminant filter should be pushed down to OpenSearch for efficiency
