Problem Statement
OpenSearch PPL is strongly schema-bound and fail-fast: when data is dirty or drifting, queries abort instead of degrading gracefully. At scale, this makes analytics and operations brittle.
- Type/data mismatch. PPL evaluates expressions on strictly typed fields derived from mappings. Mixed representations—e.g., `"1"` (a string) or `[1]` (an array)—trigger cast/class errors that terminate the query and break dashboards. Operators lack TRY/SAFE semantics, so evaluation cannot proceed with null-on-error behavior. Example: `source=log* | eval xx=flags+1` fails when any shard holds `"1"` or `[1]` in `flags`; the desired behavior is `2` or `NULL`, plus an error counter.
- Schema-missing (`dynamic:false`) fields are invisible to PPL. PPL relies on mappings for field addressability and typing. When objects are mapped with `dynamic:false`, their subfields are present in `_source` but unmapped, so PPL cannot reference them, even though the Query DSL can project them from `_source`. This creates a usability gap, forces reindexing or mapping changes, and pushes users back to the DSL; it also pressures teams toward mapping explosion for discoverability. Example: with object `event` mapped `dynamic:false`, `fields event.user` returns 400 in PPL, while DSL with `"_source": ["event.user"]` succeeds.
- Explicit cast required when accessing extracted fields. When the REX/PARSE/SPATH commands are used to extract numeric values and arithmetic is then performed on them, the query fails because the extracted values are treated as strings rather than numbers.
- Cross-index type conflicts. In observability datasets, schemas drift: the same logical field may be `string` in some indices and `long` in others (often from JSON/text extraction). When a PPL query spans those indices, the engine effectively picks one variant to type the field, which is non-deterministic and error-prone.
Requirements
- Coverage: Handle type/data mismatches, `dynamic:false` invisibility, and cross-index type conflicts under permissive TRY/SAFE semantics.
- Performance: No performance regression.
- Reusability: Engine-agnostic design reusable for PPL on other engines, including Spark.
Expected Behaviour
Setting: `ppl.query.permissive` (default `false`); setting it to `true` enables permissive mode.
Scenario: `intA * stringB`, where `intA` is mapped as `integer` and `stringB` is unmapped.
Analyze stage: the planner resolves `intA` as type INTEGER. Since `stringB` is unmapped, the planner treats it as ANY (unknown at plan time).
Function resolution stage: the engine selects the multiplication function with signature (INTEGER, INTEGER) and automatically inserts safe casts so that both operands conform to the required types. The final expression is `*(safe_cast(intA AS INTEGER), safe_cast(stringB AS INTEGER))`.
Data scan and load: rows are read from OpenSearch as `Object[]`; no premature casting occurs during scanning.
Data processing (compiled expression): for each row, the engine attempts to convert `intA` and `stringB` to integers. If conversion fails for either operand, that value becomes `NULL` and an error counter is incremented for telemetry. The product is computed only when both casts succeed; otherwise the result is `NULL`. The compiled expression is roughly the following generated code:
```java
public Object current() {
  final Object[] current = (Object[]) inputEnumerator.current();
  // Safe cast of the first operand: a failed conversion yields null instead of throwing.
  final Integer v1 = current[0] == null ? null : (Integer) new org.apache.calcite.linq4j.function.Function0() {
    public Object apply() {
      try {
        return org.apache.calcite.runtime.SqlFunctions.toInt(current[0]);
      } catch (Exception e) {
        return null;
      }
    }
  }.apply();
  // Safe cast of the second operand.
  final Integer v2 = current[1] == null ? null : (Integer) new org.apache.calcite.linq4j.function.Function0() {
    public Object apply() {
      try {
        return org.apache.calcite.runtime.SqlFunctions.toInt(current[1]);
      } catch (Exception e) {
        return null;
      }
    }
  }.apply();
  // If either safe cast failed, the whole expression evaluates to null.
  return new Object[] {v1 == null || v2 == null ? null : Integer.valueOf(v1 * v2)};
}
```
Results: PPL processes the data correctly and tolerates type/data mismatches, e.g.:

| intA | stringB | intA * stringB | comments |
|---|---|---|---|
| 2 | "3" | 6 | |
| "2" | "3" | 6 | "2" can be converted to int, so the result is 6 |
| 2 | "3a" | null | "3a" cannot be cast to int, so the result is null |
| [2, 3, 4] | "3" | null | the array [2, 3, 4] cannot be cast to int, so the result is null |
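To make the table above concrete, here is a minimal, self-contained Java sketch of the null-on-error semantics it describes; `SafeMultiplyDemo`, `safeCastToInteger`, and `castErrors` are hypothetical names used only for illustration, not part of the actual engine code.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class SafeMultiplyDemo {

    // Hypothetical telemetry counter for failed casts.
    static final AtomicLong castErrors = new AtomicLong();

    // Coerce a value to Integer; return null (and count the error) when coercion fails.
    static Integer safeCastToInteger(Object value) {
        if (value == null) {
            return null;
        }
        if (value instanceof Number) {
            return ((Number) value).intValue();
        }
        if (value instanceof String) {
            try {
                return Integer.valueOf(((String) value).trim());
            } catch (NumberFormatException e) {
                castErrors.incrementAndGet();   // e.g. "3a"
                return null;
            }
        }
        castErrors.incrementAndGet();           // e.g. arrays such as [2, 3, 4]
        return null;
    }

    // intA * stringB with null-on-error semantics.
    static Integer safeMultiply(Object a, Object b) {
        Integer left = safeCastToInteger(a);
        Integer right = safeCastToInteger(b);
        return (left == null || right == null) ? null : left * right;
    }

    public static void main(String[] args) {
        System.out.println(safeMultiply(2, "3"));                 // 6
        System.out.println(safeMultiply("2", "3"));               // 6
        System.out.println(safeMultiply(2, "3a"));                // null
        System.out.println(safeMultiply(List.of(2, 3, 4), "3"));  // null
        System.out.println("cast errors: " + castErrors.get());   // 2
    }
}
```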
Decision Points
1. Type to use for missing or unknown fields
Two models are viable.
- String (Java `String`): may require extra parsing when numeric or boolean semantics are needed.
- Any (Java `Object`): minimizes unnecessary conversions but demands strict runtime checks. (Preferred)
2. Function resolution for unknown fields
Every function has one or more signatures. During resolution, the engine maps input field types, including unknowns, to compatible target types and inserts casts as needed. To handle type and data mismatches, choose a consistent approach (a sketch of both options follows this list):
- Normalize through strings: apply `safe_cast(field AS STRING)` followed by `safe_cast(STRING AS expected_type)` to harmonize arrays, numerics, and booleans.
- Keep values as `Object`: always surround inputs with `safe_cast(field AS expected_type)` so that failures yield `NULL` instead of exceptions. (Preferred)
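The trade-off between the two options can be sketched as follows; `viaStringNormalization` and `viaDirectSafeCast` are hypothetical helper names used only to illustrate the contrast, not engine APIs.

```java
// Sketch of the two resolution strategies above (hypothetical helper names).
// Both return null on failure instead of throwing.
public final class SafeCastOptions {

    // Option 1: normalize through strings, then parse into the expected type.
    // Even values that are already numeric pay a toString + parse round trip.
    static Integer viaStringNormalization(Object value) {
        if (value == null) {
            return null;
        }
        String s = String.valueOf(value);
        try {
            return Integer.valueOf(s.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Option 2 (preferred): keep the value as Object and safe-cast directly to the
    // expected type, parsing only when the runtime type actually requires it.
    static Integer viaDirectSafeCast(Object value) {
        if (value instanceof Number) {
            return ((Number) value).intValue();  // no string round trip for numerics
        }
        if (value instanceof String) {
            try {
                return Integer.valueOf(((String) value).trim());
            } catch (NumberFormatException e) {
                return null;
            }
        }
        return null;  // arrays and other non-coercible shapes
    }
}
```

The second option avoids the extra string round trip for values that already carry the right runtime type, which is why it is marked as preferred.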
3. Data loaded from OpenSearch
There are two loading strategies (sketched below).
- Convert all values to strings when constructing rows. This simplifies downstream logic but can increase CPU cost for numeric workloads.
- Preserve native Java types in rows and cast on demand. (Current approach)
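A rough sketch of the two strategies, assuming each hit's `_source` has already been parsed into a `Map<String, Object>`; the class and method names are hypothetical, not the engine's actual reader.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two row-loading strategies from decision point 3.
public final class RowLoading {

    // Strategy 1: stringify every value while building the row. Downstream logic only
    // ever sees strings, but numeric workloads pay a parse cost on every access.
    static Object[] loadRowAsStrings(Map<String, Object> source, List<String> fields) {
        Object[] row = new Object[fields.size()];
        for (int i = 0; i < fields.size(); i++) {
            Object value = source.get(fields.get(i));
            row[i] = value == null ? null : String.valueOf(value);
        }
        return row;
    }

    // Strategy 2 (current approach): keep the native Java types produced by the JSON
    // parser (Integer, Long, String, List, Map, ...) and let compiled expressions apply
    // safe casts on demand.
    static Object[] loadRowNative(Map<String, Object> source, List<String> fields) {
        Object[] row = new Object[fields.size()];
        for (int i = 0; i < fields.size(); i++) {
            row[i] = source.get(fields.get(i));
        }
        return row;
    }
}
```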
4. Output schema publication
When mappings are reliable, publish dataset-driven types. When mappings are unavailable or inconsistent, derive the output schema from operator requirements and the inserted casts. Fields affected by TRY semantics should be nullable, and the response should include drift and casting counters.