Skip to content

Integration tests for LOOKUP JOIN over wider range of data types #126150

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
May 7, 2025

Conversation

craigtaverner
Copy link
Contributor

@craigtaverner craigtaverner commented Apr 2, 2025

This test suite tests the lookup join functionality in ESQL with various data types.

For each pair of types being tested, it builds a main index called "index" containing a single document with as many fields as types being tested on the left of the pair, and then creates that many other lookup indexes, each with a single document containing exactly two fields: the field to join on, and a field to return.

The assertion is that for valid combinations, the return result should exist, and for invalid combinations an exception should be thrown. If no exception is thrown, and no result is returned, our validation rules are not aligned with the internal behaviour (ie. a bug).

Since the LOOKUP JOIN command requires the match field name to be the same between the main index and the lookup index, we will have field names that correctly represent the type of the field in the main index, but not the type of the field in the lookup index. This can be confusing, but it is important to remember that the field names are not the same as the types.

For example, if we are testing the pairs (double, double), (double, float), (float, double) and (float, float), we will create the following indexes:

index_double_double
Index containing a single document with a field of type 'double' like:
        {
            "field_double": 1.0,  // this is mapped as type 'double'
            "other": "value"
        }
    
index_double_float
Index containing a single document with a field of type 'float' like:
        {
            "field_double": 1.0,  // this is mapped as type 'float' (a float with the name of the main index field) 
            "other": "value"
        }
    
index_float_double
Index containing a single document with a field of type 'double' like:
        {
            "field_float": 1.0,  // this is mapped as type 'double' (a double with the name of the main index field)
            "other": "value"
        }
    
index_float_float
Index containing single document with a field of type 'float' like:
        {
            "field_float": 1.0,  // this is mapped as type 'float'
            "other": "value"
        }
    
index
Index containing document like:
        {
            "field_double": 1.0,  // this is mapped as type 'double'
            "field_float": 1.0    // this is mapped as type 'float'
        }
    

Note that the lookup indexes have fields with a name that matches the type in the main index, and not the type actually used in the lookup index. Instead, the mapped type should be the type of the right-hand side of the pair being tested.
Then we can run queries like:

    FROM index | LOOKUP JOIN index_double_float ON field_double | KEEP other

And assert that the result exists and is equal to "value".

Checklist:

  • Replicate existing KEYWORD/TEXT tests
  • Test Float/Double combinations
  • Test all integer type combinations
  • Test mixed numerical types (integer vs float)
  • Test for all types where left-right are the same type (all is a fixed list of testable types)
  • Test for all type combinations where left-right are different types
  • Expand the all list by depending on more modules (eg. unsigned_long and spatial types)
    • unsigned_long (from x-pack:plugin:mapper-unsigned-long)
    • version (found in module VersionFieldPlugin)
    • spatial types (requires local type mapping to remove cartesian_ prefix and module access for shapes
    • datetime (added tests)
    • date_nanos(no results are found even with identical fields, so we added this to the unsupported list and throw a validation error) - Created an issue for this at LOOKUP JOIN on date_nanos fields finds zero matches #127249
  • Explicitly test all unsupported types (should throw errors)
    • Define list of unsupported types
    • Assert they throw errors, and update Join validation if they do not
    • Assert that the set of supported and unsupported types covers all types (minus some non-types, like _doc, etc.)

One thing to consider changing here is allowing float and integer types to be used together. Right now the only thing blocking this is the validation code. The join actually succeeds if we remove the validation.

@craigtaverner craigtaverner added >test Issues or PRs that are addressing/adding tests Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL labels Apr 2, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is already looking great!

Before we merge, I wonder if we could simplify the test setup and make the test class a tad easier to follow. Although this is already a great value-add and is safe to merge.

I agree that as a next step, we should expand the number of different types covered.

* <dt>index_double_float</dt>
* <dd>Index containing a single document with a field of type 'float' like: <pre>
* {
* "field_double": 1.0, // this is mapped as type 'float' (a float with the name of the main index field)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may cut down on the number of created indices (maybe speeding up the test a little) and make this simpler to debug if we always name the fields by the type. We can just perform a RENAME in the query. Prefixing the field names with main_field or lookup_field would also guarantee that we don't accidentally shadow while renaming, and it will make the javadoc more self explanatory.

In fact, the test index creation could be simplified to 2 indices: one main index with every type we currently consider, and one lookup index with every type we currently consider. This means index creation can be done entirely during the setup of the test suite, rather than in each test.

This concern may be more important later, though, when we generalize the join condition so that multiple fields (a composite key) can be used. Then we probably also want to expand this test to use different combinations of types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the idea, but think we can re-visit this later.

* For example, if we are testing the pairs (double, double), (double, float), (float, double) and (float, float),
* we will create the following indexes:
* <dl>
* <dt>index_double_double</dt>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it's unclear whether this is the main or the lookup index when reading this javadoc.

{
TestConfigs configs = testConfigurations.computeIfAbsent("strings", TestConfigs::new);
configs.addPasses(KEYWORD, KEYWORD);
configs.addPasses(TEXT, KEYWORD);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have a TEXT field without a .keyword subfield? I think we fail in this situation. Could you maybe double check and add a test if that's possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always pass with TEXT on the left because we will read from source for the main index if no KEYWORD subfield exists. Likewise we always fail with TEXT on the right, even if a KEYWORD subfield exists because we have not coded a special case for that (yet).

Comment on lines 345 to 347
private void addEmptyResult(DataType mainType, DataType lookupType) {
add(new TestConfigPasses(mainType, lookupType, false));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears unused?

Similarly, the boolean arg in the TestConfigPasses could always be true, then, no?

Copy link
Contributor Author

@craigtaverner craigtaverner May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I was using this to assert on known failing cases (lookup does not match, but no errors are thrown), but I have since taken a different direction, adding these cases to the type validation so we get an error instead of an empty result. I'll delete this method.

Comment on lines 382 to 384
private <E extends Exception> void addFails(DataType mainType, DataType lookupType, Class<E> exception, Consumer<E> assertion) {
add(new TestConfigFails<>(mainType, lookupType, exception, assertion));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was meant to be used in the other addFails methods above? Currently, I think it is unused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, this method adds very close to zero value, so instead of using it, I will delete it.

TestConfigs configs = testConfigurations.computeIfAbsent("mixed-numerical", TestConfigs::new);
for (DataType mainType : integerTypes) {
for (DataType lookupType : floatTypes) {
// TODO: We should probably allow this, but we need to change the validation code in Join.java
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++, the behavior should be exactly as if we evaluated an ==, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a followup issue to investigate this and try to support it: #127806

Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
The setup is thorough, but I'd maybe echo Alex's note on complexity: setting up indices does take time and while this test itself takes under 10s (w/o gradle setup), it could probably be sped-up with just two indices and renames (which would be skipped on same type in both indices, so the RENAME itself should not introduce a blind corner).
This complexity surfaces a bit also in the need to test the test (validateIndex()).

Collection<TestConfigs> existing = testConfigurations.values();
TestConfigs configs = testConfigurations.computeIfAbsent("same", TestConfigs::new);
for (DataType type : all) {
if (existingIndex(existing, type, type)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supernit: test against false and skip the continue for slightly better legibility.
Here and below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@costin
Copy link
Member

costin commented Apr 23, 2025

Let's get this PR merged in since it's been sitting here for a while - and move the nanos and potentially speed-up (if it can't be addressed shortly) in a follow-up.

@craigtaverner craigtaverner added the auto-backport Automatically create backport pull requests when merged label May 7, 2025
@craigtaverner
Copy link
Contributor Author

Since this PR adds additional validation errors for unsupported types, instead of silently failing (getting no matches in the join), I've viewing that as a kind-of bug-fix worth backporting to 9.0.

@craigtaverner craigtaverner merged commit afc53a3 into elastic:main May 7, 2025
17 checks passed
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
9.0

craigtaverner added a commit to craigtaverner/elasticsearch that referenced this pull request May 7, 2025
…stic#126150)

This test suite tests the lookup join functionality in ESQL with various data types.

For each pair of types being tested, it builds a main index called "index" containing a single document with as many fields as types being tested on the left of the pair, and then creates that many other lookup indexes, each with a single document containing exactly two fields: the field to join on, and a field to return.

The assertion is that for valid combinations, the return result should exist, and for invalid combinations an exception should be thrown. If no exception is thrown, and no result is returned, our validation rules are not aligned with the internal behaviour (ie. a bug).

Since the `LOOKUP JOIN` command requires the match field name to be the same between the main index and the lookup index, we will have field names that correctly represent the type of the field in the main index, but not the type of the field in the lookup index. This can be confusing, but it is important to remember that the field names are not the same as the types.
@craigtaverner
Copy link
Contributor Author

@bpintea and @alex-spies, you both suggested refining the index setup, and so I have created an issue recommending just that, at #127819

elasticsearchmachine pushed a commit that referenced this pull request May 7, 2025
#126150) (#127818)

* Integration tests for LOOKUP JOIN over wider range of data types (#126150)

This test suite tests the lookup join functionality in ESQL with various data types.

For each pair of types being tested, it builds a main index called "index" containing a single document with as many fields as types being tested on the left of the pair, and then creates that many other lookup indexes, each with a single document containing exactly two fields: the field to join on, and a field to return.

The assertion is that for valid combinations, the return result should exist, and for invalid combinations an exception should be thrown. If no exception is thrown, and no result is returned, our validation rules are not aligned with the internal behaviour (ie. a bug).

Since the `LOOKUP JOIN` command requires the match field name to be the same between the main index and the lookup index, we will have field names that correctly represent the type of the field in the main index, but not the type of the field in the lookup index. This can be confusing, but it is important to remember that the field names are not the same as the types.

* SEMANTIC_TEXT Still exists on 9.0 as a zombie type
ywangd pushed a commit to ywangd/elasticsearch that referenced this pull request May 9, 2025
…stic#126150)

This test suite tests the lookup join functionality in ESQL with various data types.

For each pair of types being tested, it builds a main index called "index" containing a single document with as many fields as types being tested on the left of the pair, and then creates that many other lookup indexes, each with a single document containing exactly two fields: the field to join on, and a field to return.

The assertion is that for valid combinations, the return result should exist, and for invalid combinations an exception should be thrown. If no exception is thrown, and no result is returned, our validation rules are not aligned with the internal behaviour (ie. a bug).

Since the `LOOKUP JOIN` command requires the match field name to be the same between the main index and the lookup index, we will have field names that correctly represent the type of the field in the main index, but not the type of the field in the lookup index. This can be confusing, but it is important to remember that the field names are not the same as the types.
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request May 9, 2025
…stic#126150)

This test suite tests the lookup join functionality in ESQL with various data types.

For each pair of types being tested, it builds a main index called "index" containing a single document with as many fields as types being tested on the left of the pair, and then creates that many other lookup indexes, each with a single document containing exactly two fields: the field to join on, and a field to return.

The assertion is that for valid combinations, the return result should exist, and for invalid combinations an exception should be thrown. If no exception is thrown, and no result is returned, our validation rules are not aligned with the internal behaviour (ie. a bug).

Since the `LOOKUP JOIN` command requires the match field name to be the same between the main index and the lookup index, we will have field names that correctly represent the type of the field in the main index, but not the type of the field in the lookup index. This can be confusing, but it is important to remember that the field names are not the same as the types.
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request May 12, 2025
…stic#126150)

This test suite tests the lookup join functionality in ESQL with various data types.

For each pair of types being tested, it builds a main index called "index" containing a single document with as many fields as types being tested on the left of the pair, and then creates that many other lookup indexes, each with a single document containing exactly two fields: the field to join on, and a field to return.

The assertion is that for valid combinations, the return result should exist, and for invalid combinations an exception should be thrown. If no exception is thrown, and no result is returned, our validation rules are not aligned with the internal behaviour (ie. a bug).

Since the `LOOKUP JOIN` command requires the match field name to be the same between the main index and the lookup index, we will have field names that correctly represent the type of the field in the main index, but not the type of the field in the lookup index. This can be confusing, but it is important to remember that the field names are not the same as the types.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) >test Issues or PRs that are addressing/adding tests v9.0.2 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants