Skip to content

Support nested aggregation when calcite enabled #3696

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

LantaoJin
Copy link
Member

@LantaoJin LantaoJin commented May 29, 2025

Description

Support nested aggregation in PPL only when calcite enabled.

With this PR, follow PPL query is able to execute a nested aggregation query.

source=logs | where response = 200 | stats min(pages.load_time)

And it equals the DSL

GET logs/_search
{
  "query": {
    "match": { "response": "200" }
  },
  "aggs": {
    "pages": {
      "nested": {
        "path": "pages"
      },
      "aggs": {
        "min_load_time": { "min": { "field": "pages.load_time" } }
      }
    }
  }
}

Deprecated implementation for SQL: #2814

Limitation:

  • PPL only
  • Calcite should be enabled
  • Aggregate pushdown happens

Related Issues

PR partially resolves #2813 and #2529, and #2739

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Lantao Jin <ltjin@amazon.com>
@LantaoJin
Copy link
Member Author

CI failure is not related to this PR. It caused by opensearch-project/OpenSearch#18124

Copy link
Collaborator

@dai-chen dai-chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ: I recall we didn't support nested field type in V2 engine, right? Did we add it to Calcite or this is only for pushdown? Ref: #52

@LantaoJin
Copy link
Member Author

LantaoJin commented May 30, 2025

QQ: I recall we didn't support nested field type in V2 engine, right? Did we add it to Calcite or this is only for pushdown? Ref: #52

Nested field maps to ARRAY type in OpenSearch Core (you can get it by creating an index with nested field and checking its mapping via DSL), so IMO v2 supports nested field with the ExprCoreType.ARRAY type.

I believe we also can support nested aggregation in v2 but the fixing will be complex by my trying. Now I have no idea how to pass the fieldTypes map to AggregatorBuilder logic. We need it to identify whether or not the root in aggregation NamedExpression root.subfield is a ARRAY type.

For pushdown disabled, I think we need additional effort to correct the aggregation behavior. So I add a TODO in code. Or find a way to throw an exception.

@qianheng-aws
Copy link
Collaborator

The value of nested field in PPL is incorrect.

Seems we will only show the first element of that nested fields while it's actually an inner field of an array of object. That's why it will calculate wrong results if no push down.

// For array types only first index currently supported.

DSL:

POST /test1/_search
{
  "_source":{"includes":["address.area"]}
}

{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "test1",
        "_id": "u6OQIJcBpwe8-qLMf_ZG",
        "_score": 1.0,
        "_source": {
          "address": [
            {
              "area": 300.13
            },
            {
              "area": 400.99
            },
            {
              "area": 127.4
            },
            {
              "area": 10.24
            }
          ]
        }
      }
    ]
  }
}

PPL:

POST /_plugins/_ppl
source = test1 | fields address.area

{
  "schema": [
    {
      "name": "address.area",
      "type": "double"
    }
  ],
  "datarows": [
    [
      300.13
    ]
  ],
  "total": 1,
  "size": 1
}

@LantaoJin
Copy link
Member Author

LantaoJin commented Jun 4, 2025

Seems we will only show the first element of that nested fields while it's actually an inner field of an array of object. That's why it will calculate wrong results if no push down.

// For array types only first index currently supported.

Thanks, please let me know why we just return first element? Let me try to change this logic first.

@LantaoJin
Copy link
Member Author

LantaoJin commented Jun 4, 2025

Seems we will only show the first element of that nested fields while it's actually an inner field of an array of object. That's why it will calculate wrong results if no push down.

// For array types only first index currently supported.

Thanks, please let me know why we just return first element? Let me try to change this logic first.

Perhaps I know the reason now. I update above code to

wholePathValue = new ExprCollectionValue(value.collectionValue().stream().map(e -> e.keyValue(paths.getFirst())).toList());

but it doesn't work since we flatten the schema (mapping):

var nextPrefix =
prefix.isEmpty() ? entryKey : String.format("%s.%s", prefix, entryKey);
result.put(nextPrefix, entry.getValue().cloneEmpty());

For example address.area is mapping to ExprCoreType.Double and we don't have an Array[Double] type in ExprCoreType. To support reading a sub-field, we need support Array type of ExprCoreType.

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Support nested aggregation
3 participants