Skip to content

Conversation

@yuancu
Copy link
Collaborator

@yuancu yuancu commented Oct 21, 2025

Description

Previously, externally registered OpenSearch UDFs can not be serialized as they are not registered in RelJsonSerializer. This PR collects these UDFs in a SqlOperatorTable and registers it in RelJsonSerializer.

Blocker

Blocked by #2813 (or potentially other issues that restrict groupping by struct fields)

UDF is serialized, but grouping by a generated struct seems to be problematic after pushdown.

  • source=weblogs | where host='1.2.3.4' | eval info = geoip('my-datasource', host) | stats count() by info:

    {
      "error": {
        "reason": "There was internal problem at backend",
        "details": "java.sql.SQLException: exception while executing query: class java.lang.String cannot be cast to class java.util.Map (java.lang.String and java.util.Map are in module java.base of loader 'bootstrap')",
        "type": "RuntimeException"
      },
      "status": 500
    }

    In this case, the result map is converted to a string when used as a group key.

    Directly running the DSL with the script gives the bucket key as string

    Query:

    {"from":0,"size":0,"timeout":"1m","query":{"term":{"host":{"value":"1.2.3.4","boost":1.0}}},"_source":{"includes":["host"],"excludes":[]},"aggregations":{"composite_buckets":{"composite":{"size":10000,"sources":[{"info":{"terms":{"script":{"source":"{\"langType\":\"calcite\",\"script\":\"rO0ABXNyABFqYXZhLnV0aWwuQ29sbFNlcleOq7Y6G6gRAwABSQADdGFneHAAAAADdwQAAAAGdAAHcm93VHlwZXQAknsKICAiZmllbGRzIjogWwogICAgewogICAgICAidWR0IjogIkVYUFJfSVAiLAogICAgICAidHlwZSI6ICJPVEhFUiIsCiAgICAgICJudWxsYWJsZSI6IHRydWUsCiAgICAgICJuYW1lIjogImhvc3QiCiAgICB9CiAgXSwKICAibnVsbGFibGUiOiBmYWxzZQp9dAAEZXhwcnQCw3sKICAib3AiOiB7CiAgICAibmFtZSI6ICJHRU9JUCIsCiAgICAia2luZCI6ICJPVEhFUl9GVU5DVElPTiIsCiAgICAic3ludGF4IjogIkZVTkNUSU9OIgogIH0sCiAgIm9wZXJhbmRzIjogWwogICAgewogICAgICAibGl0ZXJhbCI6ICJteS1kYXRhc291cmNlIiwKICAgICAgInR5cGUiOiB7CiAgICAgICAgInR5cGUiOiAiVkFSQ0hBUiIsCiAgICAgICAgIm51bGxhYmxlIjogZmFsc2UsCiAgICAgICAgInByZWNpc2lvbiI6IC0xCiAgICAgIH0KICAgIH0sCiAgICB7CiAgICAgICJpbnB1dCI6IDAsCiAgICAgICJuYW1lIjogIiQwIgogICAgfQogIF0sCiAgImNsYXNzIjogIm9yZy5vcGVuc2VhcmNoLnNxbC5leHByZXNzaW9uLmZ1bmN0aW9uLlVzZXJEZWZpbmVkRnVuY3Rpb25CdWlsZGVyJDEiLAogICJ0eXBlIjogewogICAgInR5cGUiOiAiTUFQIiwKICAgICJudWxsYWJsZSI6IGZhbHNlLAogICAgImtleSI6IHsKICAgICAgInR5cGUiOiAiVkFSQ0hBUiIsCiAgICAgICJudWxsYWJsZSI6IGZhbHNlLAogICAgICAicHJlY2lzaW9uIjogLTEKICAgIH0sCiAgICAidmFsdWUiOiB7CiAgICAgICJ0eXBlIjogIkFOWSIsCiAgICAgICJudWxsYWJsZSI6IGZhbHNlLAogICAgICAicHJlY2lzaW9uIjogLTEsCiAgICAgICJzY2FsZSI6IC0yMTQ3NDgzNjQ4CiAgICB9CiAgfSwKICAiZGV0ZXJtaW5pc3RpYyI6IHRydWUsCiAgImR5bmFtaWMiOiBmYWxzZQp9dAAKZmllbGRUeXBlc3NyABFqYXZhLnV0aWwuSGFzaE1hcAUH2sHDFmDRAwACRgAKbG9hZEZhY3RvckkACXRocmVzaG9sZHhwP0AAAAAAAAx3CAAAABAAAAABdAAEaG9zdH5yAClvcmcub3BlbnNlYXJjaC5zcWwuZGF0YS50eXBlLkV4cHJDb3JlVHlwZQAAAAAAAAAAEgAAeHIADmphdmEubGFuZy5FbnVtAAAAAAAAAAASAAB4cHQAAklQeHg=\"}","lang":"opensearch_compounded_script","params":{"utcTimestamp":1761646601217183000}},"missing_bucket":true,"missing_order":"first","order":"asc"}}}]}}}}

    Result:

    {
      "took": 137,
      "timed_out": false,
      "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": {
          "value": 1,
          "relation": "eq"
        },
        "max_score": null,
        "hits": []
      },
      "aggregations": {
        "composite_buckets": {
          "after_key": {
            "info": "{continent_name=Oceania, country_iso_code=AU, country_name=Australia, location=-33.4940,143.2104, time_zone=Australia/Sydney}"
          },
          "buckets": [
            {
              "key": {
                "info": "{continent_name=Oceania, country_iso_code=AU, country_name=Australia, location=-33.4940,143.2104, time_zone=Australia/Sydney}"
              },
              "doc_count": 1
            }
          ]
        }
      }
    }
For reference, the result of `source=weblogs | where host='1.2.3.4' | eval info = geoip('my-datasource', host)` is:
{
  "schema": [
    {
      "name": "host",
      "type": "ip"
    },
    {
      "name": "method",
      "type": "string"
    },
    {
      "name": "bytes",
      "type": "string"
    },
    {
      "name": "response",
      "type": "string"
    },
    {
      "name": "url",
      "type": "string"
    },
    {
      "name": "info",
      "type": "struct"
    }
  ],
  "datarows": [
    [
      "1.2.3.4",
      "GET",
      "1234",
      "200",
      "/history/voyager1/",
      {
        "continent_name": "Oceania",
        "country_iso_code": "AU",
        "country_name": "Australia",
        "location": "-33.4940,143.2104",
        "time_zone": "Australia/Sydney"
      }
    ]
  ],
  "total": 1,
  "size": 1
}

Workaround

Instead of directly groupping by the struct generated by geoip, I created tests that group results by sub-fields of result struct of geoip function. E.g. if the result object info {"str": "a string", "num": 1}, I group results by info.str.

Related Issues

Resolves #4478

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@yuancu yuancu added the bug Something isn't working label Oct 22, 2025
Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
…r types are UNDEFINED

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
@yuancu yuancu changed the title WIP: Support serializing external OpenSearch UDFs at pushdown time Support serializing external OpenSearch UDFs at pushdown time Oct 28, 2025
@yuancu yuancu marked this pull request as ready for review October 28, 2025 10:06
@LantaoJin LantaoJin merged commit a003e8c into opensearch-project:main Nov 3, 2025
64 of 65 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.19-dev failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-4618-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 a003e8c7399b62ebc3a57c07bb98f11d843ceb82
# Push it to GitHub
git push --set-upstream origin backport/backport-4618-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-dev

Then, create a pull request where the base branch is 2.19-dev and the compare/head branch is backport/backport-4618-to-2.19-dev.

@yuancu yuancu deleted the issues/4478 branch November 4, 2025 02:05
yuancu added a commit to yuancu/sql-plugin that referenced this pull request Nov 4, 2025
…arch-project#4618)

* Supports serilizing external OpenSearch UDFs

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Correct subfield access logical when calling ITEM

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Resolve types of generated structs based on their values because their types are UNDEFINED

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Add explain and integration tests for geoip

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

---------

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
(cherry picked from commit a003e8c)
yuancu added a commit that referenced this pull request Nov 4, 2025
…ushdown time (#4618) (#4725)

* Support serializing external OpenSearch UDFs at pushdown time (#4618)

* Supports serilizing external OpenSearch UDFs

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Correct subfield access logical when calling ITEM

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Resolve types of generated structs based on their values because their types are UNDEFINED

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Add explain and integration tests for geoip

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

---------

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
(cherry picked from commit a003e8c)

* Update plan for testGeoIpPushedInAgg

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Mask position of fields to fix an undeterministic position of a field in the result plan for java21

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

---------

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
@yuancu yuancu added the backport-manually Filed a PR to backport manually. label Nov 4, 2025
expani pushed a commit to vinaykpud/sql that referenced this pull request Nov 4, 2025
…arch-project#4618)

* Supports serilizing external OpenSearch UDFs

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Correct subfield access logical when calling ITEM

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Resolve types of generated structs based on their values because their types are UNDEFINED

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Add explain and integration tests for geoip

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

---------

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport 2.19-dev backport-failed backport-manually Filed a PR to backport manually. bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] geoip() function fails with deserialization error when used in aggregation grouping context

3 participants