Skip to content

Conversation

ahkcs
Copy link
Contributor

@ahkcs ahkcs commented Oct 8, 2025

Description
Enhances the replace() function to support regular expressions, enabling advanced text manipulation in PPL queries.


Examples

Before (literal only):

source=logs | eval cleaned = replace(message, "ERROR", "WARN")

After (full regex):

# Remove digits
source=logs | eval cleaned = replace(message, '\d+', '')

# Swap date format (MM/DD/YYYY → DD/MM/YYYY)
source=logs | eval reformatted = replace(date, '^(\d{2})/(\d{2})/(\d{4})$', '\2/\1/\3')

# Extract domain from email
source=users | eval domain = replace(email, '.*@(.+)', '\1')

Copy link
Collaborator

@RyanL1997 RyanL1997 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ahkcs , thanks for the change. I just left a comment of re-using existing operator of regex replacement and you can take a look. If there is any questions, feel free to ask me.

Copy link
Collaborator

@RyanL1997 RyanL1997 Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of creating the new UDF we should re-use the existing supported sql opperator. You can reference to my previous implementation at sed in rex that I was doing the same thing:

For more details you can also reference to: #4109 (comment)

REX_SED - Complete native Calcite optimization
REGEXP_REPLACE_PG_3 for basic substitution
REGEXP_REPLACE_PG_4 for flagged substitution (g, i)
REGEXP_REPLACE_5 for nth occurrence
TRANSLATE3 for transliteration (y/from/to/)

As you can see there are various of options that available for regex replace and I think you can choose the one you need.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More specifically, you may need to use this:

BuiltinFunctionName.INTERNAL_REGEXP_REPLACE_3,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @RyanL1997 , thanks for the suggestion, I have updated the implementation to remove the custom UDF and leverage the existing sql operator

@RyanL1997 RyanL1997 added PPL Piped processing language calcite calcite migration releated labels Oct 8, 2025
@ahkcs ahkcs requested a review from RyanL1997 October 8, 2025 21:15
@RyanL1997
Copy link
Collaborator

RyanL1997 commented Oct 9, 2025

@ahkcs , also I was wondering for the usage of the regex pattern in this function, will it support captured name group? If that is the case, there are some known limitation of java regex library. see details at #4434 (comment), and lets see if we need to call this out in the related docs.

@ahkcs
Copy link
Contributor Author

ahkcs commented Oct 9, 2025

@ahkcs , also I was wondering for the usage of the regex pattern in this function, will it support captured name group? If that is the case, there are some known limitation of java regex library. see details at #4434 (comment), and lets say if we need to call this out in the related docs.

@RyanL1997 , thanks for calling it out, named groups like (?<domain_name>.+) are used by the rex command, which does have the Java regex limitation you mentioned (no underscores allowed in group names). The replace() function avoids this limitation entirely by using numbered references.

For example:

  • Rex fails: rex field=email ".+@(?<domain_name>.+)" (underscore not allowed)
  • Replace works: eval domain_name = replace(email, '.+@(.+)', '\1') (field name defined in eval, not in regex)

So the Java named group limitations in #4434 don't apply to replace() function. The field naming is controlled by the eval command, bypassing Java identifier restrictions.

@RyanL1997
Copy link
Collaborator

@ahkcs , also I was wondering for the usage of the regex pattern in this function, will it support captured name group? If that is the case, there are some known limitation of java regex library. see details at #4434 (comment), and lets say if we need to call this out in the related docs.

@RyanL1997 , thanks for calling it out, named groups like (?<domain_name>.+) are used by the rex command, which does have the Java regex limitation you mentioned (no underscores allowed in group names). The replace() function avoids this limitation entirely by using numbered references.

For example:

  • Rex fails: rex field=email ".+@(?<domain_name>.+)" (underscore not allowed)
  • Replace works: eval domain_name = replace(email, '.+@(.+)', '\1') (field name defined in eval, not in regex)

So the Java named group limitations in #4434 don't apply to replace() function. The field naming is controlled by the eval command, bypassing Java identifier restrictions.

Make sense, since in the case of this PR, the extraction should be done by eval so it won't triggered the java regex pattern limitation for named capture group.

Copy link
Collaborator

@RyanL1997 RyanL1997 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ahkcs , thanks for the update of the implementation and I just left some comments.

RexLiteral literal = (RexLiteral) replacementArg;
String replacement = literal.getValueAs(String.class);
if (replacement != null) {
// Convert sed backreferences (\1, \2) to Java style ($1, $2)
Copy link
Collaborator

@RyanL1997 RyanL1997 Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because the requirement requires java style? What is the expected behavior over here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we use java regex internally but we want to support PCRE style syntax, so we add this conversion logic to translate PCRE syntax to java regex

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: I remember I do have the same logic for sed, maybe we can extract them into a certain util.

@RyanL1997 RyanL1997 added the enhancement New feature or request label Oct 9, 2025
@ahkcs ahkcs requested a review from RyanL1997 October 9, 2025 16:53
RyanL1997
RyanL1997 previously approved these changes Oct 9, 2025
ahkcs added 6 commits October 10, 2025 14:18
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the feat/eval_replace branch from 14b1c1a to 7024259 Compare October 10, 2025 21:54
@ahkcs ahkcs requested a review from ykmr1224 October 10, 2025 22:19
ahkcs added 2 commits October 10, 2025 15:51
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>
@RyanL1997 RyanL1997 merged commit a8f08ad into opensearch-project:main Oct 13, 2025
33 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.19-dev failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-4456-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 a8f08ad3e101e05aefecca4765d484d1992a12b3
# Push it to GitHub
git push --set-upstream origin backport/backport-4456-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-dev

Then, create a pull request where the base branch is 2.19-dev and the compare/head branch is backport/backport-4456-to-2.19-dev.

ahkcs added a commit to ahkcs/sql that referenced this pull request Oct 16, 2025
ahkcs added a commit to ahkcs/sql that referenced this pull request Oct 16, 2025
(cherry picked from commit a8f08ad)
Signed-off-by: Kai Huang <ahkcs@amazon.com>
@LantaoJin LantaoJin added the backport-manually Filed a PR to backport manually. label Oct 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport 2.19-dev backport-failed backport-manually Filed a PR to backport manually. calcite calcite migration releated enhancement New feature or request PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants