Skip to content

Conversation

@RyanL1997
Copy link
Collaborator

@RyanL1997 RyanL1997 commented Oct 3, 2025

Description

[Enhancement] Error handling for underscore usage in rex regex

  • Better handling for invalid named capture group
  • Updated the doc
  • Add ITs

Related Issue

Behavior after enhancement

# rex command
curl -X POST "localhost:9200/_plugins/_ppl" \
    -H "Content-Type: application/json" \
    -d '{
      "query": "source=accounts | rex field=email \".+@(?<domain_name>.+)\" | fields email, domain_name"
}'

{
  "error": {
    "reason": "Invalid Query",
    "details": "Invalid capture group name 'domain_name'. Java regex group names must start with a letter and contain only letters and digits.",
    "type": "IllegalArgumentException"
  },
  "status": 400
}


# parse command
curl -X POST "localhost:9200/_plugins/_ppl" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "source=accounts | parse email \".+@(?<domain-name>.+)\" | fields email, domain-name | head 5"
}'

{
  "error": {
    "reason": "Invalid Query",
    "details": "Invalid capture group name 'domain-name'. Java regex group names must start with a letter and contain only letters and digits.",
    "type": "IllegalArgumentException"
  },
  "status": 400
}% 

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…x pattern

Co-authored-by: Simeon Widdis <sawiddis@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@RyanL1997
Copy link
Collaborator Author

I think same as issue #3944
If yes, add a yamlRestTest to verify the issued fixed.

Hi @penghuo, good callout. I was actually taking a look for the - and ended up that is another limitation from java regex library. So rechecked and in summary:

Java Regex Named Group Character Rules

According to Java's Pattern class documentation, named capture groups follow these rules:

Valid Characters:

  • Must start with: Letter (a-z, A-Z)
  • Can contain: Letters (a-z, A-Z) and digits (0-9)

Invalid Characters (will cause PatternSyntaxException):

  • Underscore (_) - This is what we're handling with our validation
  • Hyphen/Dash (-) - This is what you tested with curl
  • Period/Dot (.)
  • Space ( )
  • Special characters (@, #, $, %, ^, &, *, etc.)
  • Unicode characters beyond ASCII
  • Cannot start with digit (0-9)

Java Documentation:

The official Java documentation for this is in:

  • Java Pattern class: java.util.regex.Pattern
  • Specific section: "Named-capturing group" in the Pattern syntax documentation
  • Rule: Named groups use the syntax (?X) where name must be a valid Java identifier but with more restrictions

The exact Java rule is:

  // From Java's Pattern class source code:
  // Named group name must match: [a-zA-Z][a-zA-Z0-9]*

Quick Test Examples:

# These will FAIL in Java regex:
(?<domain_name>...)  # underscore
(?<domain-name>...)  # hyphen  
(?<domain.name>...)  # dot
(?<domain name>...)  # space
(?<@domain>...)      # special char
(?<1domain>...)      # starts with digit

# These will WORK in Java regex:
(?<domain>...)       # letters only
(?<domainName>...)   # camelCase
(?<domain1>...)      # letters + numbers
(?<DOMAIN>...)       # uppercase

@RyanL1997 RyanL1997 changed the title [Enhancement] Error handling for underscore usage in rex regex [Enhancement] Error handling for illegal character usage in java regex named capture group Oct 9, 2025
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@RyanL1997
Copy link
Collaborator Author

RyanL1997 commented Oct 9, 2025

According to the above, since the limitation is coming from the upstream library we are using, I just unified the error msg to handle all the illegal character cases in extraction commands like rex and parse and for the above parse related issue I also left a comment of work around by using rename command: #3944 (comment).

cc @penghuo

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
penghuo
penghuo previously approved these changes Oct 9, 2025
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@ykmr1224
Copy link
Collaborator

Technically we should be able to support the invalid characters by rewriting regex and map extracted values back to original name like:

  1. original pattern: .+@(?<domain_name>.+)
  2. rewritten pattern: .+@(?<domainname>.+) (maintain mapping domainname -> domain_name internally)
  3. assign extracted value: domain_name = extracted['domainname']

This could cause name overwrap, but we can workaround by adding suffix, etc. to come up with unique name.

@RyanL1997
Copy link
Collaborator Author

RyanL1997 commented Oct 13, 2025

Hi @ykmr1224 , thanks for the input. And that was an interesting approach and I was actually did a PoC of your suggestion on my local. However, I do find some limitations of this approach:

  • If the renaming of the pattern is just removing the unsupported character, what if the renamed pattern after removal is duplicate with a existing field? (e.g. the index field contains a username, user would like to extract the alias of the email (xyz@email.com -> xyz) as a column called user_name and in the above logic it will be renamed as username which is conflict with the index field.
  • Stack traces and error messages refer to rewritten names (e.g. in explain) , which I think will cause the confusions for users.
  • I haven't try it on any of the benchmarking dataset yet for the re-writing approach, but from the time complexity wise
    • Current approach: O(1) per validation
    • Rewriting approach: O(k) where k=number of named groups (for mapping tables as we are re-writing the column names)

According to the above, I think we can keep as the current limitation handling approach for now, if there are some feature request coming from the users we can think about actually supporting these characters as a new feature requests. But if that day comes, I truly think the correct approach is to find a PCRE style library for regex patterns.

@ykmr1224
Copy link
Collaborator

@RyanL1997
Thanks for PoCing.
In case of name overwrap, we can add suffix like username1, and increment in case it still overwrap. (it will eventually find unique name)
i.e: (?<user_name>.+)(?<username>.+)(?<username1>.+) => (?<username2>.+)(?<username>(?<username1>.+), mapping = {username2 => user_name, username => username, username1 => username1}

Computation cost for regex rewrite should be ignorable since this happens only once during query analysis. And rename after extraction should be just an alias or projection (should not affect performance in my understanding).

It can be separate task from this PR, btw.
And I think we want to quickly check the possible usage of this use case to see the priority.

@RyanL1997
Copy link
Collaborator Author

It can be separate task from this PR, btw.
And I think we want to quickly check the possible usage of this use case to see the priority.

Transferring some internal communication over here - since the above issue is not blocking this change. This change is ready to be merged and I have created a issue #4549 to discover the feasibility of the above suggestion.

@RyanL1997 RyanL1997 merged commit 0b7e86c into opensearch-project:main Oct 14, 2025
34 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 14, 2025
…x named capture group (#4434)

Co-authored-by: Simeon Widdis <sawiddis@amazon.com>
(cherry picked from commit 0b7e86c)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
LantaoJin added a commit that referenced this pull request Oct 15, 2025
…x named capture group (#4434) (#4555)

(cherry picked from commit 0b7e86c)

Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Simeon Widdis <sawiddis@amazon.com>
Co-authored-by: Lantao Jin <ltjin@amazon.com>
ykmr1224 added a commit to ykmr1224/sql that referenced this pull request Oct 15, 2025
commit cba8d02
Author: Tomoyuki MORITA <moritato@amazon.com>
Date:   Wed Oct 15 13:08:05 2025 -0700

    Add MAP_APPEND internal function to Calcite PPL (opensearch-project#4515)

    * Add MAP_APPEND internal function to Calcite PPL

    Signed-off-by: Tomoyuki Morita <moritato@amazon.com>

    * Minor fix

    Signed-off-by: Tomoyuki Morita <moritato@amazon.com>

    * Address comment

    Signed-off-by: Tomoyuki Morita <moritato@amazon.com>

    * Rebase and fix IT issue

    Signed-off-by: Tomoyuki Morita <moritato@amazon.com>

    ---------

    Signed-off-by: Tomoyuki Morita <moritato@amazon.com>

commit 3388dc7
Author: Lantao Jin <ltjin@amazon.com>
Date:   Thu Oct 16 01:45:29 2025 +0800

    Use `_doc` + `_shard_doc` as sort tiebreaker to get better performance (opensearch-project#4569)

    * Use _shard_doc as sort tiebreaker

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    * _doc as a part of tie-breaker have better performance

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    ---------

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

commit 5630119
Author: qianheng <qianheng@amazon.com>
Date:   Wed Oct 15 16:40:41 2025 +0800

    Fix sort push down into agg after project already pushed (opensearch-project#4546)

    * Fix sort push down into agg

    Signed-off-by: Heng Qian <qianheng@amazon.com>

    * Change some json files to yaml format

    Signed-off-by: Heng Qian <qianheng@amazon.com>

    ---------

    Signed-off-by: Heng Qian <qianheng@amazon.com>

commit 1e62fba
Author: Tomoyuki MORITA <moritato@amazon.com>
Date:   Tue Oct 14 17:20:38 2025 -0700

    Fix JsonExtractAllFunctionIT failure (opensearch-project#4556)

    Signed-off-by: Tomoyuki Morita <moritato@amazon.com>

commit 02ee33e
Author: Kai Huang <105710027+ahkcs@users.noreply.github.com>
Date:   Tue Oct 14 14:28:53 2025 -0700

    Add more examples to the `where` command doc (opensearch-project#4457)

    Co-authored-by: Manasvini B S <manasvis@amazon.com>

commit 0b7e86c
Author: Jialiang Liang <jiallian@amazon.com>
Date:   Tue Oct 14 10:46:01 2025 -0700

    [Enhancement] Error handling for illegal character usage in java regex named capture group (opensearch-project#4434)

    Co-authored-by: Simeon Widdis <sawiddis@amazon.com>

commit 9c97cfb
Author: Tomoyuki MORITA <moritato@amazon.com>
Date:   Tue Oct 14 08:36:43 2025 -0700

    Add JSON_EXTRACT_ALL internal function for Calcite PPL (opensearch-project#4489)

    * Add JSON_EXTRACT_ALL internal function for Calcite PPL

    Signed-off-by: Tomoyuki Morita <moritato@amazon.com>

    * Address comments

    Signed-off-by: Tomoyuki Morita <moritato@amazon.com>

    * Minor fix

    Signed-off-by: Tomoyuki Morita <moritato@amazon.com>

    ---------

    Signed-off-by: Tomoyuki Morita <moritato@amazon.com>

commit 89dbc31
Author: Lantao Jin <ltjin@amazon.com>
Date:   Tue Oct 14 18:24:52 2025 +0800

    Check server status before starting Prometheus (opensearch-project#4537)

    * Check server status before starting Prometheus

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    * Change to func call

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    * Fix doc

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    ---------

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

commit fe62472
Author: Lantao Jin <ltjin@amazon.com>
Date:   Tue Oct 14 18:10:27 2025 +0800

    Update request builder after pushdown sort into agg buckets (opensearch-project#4541)

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

commit 42a415f
Author: qianheng <qianheng@amazon.com>
Date:   Tue Oct 14 17:42:45 2025 +0800

    Including metadata fields type when doing agg/filter script push down (opensearch-project#4522)

    * Including metadata fields type when doing agg/filter script push down

    Signed-off-by: Heng Qian <qianheng@amazon.com>

    * Fix IT

    Signed-off-by: Heng Qian <qianheng@amazon.com>

    ---------

    Signed-off-by: Heng Qian <qianheng@amazon.com>

commit 8de0386
Author: Xinyuan Lu <xinyual@amazon.com>
Date:   Tue Oct 14 16:41:08 2025 +0800

    Fix percentile bug (opensearch-project#4539)

    * fix percentile bug

    Signed-off-by: xinyual <xinyual@amazon.com>

    * add IT

    Signed-off-by: xinyual <xinyual@amazon.com>

    * optimize it

    Signed-off-by: xinyual <xinyual@amazon.com>

    ---------

    Signed-off-by: xinyual <xinyual@amazon.com>

commit de2fdc8
Author: Lantao Jin <ltjin@amazon.com>
Date:   Tue Oct 14 12:29:03 2025 +0800

    [FollowUp] Set 0 and negative value of subsearch.maxout as unlimited (opensearch-project#4534)

    * [FollowUp] Set 0 and negative value of subsearch.maxout as unlimited

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    * fix doctest

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    * Fix conflicts

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    ---------

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

commit 977b7ab
Author: Simeon Widdis <sawiddis@gmail.com>
Date:   Mon Oct 13 20:23:10 2025 -0700

    Update stalled action (opensearch-project#4485)

commit fddbb70
Author: Lantao Jin <ltjin@amazon.com>
Date:   Tue Oct 14 10:23:12 2025 +0800

    Add configurable sytem limitations for `subsearch` and `join` command (opensearch-project#4501)

    * Add configurable sytem limitations for subsearch and join command

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    * Fix IT

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    * typo

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    * fix IT

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    * remove rollback in doc

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    * address comments

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    * fix typo

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    * Fix IT

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

    ---------

    Signed-off-by: Lantao Jin <ltjin@amazon.com>

Signed-off-by: Tomoyuki Morita <moritato@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport 2.19-dev calcite calcite migration releated enhancement New feature or request PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Enhancement] Error handling for unsupported characters in java regex library [FEATURE] Support _/- as parsed field name

5 participants