[RFC] Add `rex` Command to PPL (Calcite Engine)

## Problem Statement

In data analysis and log processing workflows, users frequently need to extract structured information from unstructured text fields and perform text transformations, for example:

- Extracting email usernames and domains from user registration data
- Parsing log entries to extract timestamp, severity level, and error codes
- Transforming sensitive data by masking or anonymizing field values
- Extracting multiple values from comma-separated or space-delimited fields
- Converting field formats for standardization (e.g., phone numbers, addresses)

Currently, PPL lacks comprehensive text processing capabilities, forcing users to rely on basic string functions or external processing tools. This limits the ability to perform complex field extraction and transformation tasks within PPL pipelines.

## Current State

- PPL provides basic string functions like `substr`, `concat`, and `trim`
- Limited pattern-based field extraction capabilities
- No native support for named capture groups or complex regex operations
- Users must resort to multiple commands or external tools for text processing workflows
- No standardized approach for field masking or data anonymization

## Long-Term Goals

- Provide comprehensive text processing capabilities within PPL pipelines
- Enable efficient server-side pattern matching and field extraction
- Support both field extraction and in-place text transformation operations
- Establish consistent patterns for text processing commands in PPL
- Enable complex data parsing workflows for log analysis and data preparation

## Proposal

### High-Level Functionality

The `rex` command provides two primary modes for text processing:

1. **Field Extraction Mode (Default)**: Extract structured data from text using named capture groups
2. **Text Transformation Mode**: Modify field values using find-and-replace operations

### Syntax

#### Field Extraction Mode
```bash
source | rex [field=<field>] "<pattern>" [max_match=<int>] [offset_field=<string>]
```

#### Text Transformation Mode
```bash
source | rex [field=<field>] mode=sed "<sed-expression>"
```

### Parameters

- **field**: Source field to process (defaults to _raw field handling)
- **pattern**: Regular expression with named capture groups for extraction
- **max_match**: Controls number of matches extracted (1=single value, >1=array, 0=unlimited)
- **offset_field**: Creates field tracking character positions of matches
- **mode=sed**: Enables text transformation mode using sed-style expressions

### Field Extraction Mode

#### Basic Named Group Extraction
```bash
# Extract email components
source=accounts | rex field=email "(?<username>[^@]+)@(?<domain>[^.]+)\.(?<tld>.+)" | fields username, domain, tld

# Parse log entries  
source=logs | rex field=message "(?<timestamp>\\d{4}-\\d{2}-\\d{2}) (?<level>\\w+) (?<component>\\w+): (?<msg>.*)" | fields timestamp, level, component, msg

# Extract multiple phone number parts
source=contacts | rex field=phone "(?<country>\\+\\d{1,3})-(?<area>\\d{3})-(?<number>\\d{3}-\\d{4})" | fields country, area, number
```

#### Multi-Value Extraction with max_match
```bash
# Extract all words from a text field (unlimited)
source=documents | rex field=content "(?<words>\\w+)" max_match=0 | fields content, words

# Extract first 3 numbers from address
source=addresses | rex field=address "(?<numbers>\\d+)" max_match=3 | fields address, numbers

# Extract hashtags from social media posts  
source=posts | rex field=text "(?<hashtags>#\\w+)" max_match=5 | fields text, hashtags
```

#### Position Tracking with offset_field
```bash
# Track where email username appears in text
source=emails | rex field=message "(?<email>[^@]+@[^\\s]+)" offset_field=email_positions | fields message, email, email_positions

# Find positions of all phone numbers in document
source=documents | rex field=text "(?<phone>\\d{3}-\\d{3}-\\d{4})" max_match=0 offset_field=phone_locations | fields text, phone, phone_locations
```

### Text Transformation Mode

#### String Substitution
```bash  
# Mask email domains for privacy
source=users | rex field=email mode=sed "s/@.*/@company.com/" | fields email

# Replace all digits with X for anonymization
source=logs | rex field=client_ip mode=sed "s/\\d/X/g" | fields client_ip

# Standardize phone number format
source=contacts | rex field=phone mode=sed "s/\$(\\d{3})\$\\s*(\\d{3})-(\\d{4})/\\1-\\2-\\3/" | fields phone
```

#### Character Transliteration
```bash
# Convert special characters for URL safety
source=articles | rex field=title mode=sed "y/ /_/" | fields title

# Remove vowels from sensitive field
source=data | rex field=sensitive mode=sed "y/aeiouAEIOU/_________/" | fields sensitive

# Convert case manually (demonstration)
source=names | rex field=first_name mode=sed "y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/" | fields first_name  
```

### Advanced Use Cases

#### Combined Extraction and Transformation
```bash
# Extract original data, then create masked version
source=customers | rex field=email "(?<username>[^@]+)" | rex field=email mode=sed "s/^(.{2}).*@/\\1***@/" | fields email, username

# Parse structured data from multiple formats
source=logs | rex field=raw_log "(?<timestamp>\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2})" | rex field=raw_log mode=sed "s/T/ /" | fields raw_log, timestamp
```

#### Multi-Field Processing
```bash
# Extract from one field, transform another
source=users | rex field=full_name "(?<first>\\w+) (?<last>\\w+)" | rex field=email mode=sed "s/@.*/@anonymous.com/" | fields first, last, email

# Process multiple fields with different patterns
source=products | rex field=sku "(?<category>[A-Z]+)-(?<id>\\d+)" | rex field=description mode=sed "s/\\$[0-9.]+/[PRICE_HIDDEN]/g" | fields category, id, description
```

### Complex Pattern Examples

#### Log Analysis
```bash
# Apache access log parsing
source=apache_logs | rex field=log "(?<ip>\\d+\\.\\d+\\.\\d+\\.\\d+) - - \\[(?<timestamp>[^\\]]+)\\] \"(?<method>\\w+) (?<url>[^\\s]+) (?<protocol>[^\"]+)\" (?<status>\\d+) (?<size>\\d+)" | fields ip, timestamp, method, url, status, size

# JSON log extraction  
source=app_logs | rex field=message "\"level\":\"(?<log_level>\\w+)\",\"msg\":\"(?<log_message>[^\"]+)\",\"time\":\"(?<log_time>[^\"]+)\"" | fields log_level, log_message, log_time

# Error pattern extraction with multiple matches
source=error_logs | rex field=stacktrace "(?<errors>\\w+Exception)" max_match=0 | rex field=stacktrace "(?<methods>\\w+\\.\\w+\\()" max_match=5 | fields errors, methods
```

#### Data Standardization
```bash
# Address parsing and standardization
source=addresses | rex field=address "(?<number>\\d+) (?<street>.*) (?<type>St|Ave|Blvd|Rd)" | rex field=address mode=sed "s/ (St|Ave|Blvd|Rd)/ \\1./g" | fields number, street, type, address

# Phone number extraction and formatting
source=contacts | rex field=phone_raw "(?<phone>\\d{3}[-.\\s]?\\d{3}[-.\\s]?\\d{4})" | rex field=phone mode=sed "s/[-.\\s]//g" | rex field=phone mode=sed "s/(\\d{3})(\\d{3})(\\d{4})/\\1-\\2-\\3/" | fields phone

# Credit card masking (using sample data)
source=transactions | rex field=card_number "(?<card_type>^\\d{1})" | rex field=card_number mode=sed "s/^(\\d{4})\\d{8}(\\d{4})/\\1-XXXX-XXXX-\\2/" | fields card_type, card_number
```

### Implementation Details

#### Performance Optimizations
- **Pushdown Support**: Field extraction mode pushes regex filters to OpenSearch for efficient server-side filtering
- **Selective Processing**: Text transformation mode processes all records without filtering overhead
- **Pattern Caching**: Compiled regex patterns are cached for repeated use

#### Query Optimization Example
```bash
# This query pushes the regex filter to OpenSearch
source=logs | rex field=message "(?<level>ERROR|WARN)" | fields message, level

# Explain output shows: REGEXP_CONTAINS pushdown for efficient filtering
```

#### Type System Integration
- **Field Extraction**: Creates new fields with string or array types based on max_match parameter
- **Text Transformation**: Modifies existing field values in-place
- **Type Safety**: Handles null values and type conversion gracefully

### Alternative Approaches

**Basic String Functions**: Limited to simple operations, cannot handle complex patterns or multi-value extraction
```bash
# Limited approach
source=data | eval domain = substr(email, locate(email, "@") + 1)

# Rex approach - more flexible
source=data | rex field=email "(?<domain>[^@]+@(?<domain>[^.]+))"
```

**Multiple Commands**: Requires complex pipeline chains for what rex accomplishes in one step
```bash
# Complex multi-step approach
source=logs | where match(message, "ERROR") | eval level = "ERROR" | eval timestamp = substr(message, 1, 19)

# Rex approach - single command
source=logs | rex field=message "(?<timestamp>\\S+) (?<level>ERROR|WARN|INFO): (?<msg>.*)"
```

### Error Handling

- **Invalid Patterns**: Returns original field values, logs pattern compilation errors
- **Type Mismatches**: Automatically converts non-string fields to strings for processing
- **Null Values**: Gracefully handles null fields, returns null for pattern matching operations
- **Performance**: Includes safeguards against catastrophic backtracking in complex patterns

## Engine Support

Currently implemented in Calcite engine with full OpenSearch integration.

## Open Questions

### Default Field Behavior
How should rex handle cases where no field is specified? - https://github.com/opensearch-project/sql/issues/4111
```bash
# Should this search a default field like _raw?
source=logs | rex "(?<level>ERROR|WARN)"
```

[RFC] Add rex Command to PPL (Calcite Engine) #4108

Description

Problem Statement

Current State

Long-Term Goals

Proposal

High-Level Functionality

Syntax

Field Extraction Mode

Text Transformation Mode

Parameters

Field Extraction Mode

Basic Named Group Extraction

Multi-Value Extraction with max_match

Position Tracking with offset_field

Text Transformation Mode

String Substitution

Character Transliteration

Advanced Use Cases

Combined Extraction and Transformation

Multi-Field Processing

Complex Pattern Examples

Log Analysis

Data Standardization

Implementation Details

Performance Optimizations

Query Optimization Example

Type System Integration

Alternative Approaches

Error Handling

Engine Support

Open Questions

Default Field Behavior

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[RFC] Add `rex` Command to PPL (Calcite Engine) #4108