Skip to content

[RFC] Add rex Command to PPL (Calcite Engine) #4108

@RyanL1997

Description

@RyanL1997

Problem Statement

In data analysis and log processing workflows, users frequently need to extract structured information from unstructured text fields and perform text transformations, for example:

  • Extracting email usernames and domains from user registration data
  • Parsing log entries to extract timestamp, severity level, and error codes
  • Transforming sensitive data by masking or anonymizing field values
  • Extracting multiple values from comma-separated or space-delimited fields
  • Converting field formats for standardization (e.g., phone numbers, addresses)

Currently, PPL lacks comprehensive text processing capabilities, forcing users to rely on basic string functions or external processing tools. This limits the ability to perform complex field extraction and transformation tasks within PPL pipelines.

Current State

  • PPL provides basic string functions like substr, concat, and trim
  • Limited pattern-based field extraction capabilities
  • No native support for named capture groups or complex regex operations
  • Users must resort to multiple commands or external tools for text processing workflows
  • No standardized approach for field masking or data anonymization

Long-Term Goals

  • Provide comprehensive text processing capabilities within PPL pipelines
  • Enable efficient server-side pattern matching and field extraction
  • Support both field extraction and in-place text transformation operations
  • Establish consistent patterns for text processing commands in PPL
  • Enable complex data parsing workflows for log analysis and data preparation

Proposal

High-Level Functionality

The rex command provides two primary modes for text processing:

  1. Field Extraction Mode (Default): Extract structured data from text using named capture groups
  2. Text Transformation Mode: Modify field values using find-and-replace operations

Syntax

Field Extraction Mode

source | rex [field=<field>] "<pattern>" [max_match=<int>] [offset_field=<string>]

Text Transformation Mode

source | rex [field=<field>] mode=sed "<sed-expression>"

Parameters

  • field: Source field to process (defaults to _raw field handling)
  • pattern: Regular expression with named capture groups for extraction
  • max_match: Controls number of matches extracted (1=single value, >1=array, 0=unlimited)
  • offset_field: Creates field tracking character positions of matches
  • mode=sed: Enables text transformation mode using sed-style expressions

Field Extraction Mode

Basic Named Group Extraction

# Extract email components
source=accounts | rex field=email "(?<username>[^@]+)@(?<domain>[^.]+)\.(?<tld>.+)" | fields username, domain, tld

# Parse log entries  
source=logs | rex field=message "(?<timestamp>\\d{4}-\\d{2}-\\d{2}) (?<level>\\w+) (?<component>\\w+): (?<msg>.*)" | fields timestamp, level, component, msg

# Extract multiple phone number parts
source=contacts | rex field=phone "(?<country>\\+\\d{1,3})-(?<area>\\d{3})-(?<number>\\d{3}-\\d{4})" | fields country, area, number

Multi-Value Extraction with max_match

# Extract all words from a text field (unlimited)
source=documents | rex field=content "(?<words>\\w+)" max_match=0 | fields content, words

# Extract first 3 numbers from address
source=addresses | rex field=address "(?<numbers>\\d+)" max_match=3 | fields address, numbers

# Extract hashtags from social media posts  
source=posts | rex field=text "(?<hashtags>#\\w+)" max_match=5 | fields text, hashtags

Position Tracking with offset_field

# Track where email username appears in text
source=emails | rex field=message "(?<email>[^@]+@[^\\s]+)" offset_field=email_positions | fields message, email, email_positions

# Find positions of all phone numbers in document
source=documents | rex field=text "(?<phone>\\d{3}-\\d{3}-\\d{4})" max_match=0 offset_field=phone_locations | fields text, phone, phone_locations

Text Transformation Mode

String Substitution

# Mask email domains for privacy
source=users | rex field=email mode=sed "s/@.*/@company.com/" | fields email

# Replace all digits with X for anonymization
source=logs | rex field=client_ip mode=sed "s/\\d/X/g" | fields client_ip

# Standardize phone number format
source=contacts | rex field=phone mode=sed "s/\\((\\d{3})\\)\\s*(\\d{3})-(\\d{4})/\\1-\\2-\\3/" | fields phone

Character Transliteration

# Convert special characters for URL safety
source=articles | rex field=title mode=sed "y/ /_/" | fields title

# Remove vowels from sensitive field
source=data | rex field=sensitive mode=sed "y/aeiouAEIOU/_________/" | fields sensitive

# Convert case manually (demonstration)
source=names | rex field=first_name mode=sed "y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/" | fields first_name  

Advanced Use Cases

Combined Extraction and Transformation

# Extract original data, then create masked version
source=customers | rex field=email "(?<username>[^@]+)" | rex field=email mode=sed "s/^(.{2}).*@/\\1***@/" | fields email, username

# Parse structured data from multiple formats
source=logs | rex field=raw_log "(?<timestamp>\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2})" | rex field=raw_log mode=sed "s/T/ /" | fields raw_log, timestamp

Multi-Field Processing

# Extract from one field, transform another
source=users | rex field=full_name "(?<first>\\w+) (?<last>\\w+)" | rex field=email mode=sed "s/@.*/@anonymous.com/" | fields first, last, email

# Process multiple fields with different patterns
source=products | rex field=sku "(?<category>[A-Z]+)-(?<id>\\d+)" | rex field=description mode=sed "s/\\$[0-9.]+/[PRICE_HIDDEN]/g" | fields category, id, description

Complex Pattern Examples

Log Analysis

# Apache access log parsing
source=apache_logs | rex field=log "(?<ip>\\d+\\.\\d+\\.\\d+\\.\\d+) - - \\[(?<timestamp>[^\\]]+)\\] \"(?<method>\\w+) (?<url>[^\\s]+) (?<protocol>[^\"]+)\" (?<status>\\d+) (?<size>\\d+)" | fields ip, timestamp, method, url, status, size

# JSON log extraction  
source=app_logs | rex field=message "\"level\":\"(?<log_level>\\w+)\",\"msg\":\"(?<log_message>[^\"]+)\",\"time\":\"(?<log_time>[^\"]+)\"" | fields log_level, log_message, log_time

# Error pattern extraction with multiple matches
source=error_logs | rex field=stacktrace "(?<errors>\\w+Exception)" max_match=0 | rex field=stacktrace "(?<methods>\\w+\\.\\w+\\()" max_match=5 | fields errors, methods

Data Standardization

# Address parsing and standardization
source=addresses | rex field=address "(?<number>\\d+) (?<street>.*) (?<type>St|Ave|Blvd|Rd)" | rex field=address mode=sed "s/ (St|Ave|Blvd|Rd)/ \\1./g" | fields number, street, type, address

# Phone number extraction and formatting
source=contacts | rex field=phone_raw "(?<phone>\\d{3}[-.\\s]?\\d{3}[-.\\s]?\\d{4})" | rex field=phone mode=sed "s/[-.\\s]//g" | rex field=phone mode=sed "s/(\\d{3})(\\d{3})(\\d{4})/\\1-\\2-\\3/" | fields phone

# Credit card masking (using sample data)
source=transactions | rex field=card_number "(?<card_type>^\\d{1})" | rex field=card_number mode=sed "s/^(\\d{4})\\d{8}(\\d{4})/\\1-XXXX-XXXX-\\2/" | fields card_type, card_number

Implementation Details

Performance Optimizations

  • Pushdown Support: Field extraction mode pushes regex filters to OpenSearch for efficient server-side filtering
  • Selective Processing: Text transformation mode processes all records without filtering overhead
  • Pattern Caching: Compiled regex patterns are cached for repeated use

Query Optimization Example

# This query pushes the regex filter to OpenSearch
source=logs | rex field=message "(?<level>ERROR|WARN)" | fields message, level

# Explain output shows: REGEXP_CONTAINS pushdown for efficient filtering

Type System Integration

  • Field Extraction: Creates new fields with string or array types based on max_match parameter
  • Text Transformation: Modifies existing field values in-place
  • Type Safety: Handles null values and type conversion gracefully

Alternative Approaches

Basic String Functions: Limited to simple operations, cannot handle complex patterns or multi-value extraction

# Limited approach
source=data | eval domain = substr(email, locate(email, "@") + 1)

# Rex approach - more flexible
source=data | rex field=email "(?<domain>[^@]+@(?<domain>[^.]+))"

Multiple Commands: Requires complex pipeline chains for what rex accomplishes in one step

# Complex multi-step approach
source=logs | where match(message, "ERROR") | eval level = "ERROR" | eval timestamp = substr(message, 1, 19)

# Rex approach - single command
source=logs | rex field=message "(?<timestamp>\\S+) (?<level>ERROR|WARN|INFO): (?<msg>.*)"

Error Handling

  • Invalid Patterns: Returns original field values, logs pattern compilation errors
  • Type Mismatches: Automatically converts non-string fields to strings for processing
  • Null Values: Gracefully handles null fields, returns null for pattern matching operations
  • Performance: Includes safeguards against catastrophic backtracking in complex patterns

Engine Support

Currently implemented in Calcite engine with full OpenSearch integration.

Open Questions

Default Field Behavior

How should rex handle cases where no field is specified? - #4111

# Should this search a default field like _raw?
source=logs | rex "(?<level>ERROR|WARN)"

Metadata

Metadata

Assignees

Labels

PPLPiped processing languagecalcitecalcite migration releatedfeaturev3.3.0

Type

No type

Projects

Status

New

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions