-
Notifications
You must be signed in to change notification settings - Fork 176
Description
Problem Statement
In data analysis and log processing workflows, users frequently need to extract structured information from unstructured text fields and perform text transformations, for example:
- Extracting email usernames and domains from user registration data
- Parsing log entries to extract timestamp, severity level, and error codes
- Transforming sensitive data by masking or anonymizing field values
- Extracting multiple values from comma-separated or space-delimited fields
- Converting field formats for standardization (e.g., phone numbers, addresses)
Currently, PPL lacks comprehensive text processing capabilities, forcing users to rely on basic string functions or external processing tools. This limits the ability to perform complex field extraction and transformation tasks within PPL pipelines.
Current State
- PPL provides basic string functions like
substr
,concat
, andtrim
- Limited pattern-based field extraction capabilities
- No native support for named capture groups or complex regex operations
- Users must resort to multiple commands or external tools for text processing workflows
- No standardized approach for field masking or data anonymization
Long-Term Goals
- Provide comprehensive text processing capabilities within PPL pipelines
- Enable efficient server-side pattern matching and field extraction
- Support both field extraction and in-place text transformation operations
- Establish consistent patterns for text processing commands in PPL
- Enable complex data parsing workflows for log analysis and data preparation
Proposal
High-Level Functionality
The rex
command provides two primary modes for text processing:
- Field Extraction Mode (Default): Extract structured data from text using named capture groups
- Text Transformation Mode: Modify field values using find-and-replace operations
Syntax
Field Extraction Mode
source | rex [field=<field>] "<pattern>" [max_match=<int>] [offset_field=<string>]
Text Transformation Mode
source | rex [field=<field>] mode=sed "<sed-expression>"
Parameters
- field: Source field to process (defaults to _raw field handling)
- pattern: Regular expression with named capture groups for extraction
- max_match: Controls number of matches extracted (1=single value, >1=array, 0=unlimited)
- offset_field: Creates field tracking character positions of matches
- mode=sed: Enables text transformation mode using sed-style expressions
Field Extraction Mode
Basic Named Group Extraction
# Extract email components
source=accounts | rex field=email "(?<username>[^@]+)@(?<domain>[^.]+)\.(?<tld>.+)" | fields username, domain, tld
# Parse log entries
source=logs | rex field=message "(?<timestamp>\\d{4}-\\d{2}-\\d{2}) (?<level>\\w+) (?<component>\\w+): (?<msg>.*)" | fields timestamp, level, component, msg
# Extract multiple phone number parts
source=contacts | rex field=phone "(?<country>\\+\\d{1,3})-(?<area>\\d{3})-(?<number>\\d{3}-\\d{4})" | fields country, area, number
Multi-Value Extraction with max_match
# Extract all words from a text field (unlimited)
source=documents | rex field=content "(?<words>\\w+)" max_match=0 | fields content, words
# Extract first 3 numbers from address
source=addresses | rex field=address "(?<numbers>\\d+)" max_match=3 | fields address, numbers
# Extract hashtags from social media posts
source=posts | rex field=text "(?<hashtags>#\\w+)" max_match=5 | fields text, hashtags
Position Tracking with offset_field
# Track where email username appears in text
source=emails | rex field=message "(?<email>[^@]+@[^\\s]+)" offset_field=email_positions | fields message, email, email_positions
# Find positions of all phone numbers in document
source=documents | rex field=text "(?<phone>\\d{3}-\\d{3}-\\d{4})" max_match=0 offset_field=phone_locations | fields text, phone, phone_locations
Text Transformation Mode
String Substitution
# Mask email domains for privacy
source=users | rex field=email mode=sed "s/@.*/@company.com/" | fields email
# Replace all digits with X for anonymization
source=logs | rex field=client_ip mode=sed "s/\\d/X/g" | fields client_ip
# Standardize phone number format
source=contacts | rex field=phone mode=sed "s/\\((\\d{3})\\)\\s*(\\d{3})-(\\d{4})/\\1-\\2-\\3/" | fields phone
Character Transliteration
# Convert special characters for URL safety
source=articles | rex field=title mode=sed "y/ /_/" | fields title
# Remove vowels from sensitive field
source=data | rex field=sensitive mode=sed "y/aeiouAEIOU/_________/" | fields sensitive
# Convert case manually (demonstration)
source=names | rex field=first_name mode=sed "y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/" | fields first_name
Advanced Use Cases
Combined Extraction and Transformation
# Extract original data, then create masked version
source=customers | rex field=email "(?<username>[^@]+)" | rex field=email mode=sed "s/^(.{2}).*@/\\1***@/" | fields email, username
# Parse structured data from multiple formats
source=logs | rex field=raw_log "(?<timestamp>\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2})" | rex field=raw_log mode=sed "s/T/ /" | fields raw_log, timestamp
Multi-Field Processing
# Extract from one field, transform another
source=users | rex field=full_name "(?<first>\\w+) (?<last>\\w+)" | rex field=email mode=sed "s/@.*/@anonymous.com/" | fields first, last, email
# Process multiple fields with different patterns
source=products | rex field=sku "(?<category>[A-Z]+)-(?<id>\\d+)" | rex field=description mode=sed "s/\\$[0-9.]+/[PRICE_HIDDEN]/g" | fields category, id, description
Complex Pattern Examples
Log Analysis
# Apache access log parsing
source=apache_logs | rex field=log "(?<ip>\\d+\\.\\d+\\.\\d+\\.\\d+) - - \\[(?<timestamp>[^\\]]+)\\] \"(?<method>\\w+) (?<url>[^\\s]+) (?<protocol>[^\"]+)\" (?<status>\\d+) (?<size>\\d+)" | fields ip, timestamp, method, url, status, size
# JSON log extraction
source=app_logs | rex field=message "\"level\":\"(?<log_level>\\w+)\",\"msg\":\"(?<log_message>[^\"]+)\",\"time\":\"(?<log_time>[^\"]+)\"" | fields log_level, log_message, log_time
# Error pattern extraction with multiple matches
source=error_logs | rex field=stacktrace "(?<errors>\\w+Exception)" max_match=0 | rex field=stacktrace "(?<methods>\\w+\\.\\w+\\()" max_match=5 | fields errors, methods
Data Standardization
# Address parsing and standardization
source=addresses | rex field=address "(?<number>\\d+) (?<street>.*) (?<type>St|Ave|Blvd|Rd)" | rex field=address mode=sed "s/ (St|Ave|Blvd|Rd)/ \\1./g" | fields number, street, type, address
# Phone number extraction and formatting
source=contacts | rex field=phone_raw "(?<phone>\\d{3}[-.\\s]?\\d{3}[-.\\s]?\\d{4})" | rex field=phone mode=sed "s/[-.\\s]//g" | rex field=phone mode=sed "s/(\\d{3})(\\d{3})(\\d{4})/\\1-\\2-\\3/" | fields phone
# Credit card masking (using sample data)
source=transactions | rex field=card_number "(?<card_type>^\\d{1})" | rex field=card_number mode=sed "s/^(\\d{4})\\d{8}(\\d{4})/\\1-XXXX-XXXX-\\2/" | fields card_type, card_number
Implementation Details
Performance Optimizations
- Pushdown Support: Field extraction mode pushes regex filters to OpenSearch for efficient server-side filtering
- Selective Processing: Text transformation mode processes all records without filtering overhead
- Pattern Caching: Compiled regex patterns are cached for repeated use
Query Optimization Example
# This query pushes the regex filter to OpenSearch
source=logs | rex field=message "(?<level>ERROR|WARN)" | fields message, level
# Explain output shows: REGEXP_CONTAINS pushdown for efficient filtering
Type System Integration
- Field Extraction: Creates new fields with string or array types based on max_match parameter
- Text Transformation: Modifies existing field values in-place
- Type Safety: Handles null values and type conversion gracefully
Alternative Approaches
Basic String Functions: Limited to simple operations, cannot handle complex patterns or multi-value extraction
# Limited approach
source=data | eval domain = substr(email, locate(email, "@") + 1)
# Rex approach - more flexible
source=data | rex field=email "(?<domain>[^@]+@(?<domain>[^.]+))"
Multiple Commands: Requires complex pipeline chains for what rex accomplishes in one step
# Complex multi-step approach
source=logs | where match(message, "ERROR") | eval level = "ERROR" | eval timestamp = substr(message, 1, 19)
# Rex approach - single command
source=logs | rex field=message "(?<timestamp>\\S+) (?<level>ERROR|WARN|INFO): (?<msg>.*)"
Error Handling
- Invalid Patterns: Returns original field values, logs pattern compilation errors
- Type Mismatches: Automatically converts non-string fields to strings for processing
- Null Values: Gracefully handles null fields, returns null for pattern matching operations
- Performance: Includes safeguards against catastrophic backtracking in complex patterns
Engine Support
Currently implemented in Calcite engine with full OpenSearch integration.
Open Questions
Default Field Behavior
How should rex handle cases where no field is specified? - #4111
# Should this search a default field like _raw?
source=logs | rex "(?<level>ERROR|WARN)"
Metadata
Metadata
Assignees
Labels
Type
Projects
Status
Status