Description
The RexExtractFunction and RexExtractMultiFunction currently compile regex patterns on every invocation, which can be a performance bottleneck for queries processing large datasets. We should implement pattern caching to improve performance.
Current Implementation
From #4109, currently in RexExtractFunction.java:54-56 and RexExtractMultiFunction.java:
public static String extractGroup(String text, String pattern, int groupIndex) {
  try {
    Pattern compiledPattern = Pattern.compile(pattern); // Compiled every time
    Matcher matcher = compiledPattern.matcher(text);
    // ...
  }
}
Problem
- Pattern compilation is expensive: The Pattern.compile() operation involves parsing the regex string and building an internal state machine, which is computationally expensive.
- Repeated compilation: In a typical query like source=logs | rex field=message "(?<user>\w+)@(?<domain>\w+\.com)", the same pattern is compiled once for every row in the dataset.
- Impact at scale: For a dataset with millions of rows, this results in millions of redundant pattern compilations of the exact same regex.
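To make the per-row cost concrete, here is a minimal, JDK-only sketch contrasting the two call patterns. The class name CompilePerRowDemo and both method names are hypothetical and are not taken from the codebase; the sketch only illustrates that hoisting the compile step changes the amount of work, not the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the project.
public class CompilePerRowDemo {

    // Mirrors the behavior described above: the regex string is recompiled for every row.
    public static List<String> extractPerRow(List<String> rows, String regex, String group) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            Matcher m = Pattern.compile(regex).matcher(row);
            out.add(m.find() ? m.group(group) : null);
        }
        return out;
    }

    // Compiles once and reuses the immutable, thread-safe Pattern;
    // only the lightweight Matcher is created per row.
    public static List<String> extractCompiledOnce(List<String> rows, String regex, String group) {
        Pattern compiled = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            Matcher m = compiled.matcher(row);
            out.add(m.find() ? m.group(group) : null);
        }
        return out;
    }
}
```

Both methods return the same list for the same input; the second simply performs one compilation instead of one per row. A UDF that receives the pattern as a string argument on each call cannot hoist the compile this way, which is why a cache keyed by the pattern string is proposed.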
Proposed Solution
Implement a pattern cache similar to Apache Calcite's approach:
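import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.util.concurrent.UncheckedExecutionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// ImplementorUDF is the project's existing base class (import omitted here)
public final class RexExtractFunction extends ImplementorUDF {
  // Cache compiled patterns with max size and expiration
  private static final LoadingCache<String, Pattern> PATTERN_CACHE =
      CacheBuilder.newBuilder()
          .maximumSize(256) // Limit cache size to prevent memory issues
          .expireAfterAccess(1, TimeUnit.HOURS) // Expire unused patterns
          .build(CacheLoader.from(pattern -> {
            try {
              return Pattern.compile(pattern);
            } catch (PatternSyntaxException e) {
              throw new IllegalArgumentException("Invalid regex pattern: " + pattern, e);
            }
          }));

  public static String extractGroup(String text, String pattern, int groupIndex) {
    try {
      Pattern compiledPattern = PATTERN_CACHE.get(pattern); // Reuse compiled pattern
      Matcher matcher = compiledPattern.matcher(text);
      // ...
    } catch (ExecutionException | UncheckedExecutionException e) {
      // Handle cache loading exception. Unchecked loader failures, such as the
      // IllegalArgumentException above, arrive wrapped in UncheckedExecutionException.
    }
  }
}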
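For comparison, the same bounded-cache idea can be sketched with only the JDK. This is not the Guava approach proposed above but a swapped-in LRU built on LinkedHashMap, without time-based expiration; the class name BoundedPatternCache is hypothetical. It may be useful for experimenting with cache sizes without adding a dependency.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical JDK-only LRU cache for compiled patterns (no time-based expiry).
public final class BoundedPatternCache {
    private final Map<String, Pattern> lru;

    public BoundedPatternCache(final int maxSize) {
        // accessOrder = true keeps entries ordered from least to most recently accessed.
        this.lru = new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Pattern> eldest) {
                // Evict the least recently used entry once the bound is exceeded.
                return size() > maxSize;
            }
        };
    }

    // LinkedHashMap is not thread-safe, so all access goes through synchronized methods.
    public synchronized Pattern get(String regex) {
        Pattern compiled = lru.get(regex);
        if (compiled == null) {
            // Pattern.compile throws PatternSyntaxException for an invalid regex;
            // nothing is cached in that case.
            compiled = Pattern.compile(regex);
            lru.put(regex, compiled);
        }
        return compiled;
    }

    public synchronized int size() {
        return lru.size();
    }
}
```

A single lock is simpler than Guava's segmented design but can become a contention point under heavy concurrency, which is one reason a library cache may still be the better choice for production use.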
Benefits
- Performance improvement: Benchmarks show pattern caching can improve regex operations by 10-100x for repeated patterns
- Reduced CPU usage: Eliminates redundant compilation work
- Better scalability: More efficient processing of large datasets
- Memory bounded: Cache size limits prevent unbounded memory growth
Implementation Considerations / Exit Criteria
- Cache configuration: Need to determine optimal cache size and expiration settings
- Thread safety: Guava's LoadingCache is thread-safe by default
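- Error handling: Need to properly handle cache loading exceptions. With Guava, an unchecked exception thrown by the loader (such as the IllegalArgumentException for an invalid pattern) surfaces from get() as UncheckedExecutionException rather than ExecutionException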
- Monitoring: Consider adding metrics for cache hit/miss rates
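For the monitoring item, Guava can collect these numbers itself when the cache is built with CacheBuilder.recordStats() and read through cache.stats(). The counting idea can also be sketched with only the JDK, as below. The class CountingPatternCache and its method names are hypothetical, and the map in this sketch is deliberately unbounded, so it illustrates the metrics only, not the size limit.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.regex.Pattern;

// Hypothetical sketch: hit/miss counters around a pattern cache.
// The map here is unbounded; a real cache would also enforce a size limit.
public final class CountingPatternCache {
    private final ConcurrentHashMap<String, Pattern> cache = new ConcurrentHashMap<>();
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();

    public Pattern get(String regex) {
        Pattern cached = cache.get(regex);
        if (cached != null) {
            hits.increment();
            return cached;
        }
        // Under a race, several threads may each count a miss for the same key,
        // but computeIfAbsent still stores a single Pattern per key.
        misses.increment();
        return cache.computeIfAbsent(regex, Pattern::compile);
    }

    public long hitCount() {
        return hits.sum();
    }

    public long missCount() {
        return misses.sum();
    }

    // Fraction of lookups served from the cache; defined as 1.0 when there were no lookups.
    public double hitRate() {
        long h = hits.sum();
        long total = h + misses.sum();
        return total == 0 ? 1.0 : (double) h / total;
    }
}
```

For a query that applies one pattern to many rows, the hit rate should approach 1.0 after the first row. A persistently low hit rate would suggest the cache is too small for the number of distinct patterns in flight.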
References
- Apache Calcite pattern caching implementation: SqlFunctions.java#L461-L475
- Similar optimization in Elasticsearch Grok processor
- Java Pattern compilation performance analysis
Priority
Medium - This is a performance optimization that becomes more important as data volumes grow. While not blocking functionality, it can significantly improve query performance for production workloads.
Affected Components
- RexExtractFunction.java
- RexExtractMultiFunction.java
- Potentially other regex-based functions in the codebase
Estimated Effort
Small - 1-2 days including implementation, testing, and benchmarking