[Enhancement] Add Pattern Caching to Rex Command Functions #4235

@RyanL1997

Description

The RexExtractFunction and RexExtractMultiFunction currently compile regex patterns on every invocation, which can be a performance bottleneck for queries processing large datasets. We should implement pattern caching to improve performance.

Current Implementation

As introduced in #4109, RexExtractFunction.java:54-56 and RexExtractMultiFunction.java currently compile the pattern on every call:

public static String extractGroup(String text, String pattern, int groupIndex) {
  try {
    Pattern compiledPattern = Pattern.compile(pattern);  // Compiled every time
    Matcher matcher = compiledPattern.matcher(text);
    // ...
  }
}

Problem

  1. Pattern compilation is expensive: The Pattern.compile() operation involves parsing the regex string and building an internal state machine, which is computationally expensive.

  2. Repeated compilation: In a typical query like source=logs | rex field=message "(?<user>\w+)@(?<domain>\w+\.com)", the same pattern is compiled once for every row in the dataset.

  3. Impact at scale: For a dataset with millions of rows, this results in millions of redundant pattern compilations of the exact same regex.
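To make the redundancy concrete, here is a minimal sketch that runs the example pattern over a few rows in two ways: compiling inside the per-row call, as the current code does, and compiling once up front. The class and method names (`CompileCountDemo`, `extractPerRow`, `extractHoisted`) and the sample rows are hypothetical and not taken from the plugin. Returning null for a non-matching row is an assumption made for the demo only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CompileCountDemo {
  // Counts Pattern.compile calls made through this demo (hypothetical instrumentation).
  static int compileCount = 0;

  static Pattern compile(String regex) {
    compileCount++;
    return Pattern.compile(regex);
  }

  // Mirrors the current behavior: the pattern is compiled for every row.
  static List<String> extractPerRow(List<String> rows, String regex, String group) {
    List<String> out = new ArrayList<>();
    for (String row : rows) {
      Matcher m = compile(regex).matcher(row);
      out.add(m.find() ? m.group(group) : null); // null on no match (demo assumption)
    }
    return out;
  }

  // Compiles once and reuses the same Pattern for every row.
  static List<String> extractHoisted(List<String> rows, String regex, String group) {
    Pattern p = compile(regex);
    List<String> out = new ArrayList<>();
    for (String row : rows) {
      Matcher m = p.matcher(row);
      out.add(m.find() ? m.group(group) : null);
    }
    return out;
  }

  public static void main(String[] args) {
    String regex = "(?<user>\\w+)@(?<domain>\\w+\\.com)";
    List<String> rows =
        List.of("login alice@example.com", "login bob@test.com", "no address here");

    compileCount = 0;
    List<String> perRow = extractPerRow(rows, regex, "user");
    System.out.println(perRow + " compiles=" + compileCount);

    compileCount = 0;
    List<String> hoisted = extractHoisted(rows, regex, "user");
    System.out.println(hoisted + " compiles=" + compileCount);
  }
}
```

Both methods return the same values. Only the number of compilations differs: one per row in the first case, one in total in the second.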

Proposed Solution

Implement a pattern cache similar to Apache Calcite's approach:

public final class RexExtractFunction extends ImplementorUDF {
  // Cache compiled patterns with max size and expiration
  private static final LoadingCache<String, Pattern> PATTERN_CACHE = 
    CacheBuilder.newBuilder()
      .maximumSize(256)  // Limit cache size to prevent memory issues
      .expireAfterAccess(1, TimeUnit.HOURS)  // Expire unused patterns
      .build(CacheLoader.from(pattern -> {
        try {
          return Pattern.compile(pattern);
        } catch (PatternSyntaxException e) {
          throw new IllegalArgumentException("Invalid regex pattern: " + pattern, e);
        }
      }));

  public static String extractGroup(String text, String pattern, int groupIndex) {
    try {
      Pattern compiledPattern = PATTERN_CACHE.get(pattern);  // Reuse compiled pattern
      Matcher matcher = compiledPattern.matcher(text);
      // ...
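    } catch (ExecutionException | UncheckedExecutionException e) {
      // Handle cache loading exception. Guava wraps unchecked loader failures, such as the
      // IllegalArgumentException thrown above, in UncheckedExecutionException; see e.getCause().
    }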
  }
}

Benefits

  1. Performance improvement: Pattern caching is expected to improve regex operations by 10-100x for repeated patterns (to be confirmed by benchmarking as part of this work)
  2. Reduced CPU usage: Eliminates redundant compilation work
  3. Better scalability: More efficient processing of large datasets
  4. Memory bounded: Cache size limits prevent unbounded memory growth

Implementation Considerations / Exit Criteria

  1. Cache configuration: Need to determine optimal cache size and expiration settings
  2. Thread safety: Guava's LoadingCache is thread-safe by default
  3. Error handling: Need to properly handle cache loading exceptions
  4. Monitoring: Consider adding metrics for cache hit/miss rates
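The considerations above can be tried out without any dependencies. The sketch below is a stand-in, not the proposed Guava implementation: it uses a synchronized, access-ordered `LinkedHashMap` as a bounded LRU cache, keeps hit/miss counters, and never caches invalid patterns. It has no time-based expiry. The class name `BoundedPatternCache` and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class BoundedPatternCache {
  private final int maxSize;
  private final Map<String, Pattern> lru;
  private final AtomicLong hits = new AtomicLong();
  private final AtomicLong misses = new AtomicLong();

  BoundedPatternCache(int maxSize) {
    this.maxSize = maxSize;
    // accessOrder=true keeps entries ordered from least to most recently used.
    this.lru = new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Pattern> eldest) {
        // Evict the least recently used entry once the bound is exceeded.
        return size() > BoundedPatternCache.this.maxSize;
      }
    };
  }

  // Returns the cached Pattern, compiling it on a miss.
  // Invalid regexes surface as IllegalArgumentException and are not cached.
  synchronized Pattern get(String regex) {
    Pattern cached = lru.get(regex);
    if (cached != null) {
      hits.incrementAndGet();
      return cached;
    }
    misses.incrementAndGet();
    try {
      Pattern compiled = Pattern.compile(regex);
      lru.put(regex, compiled);
      return compiled;
    } catch (PatternSyntaxException e) {
      throw new IllegalArgumentException("Invalid regex pattern: " + regex, e);
    }
  }

  synchronized int size() {
    return lru.size();
  }

  long hits() {
    return hits.get();
  }

  long misses() {
    return misses.get();
  }

  public static void main(String[] args) {
    BoundedPatternCache cache = new BoundedPatternCache(256);
    for (int i = 0; i < 1000; i++) {
      cache.get("(?<user>\\w+)@(?<domain>\\w+\\.com)");
    }
    System.out.println("hits=" + cache.hits() + " misses=" + cache.misses());
  }
}
```

A single lock is the simplest way to make the access-ordered map safe to share, but it serializes lookups. Guava's `LoadingCache` handles concurrency internally, which is one reason the proposal uses it. The returned `Pattern` is safe to share across threads, whereas each `Matcher` should stay local to the call that creates it.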

References

  • Apache Calcite pattern caching implementation: SqlFunctions.java#L461-L475
  • Similar optimization in Elasticsearch Grok processor
  • Java Pattern compilation performance analysis

Priority

Medium - This is a performance optimization that becomes more important as data volumes grow. While not blocking functionality, it can significantly improve query performance for production workloads.

Affected Components

  • RexExtractFunction.java
  • RexExtractMultiFunction.java
  • Potentially other regex-based functions in the codebase

Estimated Effort

Small - 1-2 days including implementation, testing, and benchmarking

Metadata

Assignees: No one assigned
Labels: PPL (Piped processing language), calcite (calcite migration related), enhancement (New feature or request)
Type: No type
Status: Not Started
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests