librex-ast is a feature-rich, PCRE2-compatible regular expression engine written in modern C. It provides a complete pipeline for compiling regex patterns into an efficient internal bytecode representation and executing them against subject strings using a backtracking NFA-style virtual machine.
Originally an AST parser, the library has evolved into a complete engine designed for correctness, feature-completeness, and PCRE2 compatibility. It serves as an excellent foundation for building custom tools, educational software, or for anyone interested in the internals of modern regex engines.
This library is a comprehensive implementation of a modern regex engine, capable of passing a large compatibility test suite. However, before using it in a production environment, please be aware of the following:
- Not Yet Performance-Tuned: The VM implementation is focused on correctness and feature support. It has not yet been optimized for high-throughput production workloads and may be significantly slower than mature libraries like PCRE2.
- Simplified Unicode Properties: The \p{...}property support is comprehensive but uses a simplified, non-exhaustive set of Unicode data. A production-ready version would require data generated from the full Unicode Character Database (UCD).
Contributions to address these points are highly welcome!
- PCRE2-Compatible Engine: A complete engine that compiles patterns and matches strings.
- Two-Stage Compilation:
- Parser: A robust recursive-descent parser builds a detailed Abstract Syntax Tree (AST) from the pattern, resolving forward references for constructs like \k<name>.
- Compiler: The AST is then compiled into a compact and efficient bytecode format, where capturing groups become callable subroutines.
 
- Parser: A robust recursive-descent parser builds a detailed Abstract Syntax Tree (AST) from the pattern, resolving forward references for constructs like 
- Backtracking NFA Virtual Machine: The engine uses a custom, non-recursive, stack-based VM to execute the bytecode, providing powerful and correct matching behavior while avoiding C stack overflow.
- Extensive Syntax Support: Implements a wide array of PCRE2/Perl constructs, including atomic groups, possessive quantifiers, advanced assertions, conditionals, and subroutines.
- Structured Error Reporting: The compile function provides a detailed regex_errobject with an error code, message, and the exact position (line,col) of the error in the pattern.
- Pluggable Memory Allocators: The entire library can be configured to use custom malloc,realloc, andfreefunctions for specialized memory environments.
- Clean, Opaque API: The compiled pattern is managed through an opaque regex_compiled*handle, hiding internal complexity.
- Unicode-Aware: Correctly parses UTF-8 patterns, Unicode escape sequences (\x{...}), and Unicode properties (\p{L}).
- Zero Dependencies: Written in pure C99 with no external library dependencies.
The engine operates in three distinct stages:
- Parsing (regex-parser.c): The input pattern string is parsed into an Abstract Syntax Tree (AST). This tree represents the logical structure of the regular expression. This stage validates syntax and reports errors.
- Compilation (regex-match.c): The AST is traversed by a compiler which emits a linear sequence of bytecode instructions. This bytecode is a low-level representation of the matching logic, optimized for the VM.
- Execution (regex-match.c): The generated bytecode is executed by a lightweight, stack-based virtual machine (VM). The VM performs the actual matching against the subject string using a backtracking NFA algorithm, managing alternative paths on an explicit stack.
Here is a simple example of how to compile a pattern, execute a match, and inspect the results.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "regex-parser.h"
int main(void) {
    const char* pattern = "(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})";
    const char* subject = "The date is 2023-11-28.";
    regex_err error = {0};
    
    // 1. Compile the pattern
    regex_compiled* rx = regex_compile(pattern, 0, &error);
    if (!rx) {
        fprintf(stderr, "Failed to compile pattern at position %d: %s\n", 
                error.pos, error.msg);
        return 1;
    }
    
    printf("Successfully compiled pattern: \"%s\"\n", pattern);
    // 2. Execute the match
    regex_match_result result = {0};
    int match_status = regex_match(rx, subject, strlen(subject), &result);
    if (match_status == 1) { // 1 indicates a successful match
        printf("Match found from index %d to %d!\n", result.match_start, result.match_end);
        printf("  - Full match: '%.*s'\n", 
               (int)(result.match_end - result.match_start), 
               subject + result.match_start);
        // Capture groups are 1-indexed for users, but 0-indexed in the result arrays.
        for (int i = 0; i < result.capture_count; i++) {
            if (result.capture_starts[i] != -1) {
                printf("  - Group %d: '%.*s'\n", i + 1,
                       (int)(result.capture_ends[i] - result.capture_starts[i]),
                       subject + result.capture_starts[i]);
            }
        }
        
        // 3. Free the match result's internal arrays
        regex_free_match_result(&result, NULL);
    } else if (match_status == 0) {
        printf("No match found.\n");
    } else {
        // A return value (other than 1 and 0) is a runtime error code
        fprintf(stderr, "Execution error: %s\n", regex_error_message(match_status));
    }
    // 4. Free the compiled regex object
    regex_free(rx);
    return 0;
}- regex_compiled: An opaque pointer representing a compiled regular expression.
- regex_err: A struct containing detailed error information (- code,- pos,- line,- col,- msg).
- regex_match_result: A struct containing the results of a successful match, including overall match bounds and arrays of capture group bounds.
- regex_allocator: A struct allowing you to provide custom memory allocation functions.
- 
regex_compiled* regex_compile(const char* pattern, uint32_t flags, regex_err* error)
 Compiles a null-terminatedpatternstring using standard library allocators. On success, returns an opaqueregex_compiled*handle. On failure, returnsNULLand populates theerrorstruct.
- 
regex_compiled* regex_compile_with_allocator(const char* pattern, uint32_t flags, const regex_allocator* allocator, regex_err* error)
 Same asregex_compilebut uses a customregex_allocator.
- 
void regex_free(regex_compiled* rx)
 Frees all memory associated with a compiled regex handle.
- 
int regex_match(regex_compiled* rx, const char* subject, size_t subject_len, regex_match_result* result)
 Executes a compiled regexrxagainst asubjectstring.- Returns 1on a successful match and populates theresultstruct.
- Returns 0if no match is found.
- Returns a positive REGEX_ERR_*code on a runtime error (e.g.,REGEX_ERR_RECURSION_LIMIT,REGEX_ERR_MEMORY).
 
- Returns 
- 
void regex_free_match_result(regex_match_result* result, const regex_allocator* allocator)
 Frees the capture group arrays within aregex_match_resultstruct. PassNULLfor the allocator to use the standard libraryfree.
- 
const char* regex_error_message(int error_code)
 Returns a static, human-readable string for a givenREGEX_ERR_*code.
- 
void print_regex_ast(const regex_compiled* rx)
 A debugging utility to print a human-readable representation of the internal AST to standard output.
- REG_IGNORECASE: Perform case-insensitive matching.
- REG_MULTILINE:- ^and- $match the start/end of lines, not just the start/end of the string.
- REG_SINGLELINE:- .(dot) metacharacter matches all characters, including newlines.
- REG_EXTENDED: Ignore unescaped whitespace and comments starting with- #.
- REG_UNGREEDY: Invert the greediness of quantifiers (e.g.,- *becomes lazy,- *?becomes greedy).
The engine supports a wide subset of PCRE2/Perl syntax.
- Literal characters (full Unicode support)
- .(dot metacharacter, respects- REG_SINGLELINE)
- Character classes [abc],[^abc],[a-z]
- Pre-defined classes: \d,\D,\w,\W,\s,\S
- Anchors: ^,$,\A,\z,\b(word boundary),\B
- Greedy: *,+,?,{n,m}
- Lazy (non-greedy): *?,+?,??,{n,m}?
- Possessive: *+,++,?+,{n,m}+
- Capturing groups: (...)
- Non-capturing groups: (?:...)
- Named capturing groups: (?<name>...),(?'name'...)
- Atomic groups: (?>...)
- Branch-reset groups: (?|...)
- Positive Lookahead: (?=...)
- Negative Lookahead: (?!...)
- Positive Lookbehind: (?<=...)(fixed-length only)
- Negative Lookbehind: (?<!...)(fixed-length only)
- Numbered backreferences: \1,\2, etc.
- Named backreferences: \k<name>or\k'name'
- Numbered subroutine calls: (?1),(?2), etc.
- Named subroutine calls: (?&name)
- Full pattern recursion: (?R)
- Numeric condition: (?(1)yes|no)
- Named condition: (?(<name>)yes|no)or(?('name')yes|no)
- Assertion condition: (?(?=...)yes|no)
- Inline flags: (?i),(?-m)
- Scoped flags: (?i:...)
- Unicode properties: \p{L},\P{Sc}
- Unicode escapes: \x{...},\xAB,\uABCD
- POSIX character classes: [[:alpha:]],[[:digit:]], etc. (Unicode-aware)
- Comments: (?#...)
- Quoted sequences: \Q...\E
The engine is comprehensive but does not yet implement every feature of PCRE2. The following limitations currently apply:
- Lookbehind Assertions: Must be fixed-length. Variable-length lookbehind (e.g., (?<=a*)) is not supported. The maximum lookbehind length is 255 characters.
- Unicode Support:
- The built-in Unicode property (\p{...}) and POSIX class ([[:alpha:]]) support is based on a partial character database and does not cover all Unicode scripts or categories.
- Grapheme cluster matching (\X) and script run matching are not supported.
 
- The built-in Unicode property (
- Backreferences & Subroutines:
- The \g{...}backreference/subroutine syntax is not supported. Use the alternative forms\k<name>,(?n), or(?&name)instead.
- Recursion/subroutine call depth is limited to 32 (MAX_CALL_DEPTH).
 
- The 
- Control Verbs: Verbs that control the backtracking engine, such as (*SKIP),(*FAIL),(*ACCEPT),(*COMMIT), etc., are not supported.
- Callouts: The (?C...)callout feature for interacting with external C code during a match is not implemented.
- Generic Newline Sequences: The \Rshortcut for matching generic newline sequences is not supported. Use an explicit alternation like\r?\nor a character class[\r\n]instead.
- Engine Optimizations: No AST or bytecode optimization passes (e.g., peephole optimizations) are currently performed, which may impact performance on complex patterns.
The project uses CMake and has no external dependencies. You will need a C99-compatible compiler (e.g., GCC, Clang, MSVC) and CMake (version 3.10 or later).
Follow these steps to build the library and the test suite from the command line:
# 1. Create a directory for an out-of-source build
mkdir build
cd build
# 2. Configure the project using CMake
# On Linux/macOS with Makefiles (default)
cmake ..
# On Windows, you might generate a Visual Studio solution
# cmake .. -G "Visual Studio 17 2022"
# 3. Compile the project
# This universal command works with most generators (Makefiles, Ninja, etc.)
cmake --build .This process will generate the following in your build directory:
- Shared Library: librex_ast.dllon Windows, orliblibrex_ast.so/liblibrex_ast.dylibon Unix-like systems.
- Static Library: librex_ast.libon Windows, orliblibrex_ast.aon Unix-like systems.
- Test Executable: regex_test(orregex_test.exeon Windows).
After a successful build, you can run the test suite from the build directory:
# On Linux/macOS
./regex_test
# On Windows
.\regex_test.exeYou can customize the build with CMake options passed during the configuration step.
- ENABLE_WARNINGS: Toggles extra compiler warnings (e.g.,- -Wall -Wextra). It is- ONby default.
To build with this option disabled:
cmake .. -DENABLE_WARNINGS=OFFThis project is an excellent platform for exploring regex engine design. Contributions are welcome to make it more robust and performant. High-priority areas include:
- Performance Optimization: Profile and optimize the VM loop, reduce memory allocations during matching, and implement bytecode optimizations.
- Full Unicode Support: Integrate code generation scripts to build comprehensive property, script, and character-folding tables from the official Unicode Character Database (UCD).
- JIT Compilation: Explore adding a Just-In-Time (JIT) compiler backend (e.g., targeting x86-64) for performance-critical patterns.
- CI & Testing: Expand the test suite for more comprehensive coverage and set up continuous integration checks.
Please feel free to open an issue to discuss a new feature or submit a pull request.
This project is licensed under the MIT License.
Mounir IDRASSI mounir.idrassi@amcrypto.jp