Add Precompiled Regex Code Generator #59

Open
@marler8997

I decided to use this library for a scripting language I'm working on. Here's a link to the directory that uses it if you're interested: https://github.com/stitchlang/stitch/tree/fedf09a6a522e8e48c963075c84aa44cfc74a951/src

One thing I wanted was to "compile" my regular expressions at compile time rather than at runtime. This avoids the cost of compiling patterns at runtime, lets me have as many of them as I want (see #3), and means I don't need to allocate dynamic memory for them.

To solve this, I first noted that the regex_t data-structure is very simple:

typedef struct regex_t
{
  unsigned char  type;   /* CHAR, STAR, etc.                      */
  union
  {
    unsigned char  ch;   /*      the character itself             */
    unsigned char* ccl;  /*  OR  a pointer to characters in class */
  } u;
} regex_t;

I was able to write a function that takes a regex_t object and generates C initialization code that recreates it. This function was easy to write because re.c already has a function named re_print that does something very similar. Here's what the generated code looks like:

#include <re.h>
static regex_t INLINE_WHITESPACE[] = {
    { .type = 2 }, // BEGIN
    { .type = 8, { .ccl = (unsigned char*)" \t" } }, // CHAR_CLASS
    { .type = 6 }, // PLUS
    { .type = 0 }, // UNUSED
};
static regex_t USER_ID[] = {
    { .type = 2 }, // BEGIN
    { .type = 7, { .ch = '$' } }, // CHAR
    { .type = 8, { .ccl = (unsigned char*)"a-zA-Z0-9_\\." } }, // CHAR_CLASS
    { .type = 6 }, // PLUS
    { .type = 7, { .ch = '$' } }, // CHAR
    { .type = 4 }, // QUESTIONMARK
    { .type = 0 }, // UNUSED
};
static regex_t NEWLINE[] = {
    { .type = 2 }, // BEGIN
    { .type = 7, { .ch = '\n' } }, // CHAR
    { .type = 0 }, // UNUSED
};
static regex_t QUOTED_STRING[] = {
    { .type = 2 }, // BEGIN
    { .type = 7, { .ch = '"' } }, // CHAR
    { .type = 9, { .ccl = (unsigned char*)"\"" } }, // INV_CHAR_CLASS
    { .type = 5 }, // STAR
    { .type = 7, { .ch = '"' } }, // CHAR
    { .type = 0 }, // UNUSED
};
static regex_t COMMENT[] = {
    { .type = 2 }, // BEGIN
    { .type = 7, { .ch = '#' } }, // CHAR
    { .type = 9, { .ccl = (unsigned char*)"\n" } }, // INV_CHAR_CLASS
    { .type = 5 }, // STAR
    { .type = 0 }, // UNUSED
};
static regex_t OPEN_PAREN[] = {
    { .type = 2 }, // BEGIN
    { .type = 7, { .ch = '(' } }, // CHAR
    { .type = 0 }, // UNUSED
};
static regex_t CLOSE_PAREN[] = {
    { .type = 2 }, // BEGIN
    { .type = 7, { .ch = ')' } }, // CHAR
    { .type = 0 }, // UNUSED
};
static regex_t ESCAPE_SEQUENCE[] = {
    { .type = 2 }, // BEGIN
    { .type = 7, { .ch = '@' } }, // CHAR
    { .type = 8, { .ccl = (unsigned char*)"@#$\")(" } }, // CHAR_CLASS
    { .type = 0 }, // UNUSED
};

With the C initialization code, I can compile my regex objects directly into my final binary. Now I don't need to call re_compile at runtime. I can have as many regular expressions as I want and there's no need for dynamic memory.
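
As a rough illustration of how such tables might be consumed, here is a minimal sketch. It copies the regex_t struct locally so it compiles without re.h, and the names token_rule, RULES and pattern_length are hypothetical helpers made up for this example, not part of the library. The enum values are taken from the comments in the generated tables above.

```c
#include <stddef.h>

/* Only the node types this sketch needs; values match the comments
   in the generated tables (UNUSED = 0, BEGIN = 2, CHAR = 7). */
enum { UNUSED = 0, BEGIN = 2, CHAR = 7 };

/* Local copy of the struct quoted earlier in this issue. */
typedef struct regex_t
{
  unsigned char  type;
  union
  {
    unsigned char  ch;
    unsigned char* ccl;
  } u;
} regex_t;

/* Precompiled patterns, same shape as the generated tables.
   They live in static storage: no re_compile call, no allocation. */
static regex_t NEWLINE[] = {
    { .type = BEGIN },
    { .type = CHAR, { .ch = '\n' } },
    { .type = UNUSED },
};
static regex_t OPEN_PAREN[] = {
    { .type = BEGIN },
    { .type = CHAR, { .ch = '(' } },
    { .type = UNUSED },
};

/* Hypothetical lookup table pairing a token name with its pattern. */
typedef struct token_rule
{
  const char* name;
  regex_t*    pattern;
} token_rule;

static const token_rule RULES[] = {
    { "NEWLINE",    NEWLINE },
    { "OPEN_PAREN", OPEN_PAREN },
};
enum { RULE_COUNT = sizeof RULES / sizeof RULES[0] };

/* Count the nodes that precede the UNUSED terminator. */
size_t pattern_length(const regex_t* pattern)
{
  size_t n = 0;
  while (pattern[n].type != UNUSED)
    ++n;
  return n;
}
```

In a real program each RULES[i].pattern would presumably be handed to the library's matcher instead of being measured; that call is left out so the sketch stays self-contained.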

Note that one big piece of this puzzle was exposing the regex_t definition to the user by moving it to the public header file re.h. Without this, the user would not be able to initialize the objects at compile time.

For reference, here's the function I wrote to generate this initialization code: https://github.com/stitchlang/stitch/blob/fedf09a6a522e8e48c963075c84aa44cfc74a951/src/tokens/compiler.c#L21. As you can see, it's quite trivial. If a similar function were added to the library, even in a separate file like cgenerator.c, others could easily do this as well.
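
To give a feel for what such a generator could look like, here is a minimal sketch. It is not the linked function: generate_c, emit_escaped and type_names are hypothetical names, the struct is copied locally so the code compiles without re.h, and the enum values are inferred from the comments in the generated tables above (DOT and END are assumed to fill the gaps at 1 and 3). It also assumes that character-class strings are NUL-terminated and only handles the node types that appear in this issue.

```c
#include <stdio.h>

/* Node types; numeric values inferred from the generated tables. */
enum { UNUSED, DOT, BEGIN, END, QUESTIONMARK, STAR, PLUS,
       CHAR, CHAR_CLASS, INV_CHAR_CLASS };

static const char* const type_names[] = {
    "UNUSED", "DOT", "BEGIN", "END", "QUESTIONMARK", "STAR", "PLUS",
    "CHAR", "CHAR_CLASS", "INV_CHAR_CLASS",
};

/* Local copy of the struct quoted earlier in this issue. */
typedef struct regex_t
{
  unsigned char  type;
  union
  {
    unsigned char  ch;
    unsigned char* ccl;
  } u;
} regex_t;

/* Write one byte as the body of a C character or string literal. */
static void emit_escaped(FILE* out, unsigned char c)
{
  switch (c)
  {
    case '\n': fputs("\\n", out);  break;
    case '\t': fputs("\\t", out);  break;
    case '\r': fputs("\\r", out);  break;
    case '\\': fputs("\\\\", out); break;
    case '\'': fputs("\\'", out);  break;
    case '"':  fputs("\\\"", out); break;
    default:
      if (c >= 0x20 && c < 0x7f)
        fputc(c, out);
      else
        fprintf(out, "\\%03o", (unsigned)c); /* 3-digit octal escape */
  }
}

/* Write `pattern` (terminated by an UNUSED node) as a static initializer. */
void generate_c(FILE* out, const char* name, const regex_t* pattern)
{
  const size_t known = sizeof type_names / sizeof type_names[0];

  fprintf(out, "static regex_t %s[] = {\n", name);
  for (;; ++pattern)
  {
    fprintf(out, "    { .type = %u", (unsigned)pattern->type);
    if (pattern->type == CHAR)
    {
      fputs(", { .ch = '", out);
      emit_escaped(out, pattern->u.ch);
      fputs("' }", out);
    }
    else if (pattern->type == CHAR_CLASS || pattern->type == INV_CHAR_CLASS)
    {
      fputs(", { .ccl = (unsigned char*)\"", out);
      for (const unsigned char* p = pattern->u.ccl; *p != '\0'; ++p)
        emit_escaped(out, *p);
      fputs("\" }", out);
    }
    fprintf(out, " }, // %s\n",
            pattern->type < known ? type_names[pattern->type] : "?");
    if (pattern->type == UNUSED)
      break; /* the terminator itself is emitted, then we stop */
  }
  fputs("};\n", out);
}
```

Calling generate_c(stdout, "COMMENT", pattern) on a compiled pattern would print a table in the same shape as the ones shown in this issue. Unlike those tables, this sketch always escapes quote characters, so a CHAR node for '"' comes out as '\"', which is equally valid C.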
