Description
I decided to use this library for a scripting language I'm working on. Here's a link to the directory that uses it if you're interested: https://github.com/stitchlang/stitch/tree/fedf09a6a522e8e48c963075c84aa44cfc74a951/src
One thing I wanted was to "compile" my regular expressions at compile time rather than runtime. Not only is this more performant, but it means I can have as many as I want (see #3) and I don't need to allocate dynamic memory for them.
To solve this, I first noted that the regex_t
data-structure is very simple:
typedef struct regex_t
{
unsigned char type; /* CHAR, STAR, etc. */
union
{
unsigned char ch; /* the character itself */
unsigned char* ccl; /* OR a pointer to characters in class */
} u;
} regex_t;
I was able to write a function that takes a regex_t
object and generates "C initializtion" code to "recreate itself". This function was easy to write becuase re.c
already has a function named re_print
that does something very similar. Here's what it looks like:
#include <re.h>
static regex_t INLINE_WHITESPACE[] = {
{ .type = 2 }, // BEGIN
{ .type = 8, { .ccl = (unsigned char*)" \t" } }, // CHAR_CLASS
{ .type = 6 }, // PLUS
{ .type = 0 }, // UNUSED
};
static regex_t USER_ID[] = {
{ .type = 2 }, // BEGIN
{ .type = 7, { .ch = '$' } }, // CHAR
{ .type = 8, { .ccl = (unsigned char*)"a-zA-Z0-9_\\." } }, // CHAR_CLASS
{ .type = 6 }, // PLUS
{ .type = 7, { .ch = '$' } }, // CHAR
{ .type = 4 }, // QUESTIONMARK
{ .type = 0 }, // UNUSED
};
static regex_t NEWLINE[] = {
{ .type = 2 }, // BEGIN
{ .type = 7, { .ch = '\n' } }, // CHAR
{ .type = 0 }, // UNUSED
};
static regex_t QUOTED_STRING[] = {
{ .type = 2 }, // BEGIN
{ .type = 7, { .ch = '"' } }, // CHAR
{ .type = 9, { .ccl = (unsigned char*)"\"" } }, // INV_CHAR_CLASS
{ .type = 5 }, // STAR
{ .type = 7, { .ch = '"' } }, // CHAR
{ .type = 0 }, // UNUSED
};
static regex_t COMMENT[] = {
{ .type = 2 }, // BEGIN
{ .type = 7, { .ch = '#' } }, // CHAR
{ .type = 9, { .ccl = (unsigned char*)"\n" } }, // INV_CHAR_CLASS
{ .type = 5 }, // STAR
{ .type = 0 }, // UNUSED
};
static regex_t OPEN_PAREN[] = {
{ .type = 2 }, // BEGIN
{ .type = 7, { .ch = '(' } }, // CHAR
{ .type = 0 }, // UNUSED
};
static regex_t CLOSE_PAREN[] = {
{ .type = 2 }, // BEGIN
{ .type = 7, { .ch = ')' } }, // CHAR
{ .type = 0 }, // UNUSED
};
static regex_t ESCAPE_SEQUENCE[] = {
{ .type = 2 }, // BEGIN
{ .type = 7, { .ch = '@' } }, // CHAR
{ .type = 8, { .ccl = (unsigned char*)"@#$\")(" } }, // CHAR_CLASS
{ .type = 0 }, // UNUSED
};
With the C initialization code, I can compile my regex objects directly into my final binary. Now I don't need to call re_compile
at runtime. I can have as many regular expressions as I want and there's no need for dynamic memory.
Note that one big piece to this puzzle was exposing the regex_t
definition to the user by moving it to the public header file re.h
. Without this, the user would not be able to initalize the objects at compile-time.
For reference here's the function I wrote to generate this initialization code (https://github.com/stitchlang/stitch/blob/fedf09a6a522e8e48c963075c84aa44cfc74a951/src/tokens/compiler.c#L21). As you can see it's quite trivial. If a similar function were added to the library, even in a separate file like cgenerator.c
then others would easily be able to do this as well.