The creg application is a command-line tool for POSIX/GNU regular expressions that searches for patterns in text strings or text files. It implements the functions of the compact-regex.h extension library (https://github.com/nowca/compact-regex).
- fast regex testing
- text replacement function
- reads large text files (8 MB by default, more with `-s`) from a file parameter or a redirected text stream
- structured and colored display output with filters
- file write export
- different output formats and layouts (table, list, plain ASCII, CSV, JSON)
- options of the `regex.h` library with extended functionalities
- can be run on Linux, Windows, Mac and all GNU C compatible platforms
- How to use
- Examples
- Installation
- Compilation
- Commandline options
- Supported Regular Expression operations
- POSIX Standard
- Character classes
user@pc:~$ creg "abc DEF xyz ABC 123" "\d+"
- find the digit string `\d+` in the text `abc DEF xyz ABC 123`
user@pc:~$ creg -t "abc DEF xyz ABC 123" -r "abc" -f i
- find the string `-r "abc"` in the text `-t "abc DEF xyz ABC 123"`
- `-f i`: flag (case-insensitive)
user@pc:~$ creg --text "abc DEF xyz ABC 123" --regex "abc" --option-flags i
- find the string `--regex "abc"` in the text `--text "abc DEF xyz ABC 123"`
- `--option-flags i`: flag (case-insensitive)
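For readers who want to see how a global, case-insensitive search like the ones above maps onto plain POSIX `regex.h`, here is a minimal sketch. It is not creg's actual implementation: the function name `count_matches` is hypothetical, and the shorthand `\d` would have to be written as `[0-9]`, because plain POSIX does not define it.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of creg): count every non-overlapping
 * match of `pattern` in `text`. Returns -1 if the pattern does not compile. */
int count_matches(const char *text, const char *pattern, int icase)
{
    regex_t re;
    regmatch_t m;
    const char *cursor = text;
    int count = 0;
    int eflags = 0;
    int cflags = REG_EXTENDED | (icase ? REG_ICASE : 0);

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;

    /* "Global" search: run regexec() again from the end of each match. */
    while (regexec(&re, cursor, 1, &m, eflags) == 0) {
        count++;
        if (m.rm_eo == m.rm_so) {
            /* Empty match: stop at the end of the text, otherwise
             * step one byte so the loop always makes progress. */
            if (cursor[m.rm_eo] == '\0')
                break;
            cursor += m.rm_eo + 1;
        } else {
            cursor += m.rm_eo;
        }
        eflags = REG_NOTBOL; /* the cursor is no longer at the start of the text */
    }

    regfree(&re);
    return count;
}
```

Calling `count_matches("abc DEF xyz ABC 123", "abc", 1)` counts both `abc` and `ABC`; with `icase` set to `0` only the lowercase occurrence is counted.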
user@pc:~$ creg -t "abc DEF xyz ABC 123" -r "[\w ]+[^0-9]+" -p plain -d r
- find a string of words without numbers `-r "[\w ]+[^0-9]+"` in the text `-t "abc DEF xyz ABC 123"`
- `-d r`: display just the results
- `-p plain`: just as text
user@pc:~$ creg -t "abc DEF xyz ABC 123" -r "[\w]+" -p json -d r
- find all words `-r "[\w]+"` in the text `-t "abc DEF xyz ABC 123"`
- `-d r`: display just the results
- `-p json`: in JSON format
user@pc:~$ creg -t "abc DEF xyz ABC 123" -r "[a-z0-9]+" -x "###"
- replace all words made of lowercase letters or digits `-r "[a-z0-9]+"` in the text `-t "abc DEF xyz ABC 123"` with the string `-x "###"`
user@pc:~$ creg -t "abc DEF xyz ABC 123" -r "(a)(b)(c)" -x "\3\2\1" -f gi
- replace each letter of "abc" `-r "(a)(b)(c)"` with the reversed letters "cba" `-x "\3\2\1"`
- `-f gi`: flags (global, case-insensitive)
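The backreference replacement above can be sketched with plain `regexec()` and its `regmatch_t` offsets. This is only an illustration under assumptions, not the code used by creg: `replace_all` and `append` are hypothetical names, and only `\1` to `\9` are expanded in the replacement text.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

#define MAX_GROUPS 10 /* whole match plus \1..\9 */

/* Append n bytes of src to a growing, NUL-terminated buffer. 0 on success. */
static int append(char **buf, size_t *len, size_t *cap, const char *src, size_t n)
{
    if (*len + n + 1 > *cap) {
        size_t new_cap = (*len + n + 1) * 2;
        char *tmp = realloc(*buf, new_cap);
        if (tmp == NULL)
            return -1;
        *buf = tmp;
        *cap = new_cap;
    }
    memcpy(*buf + *len, src, n);
    *len += n;
    (*buf)[*len] = '\0';
    return 0;
}

/* Hypothetical helper: replace every match of `pattern` in `text`.
 * Returns a heap string the caller must free(), or NULL on error. */
char *replace_all(const char *text, const char *pattern, const char *repl, int cflags)
{
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    char *out = NULL;
    size_t len = 0, cap = 0;
    const char *cur = text;
    int err, eflags = 0;

    /* Group offsets are needed, so REG_NOSUB is masked out. */
    if (regcomp(&re, pattern, cflags & ~REG_NOSUB) != 0)
        return NULL;
    err = append(&out, &len, &cap, "", 0); /* start with an empty string */

    while (!err && *cur != '\0' && regexec(&re, cur, MAX_GROUPS, m, eflags) == 0) {
        err |= append(&out, &len, &cap, cur, (size_t)m[0].rm_so); /* text before the match */
        for (const char *p = repl; *p != '\0'; p++) {
            if (p[0] == '\\' && p[1] >= '1' && p[1] <= '9') {
                int g = *++p - '0';
                if (m[g].rm_so >= 0) /* the group took part in the match */
                    err |= append(&out, &len, &cap, cur + m[g].rm_so,
                                  (size_t)(m[g].rm_eo - m[g].rm_so));
            } else {
                err |= append(&out, &len, &cap, p, 1);
            }
        }
        cur += m[0].rm_eo;
        if (m[0].rm_eo == m[0].rm_so && *cur != '\0') /* empty match: copy one byte, move on */
            err |= append(&out, &len, &cap, cur++, 1);
        eflags = REG_NOTBOL;
    }
    if (!err)
        err = append(&out, &len, &cap, cur, strlen(cur)); /* remaining tail */
    regfree(&re);
    if (err) {
        free(out);
        return NULL;
    }
    return out;
}
```

With `REG_EXTENDED | REG_ICASE`, the pattern `(a)(b)(c)` and the replacement `\3\2\1`, the sample text `abc DEF xyz ABC 123` becomes `cba DEF xyz CBA 123`.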
user@pc:~$ cat service-names-port-numbers.csv | ./creg -r "(\\d+);(.*UDP.*);(.*mail.*);" -c -f gein -d srp
- display the file contents of `service-names-port-numbers.csv` with `cat` and read its output (`STDOUT`) through a pipe
- `-r`: match all UDP-based protocols that contain the word mail, with the options:
  - `-c`: colored output
  - `-f gein`: flags (global, extended, case-insensitive, newline)
  - `-d srp`: display statistics, results and index positions
user@pc:~$ ./creg -i ./example-files/oxford-word-list.txt -r "^(Ae.*ion) (.+\.) (.*)$" -p list -f gei -c -d sr
- `-i`: read in the file `./example-files/oxford-word-list.txt`
- `-r`: match all lines (from `^` to `$`) with words that start with `Ae` and end with `ion`, with the options:
  - `-c`: colored output
  - `-p list`: list format
  - `-f gei`: flags (global, extended, case-insensitive)
  - `-d sr`: display statistics and results, without the index positions
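Line-based matching with `^` and `$`, as used in the file examples above, relies on newline-sensitive matching. The following sketch shows the idea with `REG_NEWLINE` from `regex.h`; it is an assumption about how such a search can be done, not creg's own code, and `count_line_matches` is a hypothetical name.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: count matches of a line-anchored pattern in a
 * multi-line text. REG_NEWLINE lets ^ and $ match at every line and
 * keeps . from crossing a newline. Returns -1 on a compile error. */
int count_line_matches(const char *text, const char *pattern)
{
    regex_t re;
    regmatch_t m;
    const char *cur = text;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NEWLINE) != 0)
        return -1;

    while (*cur != '\0') {
        /* Unless the cursor sits at a line start, ^ must not match there. */
        int eflags = (cur == text || cur[-1] == '\n') ? 0 : REG_NOTBOL;
        if (regexec(&re, cur, 1, &m, eflags) != 0)
            break;
        count++;
        cur += m.rm_eo;
        if (m.rm_eo == m.rm_so && *cur != '\0')
            cur++; /* empty match: step one byte to guarantee progress */
    }

    regfree(&re);
    return count;
}
```

For a text with the three lines `smtp;mail`, `http;web` and `imap;Mail access`, the pattern `^.*mail.*$` is counted twice, once for each line that contains the word.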
Z:\>creg.exe /I "example-files\windows-formatted-regfile.reg" /R ".*HKEY-CLASSES_ROOT.*" /D TSR
- `/I`: read in the file `example-files\windows-formatted-regfile.reg`
- `/R`: match all lines that contain the phrase "HKEY-CLASSES_ROOT", with the options:
  - `/D TSR`: display text, statistics and results, without the index positions
The input file can also be piped in with the Windows cmd pipeline operator:
Z:\>more port-numbers.csv | creg.exe /R "^.*mail.*$" /D sr /F gein /P list
- `more port-numbers.csv |`: show the contents of the file and redirect it with `|`
- `/R`: match all lines that contain the phrase "mail", with the options:
  - `/D sr`: display statistics and results, without the index positions
  - `/F gein`: flags (global, extended, case-insensitive, newline)
  - `/P list`: short list format
The program can be compiled and copied to the `/opt/` folder. Just run:
user@pc:~$ make
and
user@pc:~$ sudo make install
Build the example program by typing in:
user@pc:~$ make
...or compile it directly with the GNU C compiler:
user@pc:~$ gcc -Wall -static creg.c -o creg
- The GNU extensions of the `regex.h` library are needed for successful compilation. Make sure the necessary header and library files are included.
- Use the `-m32` flag to compile the program for 32-bit systems.
- Important note: When the program is compiled with the `-static` flag, which links the libraries into the executable, valgrind reports some memory leaks. These errors are suppressed by default when linking dynamically. (https://stackoverflow.com/questions/7506134/valgrind-errors-when-linked-with-static-why)
To compile the program on Windows, you need a compiler distribution that includes the regex.h library from the GNU extensions:
C:\Users\pcuser>gcc.exe -static -IC:\MinGW-W64\mingw32\opt\include creg.c -o creg.exe -LC:\MinGW-W64\mingw32\opt\lib -lregex
- MinGW-W64 includes the regex.h library in the `\opt\include` and `\opt\lib` folders.
- The paths of the header and library must be passed with `-I` and `-L`, with an additional `-lregex` parameter at the end of the command.
- `-static` can be used to make your project independent of shared libraries.
- The path of gcc.exe must be added to the Windows PATH user variable.
To compile the program on macOS (OS X), you need a compiler distribution that includes the regex.h library from the GNU extensions:
- There are several ways to install the GCC development tools on your Mac:
  - Xcode
  - Homebrew
  - MacPorts
  - source code compilation
  - graphical package installer like Bower or MacUpdate
- You need a GCC installation with the `regex.h` library (GNU extensions).
- For compiler options see Linux.
- see `-h` or `--help` to read all the options
creg [Commands] [Options]
Command: | Meaning: |
---|---|
`-t <input-text>`, `--text <input-text>` | text input string |
`-r <expression>`, `--regex <expression>` | regular expression pattern |
`-x <replace-text>`, `--replace <replace-text>` | replacement text substring |
`-i <filename>`, `--input <filename>` | file path of the file to read |
`-o <filename>`, `--output <filename>` | file path of the file to write |
`-h`, `--help` | show help for commands |
Command: | Meaning: |
---|---|
`-d <data>`, `--data <data>` | show output elements |

`<data>`:

Argument: | Meaning: |
---|---|
`t` | input text |
`s` | statistics |
`r` | results |
`p` | match index positions |

usage example: `-d tsrp` or `--data sr`
Command: | Meaning: |
---|---|
`-p <print-layout>`, `--print <print-layout>` | printing or file writing layout |

`<print-layout>`:

Argument: | Meaning: |
---|---|
`table` | table |
`list` | short list |
`list-full` | full list |
`plain` | plain result data |
`csv` | comma-separated values |
`json` | JavaScript Object Notation |
Command: | Meaning: |
---|---|
`-c`, `--color` | display with ANSI colors |
Command: | Meaning: |
---|---|
`-f <options>`, `--option-flags <options>` | option flags for compilation |

`<options>`:

Argument: | Meaning: |
---|---|
`g`: global | search for all matches in a text |
`e`: extended | use Extended Regular Expressions (ERE) |
`i`: icase | use case-insensitive matching |
`m`: multiline | search in multiple lines |
`n`: newline | ignore the newline character |
`p`: nosubexp | ignore group matching with subexpressions |
`q`: subexp | match only subexpressions |

usage example: `-f ge` or `--option-flags geinq`

default options:
- global, extended, newline (the default options are deactivated as soon as any option is set with the `-f` command)
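Several of these flag letters have an obvious counterpart among the `regcomp()` compile flags in `regex.h`. The sketch below shows one possible mapping. It is an assumption for illustration, not taken from the creg sources: `flags_to_cflags` is a hypothetical name, the pairing of `n` with `REG_NEWLINE` and `p` with `REG_NOSUB` is a guess based on the descriptions above, and `m` and `q` are skipped because `regcomp()` has no direct equivalent for them.

```c
#include <regex.h>

/* Hypothetical helper: translate flag letters such as "gei" into
 * regcomp() cflags. "g" is not a compile flag, so it is reported
 * through *global for the caller's search loop. */
int flags_to_cflags(const char *letters, int *global)
{
    int cflags = 0;

    *global = 0;
    for (const char *p = letters; *p != '\0'; p++) {
        switch (*p) {
        case 'e': cflags |= REG_EXTENDED; break; /* ERE syntax */
        case 'i': cflags |= REG_ICASE;    break; /* ignore case */
        case 'n': cflags |= REG_NEWLINE;  break; /* newline-sensitive matching (assumed) */
        case 'p': cflags |= REG_NOSUB;    break; /* no subexpression offsets (assumed) */
        case 'g': *global = 1;            break; /* handled by the caller */
        default:                          break; /* m, q and unknown letters: skipped here */
        }
    }
    return cflags;
}
```

The returned value can be passed as the third argument of `regcomp()`, while `*global` tells the caller whether to keep searching after the first match.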
Command: | Meaning: |
---|---|
`-s <length>`, `--max-text-size <length>` | max input-text length in bytes, default: 8388608 bytes (8 MB) |
`-n <count>`, `--max-num-matches <count>` | max number of matches, default: 8192 matches |
The program supports POSIX-compatible regular expressions from regex.h with some extended functionality, such as single-character class shorthands.
Supported: | Not supported: |
---|---|
Wildcard `.` | Lazy `*?` `+?` `??` |
Character classes `\d` `\D` `\w` `\W` | Negative Lookahead `(?!)` |
POSIX character classes `[:digit:]` | Negative Lookbehind `(?<!)` |
Whitespace `\s` `\S` | Positive Lookahead `(?=)` |
Character Sets `[abc]` | Positive Lookbehind `(?<=)` |
Escaping `\` | |
The Asterisk `*` | |
The Plus `+` | |
The Question Mark `?` | |
Numeric Quantifier `{n}` | |
Range Quantifier `{n,m}` | |
Alternation `\|` | |
Anchors `^` `$` | |
Capturing Groups `a(b)c` | |
Backreferences `\1` | |
ASCII and Unicode sequences | |
Metacharacter | Description |
---|---|
^ | Matches the starting position within the string. In line-based tools, it matches the starting position of any line. |
. | Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor-, character-encoding-, and platform-specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", etc., but [a.c] matches only "a", ".", or "c". |
[ ] | A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z]. The - character is treated as a literal character if it is the last or the first (after the ^, if present) character within the brackets: [abc-], [-abc], [^-abc]. Backslash escapes are not allowed. The ] character can be included in a bracket expression if it is the first (after the ^, if present) character: []abc], [^]abc]. |
[^ ] | Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". Likewise, literal characters and ranges can be mixed. |
$ | Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line. |
( ) | Defines a marked subexpression, also called a capturing group, which is essential for extracting the desired part of the text (see also the next entry, \n). BRE mode requires `\( \)`. |
\n | Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is defined in the POSIX standard. Some tools allow referencing more than nine capturing groups. Also known as a back-reference, this feature is supported in BRE mode. |
`*` | Matches the preceding element zero or more times. For example, `ab*c` matches "ac", "abc", "abbbc", etc. `[xyz]*` matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on. `(ab)*` matches "", "ab", "abab", "ababab", and so on. |
{m,n} | Matches the preceding element at least m and not more than n times. For example, a{3,5} matches only "aaa", "aaaa", and "aaaaa". This is not found in a few older instances of regexes. BRE mode requires `\{m,n\}`. |
Metacharacter | Description |
---|---|
? | Matches the preceding element zero or one time. For example, ab?c matches only "ac" or "abc". |
+ | Matches the preceding element one or more times. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac". |
`\|` | The choice (also known as alternation or set union) operator matches either the expression before or the expression after the operator. For example, `abc\|def` matches "abc" or "def". |
Source: https://en.wikipedia.org/wiki/Regular_expression#POSIX_basic_and_extended
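The difference between BRE and ERE syntax described in the tables above can be tried out directly with `regcomp()`: without `REG_EXTENDED` the pattern is read as a BRE, with it as an ERE. The helper name `matches` is hypothetical and only serves this short sketch.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if `text` matches `pattern`,
 * 0 if it does not, and -1 if the pattern does not compile.
 * Pass 0 as cflags for BRE syntax or REG_EXTENDED for ERE syntax. */
int matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    /* Only a yes/no answer is needed, so offsets are switched off. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

In BRE mode the interval and the group need backslashes, for example `matches("^a\\{3,5\\}$", "aaaa", 0)` and `matches("^\\(ab\\)*$", "abab", 0)`. In ERE mode the same expressions are written as `^a{3,5}$` and `^(ab)*$` and compiled with `REG_EXTENDED`.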
Description | POSIX | Shortcode | ASCII |
---|---|---|---|
ASCII characters | | `\x[Bytecode]` | |
Alphanumeric characters | `[:alnum:]` | | `[A-Za-z0-9]` |
Alphanumeric characters plus "_" | | `\w` | `[A-Za-z0-9_]` |
Non-word characters | | `\W` | `[^A-Za-z0-9_]` |
Alphabetic characters | `[:alpha:]` | `\a` | `[A-Za-z]` |
Space and tab | `[:space:]` | `\s` | |
Space and tab | `[:blank:]` | `\t` | |
Non-whitespace characters | | `\S` | `[^ ]` |
Word boundaries | | `\b` | |
Non-word boundaries | | `\B` | |
Digits | `[:digit:]` | `\d` | `[0-9]` |
Non-digits | | `\D` | `[^0-9]` |
Lowercase letters | `[:lower:]` | `\l` | `[a-z]` |
Uppercase letters | `[:upper:]` | `\u` | `[A-Z]` |
Visible characters | `[:print:]` | `\p` | `[\x20-\x7E]` |
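Shortcodes such as `\d` or `\w` are not part of POSIX itself, so a tool built on `regex.h` has to translate them before calling `regcomp()`. The sketch below shows one simple way to do that. It is an assumption for illustration and not creg's actual translation code: `expand_shorthand` is a hypothetical name, only six shortcodes are handled, and shortcodes inside a bracket expression such as `[\w ]` are not treated specially.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: rewrite \d \D \w \W \s \S into POSIX bracket
 * expressions. Other escapes are copied unchanged. Writes the result
 * to `out`; returns 0 on success or -1 if `out` is too small. */
int expand_shorthand(const char *pattern, char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return -1;

    for (const char *p = pattern; *p != '\0'; p++) {
        char literal[3] = { p[0], '\0', '\0' };
        const char *piece = NULL;
        size_t n;

        if (p[0] == '\\' && p[1] != '\0') {
            switch (p[1]) {
            case 'd': piece = "[0-9]";         break;
            case 'D': piece = "[^0-9]";        break;
            case 'w': piece = "[A-Za-z0-9_]";  break;
            case 'W': piece = "[^A-Za-z0-9_]"; break;
            case 's': piece = "[[:space:]]";   break;
            case 'S': piece = "[^[:space:]]";  break;
            default:  literal[1] = p[1];       break; /* keep e.g. \. or \1 intact */
            }
            p++; /* the escaped character has been consumed */
        }
        if (piece == NULL)
            piece = literal;

        n = strlen(piece);
        if (used + n + 1 > out_size)
            return -1;
        memcpy(out + used, piece, n);
        used += n;
    }
    out[used] = '\0';
    return 0;
}
```

After this step a pattern like `\d+` reads `[0-9]+` and can be compiled with `regcomp()` and `REG_EXTENDED`.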