| 1 | +What changes between regex and regex-lite |
| 2 | +========================================= |
| 3 | + |
| 4 | +Performance loss |
| 5 | +---------------- |
| 6 | + |
| 7 | +Refer to the [Benchmark](#Benchmark) section for specific numbers. |
| 8 | + |
| 9 | +Artifact size reduction |
| 10 | +----------------------- |
| 11 | + |
| 12 | +Refer to the [PR comment](https://github.com/DataDog/libdatadog/pull/1232#issuecomment-3318665873) to evaluate the impact. |
| 13 | + |
| 14 | +Unicode correctness |
| 15 | +------------------- |
| 16 | + |
| 17 | +This one needs more detail since it's not just numbers. Unicode edge cases (especially around multi-codepoint characters) contributed a lot to the complexity, and thus the size, of the automata.
| 18 | + |
| 19 | +### What regex-lite still does for Unicode |
| 20 | + |
| 21 | +Both engines fundamentally match Unicode scalar values (code points) in `&str` haystacks; `.` matches a full code point (not a single byte) unless you explicitly disable Unicode in `regex`. `regex-lite` can therefore match arbitrary non-ASCII characters literally; it just lacks the higher-level Unicode semantics described in the following sections.
| 22 | + |
| 23 | +```rs |
| 24 | +// Literal Unicode still works in both |
| 25 | +assert!(regex_lite::Regex::new("ΔδΔ").unwrap().is_match("xxΔδΔyy")); |
| 26 | +``` |
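
To make "a full code point, not a single byte" concrete, here is a minimal standard-library sketch. It calls neither crate, and the helper names `scalar_count` and `byte_count` are made up for illustration.

```rust
/// Number of Unicode scalar values in the haystack; this is the unit
/// that `.` consumes when matching a `&str`.
fn scalar_count(haystack: &str) -> usize {
    haystack.chars().count()
}

/// Number of UTF-8 bytes in the same haystack.
fn byte_count(haystack: &str) -> usize {
    haystack.len()
}
```

For `"ΔδΔ"`, `scalar_count` returns 3 while `byte_count` returns 6, because each of these Greek letters takes two bytes in UTF-8.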
| 27 | + |
| 28 | + |
| 29 | +### No Unicode properties (\p{...} / \P{...}) |
| 30 | + |
| 31 | +`regex`: Supports Unicode general categories, scripts, script extensions, ages and many boolean properties via `\p{...}` / `\P{...}` (e.g., `\p{Letter}`, `\p{Greek}`, `\p{Emoji}`, `\p{Age:6.0}`), and lets you combine them inside classes. |
| 32 | + |
| 33 | +`regex-lite`: Does not support `\p{...}`/`\P{...}` at all (patterns using them fail to compile, i.e. `Regex::new` returns an error).
| 34 | + |
| 35 | +```rs |
| 36 | +// regex: OK – matches all Greek letters |
| 37 | +let re = regex::Regex::new(r"\p{Greek}+").unwrap(); |
| 38 | +assert!(re.is_match("ΔδΔ")); |
| 39 | + |
| 40 | +// regex-lite: the pattern is rejected at runtime because \p{...} is unsupported
| 41 | +assert!(regex_lite::Regex::new(r"\p{Greek}+").is_err());
| 42 | +``` |
| 43 | + |
| 44 | +### ASCII-only "Perl classes" (`\w`, `\d`, `\s`) and word boundaries |
| 45 | + |
| 46 | +`regex`: In Unicode mode (default), `\w`, `\d`, `\s` are Unicode-aware; `\b`/`\B` use Unicode's notion of "word" characters. ASCII-only variants are opt-in via `(?-u:...)`.
| 47 | + |
| 48 | +`regex-lite`: `\w`, `\d`, `\s` are ASCII-only (`\w` = `[0-9A-Za-z_]`, `\d` = `[0-9]`, `\s` = `[\t\n\v\f\r ]`). Since `\w` is ASCII-only, word boundaries behave accordingly (i.e., effectively ASCII).
| 49 | + |
| 50 | +```rs |
| 51 | +// \w on non-ASCII letters |
| 52 | +assert!(regex::Regex::new(r"^\w+$").unwrap().is_match("résumé")); // true (Unicode-aware) |
| 53 | +assert!(!regex_lite::Regex::new(r"^\w+$").unwrap().is_match("résumé"));// false (ASCII-only) |
| 54 | + |
| 55 | +// \d on non-ASCII digits (e.g., Devanagari '३') |
| 56 | +assert!(regex::Regex::new(r"^\d$").unwrap().is_match("३")); // true |
| 57 | +assert!(!regex_lite::Regex::new(r"^\d$").unwrap().is_match("३")); // false |
| 58 | + |
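| 59 | +// \b word boundary next to non-ASCII characters
| 60 | +// regex: the Arabic characters right after "word" are Unicode word characters, so there is no boundary there
| 61 | +assert!(!regex::Regex::new(r"\bword\b").unwrap().is_match("… wordًا …")); // false
| 61 | +// regex-lite: non-ASCII characters count as non-word, so the ASCII boundary holds
| 61 | +assert!(regex_lite::Regex::new(r"\bword\b").unwrap().is_match("… wordًا …")); // true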
| 62 | +``` |
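
The ASCII-only sets listed above can be written out as plain predicates. This is a standard-library sketch only: the names `lite_word`, `lite_digit` and `lite_space` are hypothetical and are not part of either crate.

```rust
/// ASCII-only `\w` as listed above: [0-9A-Za-z_].
fn lite_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// ASCII-only `\d` as listed above: [0-9].
fn lite_digit(c: char) -> bool {
    c.is_ascii_digit()
}

/// ASCII-only `\s` as listed above: [\t\n\v\f\r ].
/// Spelled out by hand because `char::is_ascii_whitespace` does not
/// include vertical tab (U+000B).
fn lite_space(c: char) -> bool {
    matches!(c, '\t' | '\n' | '\u{000B}' | '\u{000C}' | '\r' | ' ')
}
```

The standard library's `char::is_alphanumeric`, `char::is_numeric` and `char::is_whitespace` are Unicode-aware and only approximate what `regex` does in Unicode mode, but they show the gap: `'\u{00E9}'` (é) passes `is_alphanumeric` and fails `lite_word`.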
| 63 | + |
| 64 | +### No Unicode-aware case-insensitive matching |
| 65 | + |
| 66 | +`regex`: `(?i)` is Unicode-aware (uses Unicode "simple case folding"). E.g., `Δ` matches `δ` under `(?i)`.
| 67 | + |
| 68 | +`regex-lite`: `(?i)` is ASCII-only; non-ASCII letters won't fold.
| 69 | + |
| 70 | +```rs |
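| 71 | +// Anchored so the lowercase δ has to match too; unanchored, the literal Δ alone would match in both engines
| 71 | +assert!(regex::Regex::new(r"(?i)^Δ+$").unwrap().is_match("ΔδΔ")); // true: δ folds to Δ
| 72 | +assert!(!regex_lite::Regex::new(r"(?i)^Δ+$").unwrap().is_match("ΔδΔ")); // false: δ is not folded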
| 73 | +``` |
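
The difference between ASCII-only and Unicode-aware case-insensitivity can be sketched with the standard library alone. The helper names below are hypothetical, and `str::to_lowercase` is only an approximation of the simple case folding that `regex` uses.

```rust
/// ASCII-only comparison: only A-Z and a-z are folded, which mirrors
/// how an ASCII-only `(?i)` treats letters.
fn ascii_ci_eq(a: &str, b: &str) -> bool {
    a.eq_ignore_ascii_case(b)
}

/// Rough Unicode-aware comparison via std lowercasing.
/// Assumption: good enough for illustration, not a full case-folding implementation.
fn unicode_ci_eq(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}
```

With these, `ascii_ci_eq("Δ", "δ")` is false while `unicode_ci_eq("Δ", "δ")` is true, which is the same split the two engines show under `(?i)`.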
| 74 | + |
| 75 | +### No Unicode-centric character-class set ops beyond union |
| 76 | + |
| 77 | +`regex`: Inside `[...]`, supports intersection `&&`, difference `--`, symmetric difference `~~`, and nested classes, which is very handy with Unicode properties (e.g., Greek letters only).
| 78 | + |
| 79 | +`regex-lite`: Only union is supported; `&&`, `--`, `~~` are not.
| 80 | + |
| 81 | +```rs |
| 82 | +// regex: Greek letters only (Greek ∩ Letter) |
| 83 | +let re = regex::Regex::new(r"[\p{Greek}&&\pL]+").unwrap(); // OK |
| 84 | +// regex-lite: the pattern is rejected, since neither intersection nor \p{...} is supported
| 85 | +assert!(regex_lite::Regex::new(r"[\p{Greek}&&\pL]+").is_err());
| 86 | +``` |
| 87 | + |
| 88 | +### No "Unicode Perl classes" feature or Unicode word-boundary tables |
| 89 | + |
| 90 | +regex: Its Unicode feature set includes dedicated data for Unicode-aware `\w`, `\s`, `\d` and for Unicode word-boundary logic; these are part of its documented Unicode features. |
| 91 | + |
| 92 | +regex-lite: Opts out of "robust Unicode support" entirely; there are no Unicode data tables enabling those behaviors. |
| 93 | + |
| 94 | +```rs |
| 95 | +// Unicode whitespace (e.g., NO-BREAK SPACE \u{00A0}) |
| 96 | +assert!(regex::Regex::new(r"\s").unwrap().is_match("\u{00A0}")); // true |
| 97 | +assert!(!regex_lite::Regex::new(r"\s").unwrap().is_match("\u{00A0}")); // false |
| 98 | +``` |
| 99 | + |
1 | 100 | Benchmark |
2 | 101 | ========= |
3 | 102 |
4 | 103 | ```sh |
5 | 104 | $ cargo bench |
6 | 105 | ``` |
7 | 106 |
8 | | -Results: |
| 107 | +``` |
| 108 | + Finished `bench` profile [optimized] target(s) in 0.00s |
| 109 | + Running unittests src/lib.rs (target/release/deps/cgroup_parse_bench-c1b1b28a53195300) |
| 110 | +
| 111 | +running 18 tests |
| 112 | +test benches::cdefine_hand_long ... bench: 1,356,066.66 ns/iter (+/- 117,849.80) |
| 113 | +test benches::cdefine_hand_med ... bench: 1,329,366.68 ns/iter (+/- 132,407.79) |
| 114 | +test benches::cdefine_hand_short ... bench: 1,286,426.56 ns/iter (+/- 125,846.09) |
| 115 | +test benches::cdefine_regex_lite_long ... bench: 21,279,454.20 ns/iter (+/- 549,742.88) |
| 116 | +test benches::cdefine_regex_lite_med ... bench: 21,216,983.30 ns/iter (+/- 721,263.72) |
| 117 | +test benches::cdefine_regex_lite_short ... bench: 20,854,745.80 ns/iter (+/- 783,203.29) |
| 118 | +test benches::cdefine_regex_long ... bench: 2,671,409.38 ns/iter (+/- 69,159.82) |
| 119 | +test benches::cdefine_regex_med ... bench: 2,654,479.15 ns/iter (+/- 97,009.58) |
| 120 | +test benches::cdefine_regex_short ... bench: 2,668,466.60 ns/iter (+/- 67,595.87) |
| 121 | +test benches::cgroup_hand_long ... bench: 1,281,364.55 ns/iter (+/- 57,103.09) |
| 122 | +test benches::cgroup_hand_med ... bench: 1,121,480.21 ns/iter (+/- 41,114.21) |
| 123 | +test benches::cgroup_hand_short ... bench: 1,095,742.20 ns/iter (+/- 44,502.66) |
| 124 | +test benches::cgroup_regex_lite_long ... bench: 256,230,283.40 ns/iter (+/- 3,334,367.42) |
| 125 | +test benches::cgroup_regex_lite_med ... bench: 114,994,416.60 ns/iter (+/- 1,492,667.53) |
| 126 | +test benches::cgroup_regex_lite_short ... bench: 57,620,791.60 ns/iter (+/- 1,935,595.32) |
| 127 | +test benches::cgroup_regex_long ... bench: 20,438,225.10 ns/iter (+/- 797,679.19) |
| 128 | +test benches::cgroup_regex_med ... bench: 10,140,350.00 ns/iter (+/- 607,504.16) |
| 129 | +test benches::cgroup_regex_short ... bench: 5,910,920.80 ns/iter (+/- 278,048.02) |
| 130 | +
| 131 | +test result: ok. 0 passed; 0 failed; 0 ignored; 18 measured; 0 filtered out; finished in 197.21s |
| 132 | +``` |
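
To read the numbers above as relative slowdowns rather than raw timings, a small helper such as the following can be used. The function name `slowdown` is made up for this sketch; the inputs in the note below are copied from the output above.

```rust
/// Slowdown factor of `candidate_ns` relative to `baseline_ns`
/// (both in ns/iter, as printed by `cargo bench`).
fn slowdown(candidate_ns: f64, baseline_ns: f64) -> f64 {
    candidate_ns / baseline_ns
}
```

For example, `slowdown(256_230_283.40, 20_438_225.10)` (`cgroup_regex_lite_long` vs `cgroup_regex_long`) comes out at roughly 12.5, and `slowdown(21_279_454.20, 2_671_409.38)` (`cdefine_regex_lite_long` vs `cdefine_regex_long`) at roughly 8.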