Commit 71fad1e

feat(regex-experiments): readme
1 parent 42aaebb commit 71fad1e

1 file changed

+125
-1
lines changed

regex-experiments/README.md

Lines changed: 125 additions & 1 deletion
@@ -1,8 +1,132 @@
What changes between regex and regex-lite
=========================================

Performance loss
----------------

Refer to the [Benchmark](#benchmark) section for specific numbers.

Artifact size reduction
-----------------------

Refer to the [PR comment](https://github.com/DataDog/libdatadog/pull/1232#issuecomment-3318665873) to evaluate the impact.

Unicode correctness
-------------------

This one needs more detail since it's not just numbers. Basically, Unicode edge cases (especially around characters made of multiple code points) contribute a lot to the complexity, and thus the size, of the automata.

### What regex-lite still does for Unicode

Both engines fundamentally match Unicode scalar values (code points) in `&str` haystacks: `.` matches a full code point (not a single byte) unless you explicitly disable Unicode in `regex`. `regex-lite` can therefore still match arbitrary non-ASCII characters literally; it just lacks the higher-level Unicode semantics described in the following sections.

```rs
// Literal Unicode still works in both
assert!(regex::Regex::new("ΔδΔ").unwrap().is_match("xxΔδΔyy"));
assert!(regex_lite::Regex::new("ΔδΔ").unwrap().is_match("xxΔδΔyy"));
```
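To make the "code point, not byte" distinction concrete, here is a small std-only sketch that involves neither regex crate. It compares the UTF-8 byte length of a haystack with its number of Unicode scalar values. The helper name `byte_and_scalar_counts` is made up for illustration.

```rust
/// Returns (UTF-8 byte length, number of Unicode scalar values) for `haystack`.
/// Hypothetical helper for illustration; not part of regex or regex-lite.
fn byte_and_scalar_counts(haystack: &str) -> (usize, usize) {
    (haystack.len(), haystack.chars().count())
}
```

For `"ΔδΔ"` this gives 6 bytes but 3 scalar values, which is the unit `.` consumes when it matches a full code point.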

### No Unicode properties (\p{...} / \P{...})

`regex`: Supports Unicode general categories, scripts, script extensions, ages and many boolean properties via `\p{...}` / `\P{...}` (e.g., `\p{Letter}`, `\p{Greek}`, `\p{Emoji}`, `\p{Age:6.0}`), and lets you combine them inside character classes.

`regex-lite`: Does not support `\p{...}` / `\P{...}` at all; `Regex::new` returns an error for patterns that use them.

```rs
// regex: OK, matches characters in the Greek script
let re = regex::Regex::new(r"\p{Greek}+").unwrap();
assert!(re.is_match("ΔδΔ"));

// regex-lite: \p{...} is unsupported, so Regex::new returns Err at runtime
// (calling .unwrap() on it would panic; this is not a Rust compile error)
assert!(regex_lite::Regex::new(r"\p{Greek}+").is_err());
```

### ASCII-only "Perl classes" (`\w`, `\d`, `\s`) and word boundaries

`regex`: In Unicode mode (the default), `\w`, `\d`, `\s` are Unicode-aware, and `\b` / `\B` use Unicode's notion of "word" characters. ASCII-only variants are opt-in via `(?-u:...)`.

`regex-lite`: `\w`, `\d`, `\s` are ASCII-only (`\w` = `[0-9A-Za-z_]`, `\d` = `[0-9]`, `\s` = `[\t\n\v\f\r ]`). Since `\w` is ASCII-only, word boundaries are effectively ASCII-only too: any non-ASCII character counts as a non-word character.

```rs
// \w on non-ASCII letters
assert!(regex::Regex::new(r"^\w+$").unwrap().is_match("résumé")); // true (Unicode-aware)
assert!(!regex_lite::Regex::new(r"^\w+$").unwrap().is_match("résumé")); // false (ASCII-only)

// \d on non-ASCII digits (e.g., Devanagari '३')
assert!(regex::Regex::new(r"^\d$").unwrap().is_match("३")); // true
assert!(!regex_lite::Regex::new(r"^\d$").unwrap().is_match("३")); // false

// \b word boundary with non-ASCII characters right after "word".
// Under Unicode \w those Arabic characters are word characters, so regex sees
// no boundary after "d"; regex-lite treats them as non-word characters.
assert!(!regex::Regex::new(r"\bword\b").unwrap().is_match("… wordًا …")); // false
assert!(regex_lite::Regex::new(r"\bword\b").unwrap().is_match("… wordًا …")); // true
```
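The ASCII-only versus Unicode-aware split can also be sketched with the standard library alone, without either regex crate. The function names below are hypothetical, and `char::is_alphanumeric` is only an approximation of `regex`'s Unicode `\w` (for example, it does not cover combining marks), so treat this as an analogy rather than a description of how either crate works.

```rust
/// ASCII-only word character, mirroring the `[0-9A-Za-z_]` definition above.
fn is_ascii_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Rough Unicode-aware counterpart built on std.
/// Assumption: this is narrower than regex's Unicode `\w`.
fn is_unicode_wordish(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// True if `s` is non-empty and every char satisfies `pred`,
/// loosely analogous to matching `^\w+$`.
fn all_chars(s: &str, pred: fn(char) -> bool) -> bool {
    !s.is_empty() && s.chars().all(pred)
}
```

With these helpers, `all_chars("résumé", is_unicode_wordish)` holds while `all_chars("résumé", is_ascii_word)` does not, which matches the `\w` behavior shown above.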

### No Unicode-aware case-insensitive matching

`regex`: `(?i)` is Unicode-aware (it uses Unicode "simple case folding"). E.g., `Δ` matches `δ` under `(?i)`.

`regex-lite`: `(?i)` is ASCII-only; non-ASCII letters won't fold.
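
```rs
// Anchored so that the whole haystack, including the lowercase δ, has to match.
// Unanchored, Δ+ would already match the first literal Δ in both engines.
assert!(regex::Regex::new(r"(?i)^Δ+$").unwrap().is_match("ΔδΔ")); // true
assert!(!regex_lite::Regex::new(r"(?i)^Δ+$").unwrap().is_match("ΔδΔ")); // false
```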
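For intuition, the same split exists in the standard library. The sketch below contrasts ASCII-only case-insensitive comparison with a Unicode-aware approximation. The function names are hypothetical, and `str::to_lowercase` is not identical to the "simple case folding" that `regex` uses, so this is only an analogy.

```rust
/// ASCII-only comparison: only A-Z / a-z are folded, similar in spirit to
/// the ASCII-only `(?i)` described above.
fn eq_ignore_case_ascii(a: &str, b: &str) -> bool {
    a.eq_ignore_ascii_case(b)
}

/// Unicode-aware approximation via lowercasing.
/// Assumption: close enough to case folding for simple letters such as Δ/δ.
fn eq_ignore_case_unicode_approx(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}
```

`eq_ignore_case_ascii("Δ", "δ")` is false while `eq_ignore_case_unicode_approx("Δ", "δ")` is true; for plain ASCII input such as `"Regex"` and `"rEGEX"` both agree.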

### No Unicode-centric character-class set ops beyond union

`regex`: Inside `[...]`, supports intersection `&&`, difference `--`, symmetric difference `~~`, and nested classes, which is very handy with Unicode properties (e.g., Greek letters only).

`regex-lite`: Only union is supported; `&&`, `--`, `~~` are not.

```rs
// regex: Greek letters only (Greek ∩ Letter)
let re = regex::Regex::new(r"[\p{Greek}&&\pL]+").unwrap(); // OK
// regex-lite: intersection and \p{...} are both unsupported,
// so Regex::new returns Err at runtime
assert!(regex_lite::Regex::new(r"[\p{Greek}&&\pL]+").is_err());
```

### No "Unicode Perl classes" feature or Unicode word-boundary tables

`regex`: Its Unicode feature set includes dedicated data tables for Unicode-aware `\w`, `\s`, `\d` and for Unicode word-boundary logic; these are part of its documented Unicode features.

`regex-lite`: Opts out of "robust Unicode support" entirely; it has no Unicode data tables to enable those behaviors.

```rs
// Unicode whitespace (e.g., NO-BREAK SPACE \u{00A0})
assert!(regex::Regex::new(r"\s").unwrap().is_match("\u{00A0}")); // true
assert!(!regex_lite::Regex::new(r"\s").unwrap().is_match("\u{00A0}")); // false
```

Benchmark
=========

```sh
$ cargo bench
```

Results:

```
Finished `bench` profile [optimized] target(s) in 0.00s
Running unittests src/lib.rs (target/release/deps/cgroup_parse_bench-c1b1b28a53195300)

running 18 tests
test benches::cdefine_hand_long        ... bench: 1,356,066.66 ns/iter (+/- 117,849.80)
test benches::cdefine_hand_med         ... bench: 1,329,366.68 ns/iter (+/- 132,407.79)
test benches::cdefine_hand_short       ... bench: 1,286,426.56 ns/iter (+/- 125,846.09)
test benches::cdefine_regex_lite_long  ... bench: 21,279,454.20 ns/iter (+/- 549,742.88)
test benches::cdefine_regex_lite_med   ... bench: 21,216,983.30 ns/iter (+/- 721,263.72)
test benches::cdefine_regex_lite_short ... bench: 20,854,745.80 ns/iter (+/- 783,203.29)
test benches::cdefine_regex_long       ... bench: 2,671,409.38 ns/iter (+/- 69,159.82)
test benches::cdefine_regex_med        ... bench: 2,654,479.15 ns/iter (+/- 97,009.58)
test benches::cdefine_regex_short      ... bench: 2,668,466.60 ns/iter (+/- 67,595.87)
test benches::cgroup_hand_long         ... bench: 1,281,364.55 ns/iter (+/- 57,103.09)
test benches::cgroup_hand_med          ... bench: 1,121,480.21 ns/iter (+/- 41,114.21)
test benches::cgroup_hand_short        ... bench: 1,095,742.20 ns/iter (+/- 44,502.66)
test benches::cgroup_regex_lite_long   ... bench: 256,230,283.40 ns/iter (+/- 3,334,367.42)
test benches::cgroup_regex_lite_med    ... bench: 114,994,416.60 ns/iter (+/- 1,492,667.53)
test benches::cgroup_regex_lite_short  ... bench: 57,620,791.60 ns/iter (+/- 1,935,595.32)
test benches::cgroup_regex_long        ... bench: 20,438,225.10 ns/iter (+/- 797,679.19)
test benches::cgroup_regex_med         ... bench: 10,140,350.00 ns/iter (+/- 607,504.16)
test benches::cgroup_regex_short       ... bench: 5,910,920.80 ns/iter (+/- 278,048.02)

test result: ok. 0 passed; 0 failed; 0 ignored; 18 measured; 0 filtered out; finished in 197.21s
```
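To read the performance loss off these numbers, one option is to divide the `regex-lite` figure by the matching `regex` figure. The sketch below does that for the two `_long` cases, using the mean ns/iter values copied from the output above. The constant and function names are hypothetical, and the reported +/- spread is ignored.

```rust
// Mean ns/iter values copied from the `cargo bench` output above.
const CDEFINE_REGEX_LONG_NS: f64 = 2_671_409.38;
const CDEFINE_REGEX_LITE_LONG_NS: f64 = 21_279_454.20;
const CGROUP_REGEX_LONG_NS: f64 = 20_438_225.10;
const CGROUP_REGEX_LITE_LONG_NS: f64 = 256_230_283.40;

/// How many times slower `candidate_ns` is than `baseline_ns`.
fn slowdown(baseline_ns: f64, candidate_ns: f64) -> f64 {
    candidate_ns / baseline_ns
}
```

`slowdown(CDEFINE_REGEX_LONG_NS, CDEFINE_REGEX_LITE_LONG_NS)` comes out at roughly 8x, and `slowdown(CGROUP_REGEX_LONG_NS, CGROUP_REGEX_LITE_LONG_NS)` at roughly 12.5x, for this particular run.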
