Commit 7691e49

cli: include \w, \s and \d in Unicode data table generation
This was an oversight when porting the old generator shell script to regex-cli. It hasn't been an issue so far because I don't think we've generated data for a new release of Unicode with this new infrastructure yet. It was flagged by unit tests that failed because \d was no longer a subset of \w.
1 parent b790aa5 commit 7691e49
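
The subset invariant mentioned in the commit message can be illustrated with a minimal sketch. This is not the actual regex-syntax test. The function name `is_subset` and the three tiny ASCII-only tables are hypothetical stand-ins for the generated Unicode tables, and the "stale table" scenario is an assumption about how such a failure could arise when one table is regenerated and another is not.

```rust
// Hypothetical miniature tables: sorted, non-overlapping, non-adjacent
// inclusive ranges, in the same spirit as generated `--chars` tables.
const DECIMAL: &[(char, char)] = &[('0', '9')];
const WORD: &[(char, char)] = &[('0', '9'), ('A', 'Z'), ('_', '_'), ('a', 'z')];
// Assumed failure scenario: a word table that lacks the digits.
const STALE_WORD: &[(char, char)] = &[('A', 'Z'), ('_', '_'), ('a', 'z')];

/// Returns true if every range in `sub` is fully covered by `sup`.
/// Both tables must be sorted, with non-overlapping, non-adjacent ranges.
fn is_subset(sub: &[(char, char)], sup: &[(char, char)]) -> bool {
    let mut i = 0;
    for &(lo, hi) in sub {
        // Skip superset ranges that end before this range starts.
        while i < sup.len() && sup[i].1 < lo {
            i += 1;
        }
        // The range must fit inside a single superset range.
        if i == sup.len() || sup[i].0 > lo || sup[i].1 < hi {
            return false;
        }
    }
    true
}

fn main() {
    println!("\\d is a subset of \\w: {}", is_subset(DECIMAL, WORD));
    println!("\\d is a subset of stale \\w: {}", is_subset(DECIMAL, STALE_WORD));
}
```

A check of this shape only passes when the tables come from the same Unicode data, which is why all three Perl classes need to be generated together.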

1 file changed: +17 -0 lines changed

regex-cli/cmd/generate/unicode.rs

@@ -84,6 +84,23 @@ USAGE:
     gen(d.join("sentence_break.rs"), &["sentence-break", &ucd, "--chars"])?;
     gen(d.join("word_break.rs"), &["word-break", &ucd, "--chars"])?;
 
+    // These generate the \w, \d and \s Unicode-aware character classes for
+    // regex-syntax. \d and \s are technically part of the general category
+    // and boolean properties generated above. However, these are generated
+    // separately to make it possible to enable or disable them via Cargo
+    // features independently of whether all boolean properties or general
+    // categories are enabled or disabled. The crate ensures that only one copy
+    // is compiled.
+    gen(d.join("perl_word.rs"), &["perl-word", &ucd, "--chars"])?;
+    gen(
+        d.join("perl_decimal.rs"),
+        &["general-category", &ucd, "--chars", "--include", "decimalnumber"],
+    )?;
+    gen(
+        d.join("perl_space.rs"),
+        &["property-bool", &ucd, "--chars", "--include", "whitespace"],
+    )?;
+
     // Data tables for regex-automata.
     let d = out
         .join("regex-automata")
