Commit 557ff32

Merge pull request #1 from kynx/fix-word-separation
Fix word separation with snake_case
2 parents 8066ba5 + 84df929 commit 557ff32

8 files changed: +228 -110 lines changed

README.md

Lines changed: 25 additions & 14 deletions
@@ -7,18 +7,18 @@ Utilities for generating PHP code.
 
 ## Normalizers
 
-The normalizers generate PHP labels (class names, namespaces, property names, etc) from valid UTF-8 strings,
+The normalizers generate readable PHP labels (class names, namespaces, property names, etc) from valid UTF-8 strings,
 [transliterating] them to ASCII and spelling out any invalid characters.
 
 ### Usage:
 
-The following code (forgive the Japanese - a certain translation tool tells me it means "Pet Shop"):
+The following code (forgive the Japanese - a certain translation tool tells me it means "Pet Store"):
 ```php
 <?php
 
 use Kynx\CodeUtls\ClassNameNormalizer;
 
-$normalizer = new ClassNameNormalizer();
+$normalizer = new ClassNameNormalizer('Controller');
 $namespace = $normalizer->normalize('ペット \ ショップ');
 echo $namespace;
 ```
@@ -48,23 +48,34 @@ See the [tests] for more examples.
 
 ### Why?
 
-You should never generate code from untrusted user input. But there are a few cases where you may want to do it with
-mostly-trusted input. In my case, it's generating classes and properties from an OpenAPI specification, where there are
-no restrictions on the characters present.
+You must **never** run code generated from untrusted user input. But there are a few cases where you do want to
+_output_ code generated from (mostly) trusted input.
+
+In my case, I need to generate classes and properties from an OpenAPI specification. There are no hard-and-fast rules
+on the characters present, just a vague "it is RECOMMENDED to follow common programming naming conventions". Whatever
+they are.
 
 ### How?
 
-`AbstractNormalizer` uses `ext-intl`'s [Transliterator] to perform the transliteration. Where a character has no
+Each normalizer uses `ext-intl`'s [Transliterator] to turn the UTF-8 string into Latin-ASCII. Where a character has no
 equivalent in ASCII (the "€" symbol is a good example), it uses the [Unicode name] of the character to spell it out (to
-`Euro`). For ASCII characters that are not valid in a PHP label, it provides it's own spell outs: for instance, a
-backtick "`" becomes `Backtick`.
+`Euro`, after some minor clean-up). For ASCII characters that are not valid in a PHP label, it provides its own spell
+outs. For instance, a backtick "&#96;" becomes `Backtick`.
+
+Initial digits are also spelt out: "123foo" becomes `OneTwoThreeFoo`. Finally reserved words are suffixed with a
+user-supplied string so they don't mess things up. In the first usage example above, if we normalized "class" it would
+become `ClassController`.
+
+The results may not be pretty. If for some mad reason your input contains ` ͖` - put your glasses on! - the label will
+contain `CombiningRightArrowheadAndUpArrowheadBelow`. But it _is_ valid PHP, and stands a chance of being as unique as
+the original. But speaking of which...
 
-Initial digits are also spelt out - "123 foo" becomes `OneTwoThreeFoo` - and finally reserved words are suffixed with a
-user-supplied string so they don't mess things up: "class" can become `ClassController`.
+### Uniqueness
 
-The results may not be pretty. For instance, if your input contains ` ͖` - put your glasses on! - the class name will
-contain `CombiningRightArrowheadAndUpArrowheadBelow`. But it _is_ valid PHP, and stands a good chance of being as unique
-as the original.
+The normalization process reduces around a million Unicode code points down to just 162 ASCII characters. Then we mangle
+it further by stripping separators, reducing whitespace and turning it into camelCase, snake_case or whatever
+your programming preference. It's gonna be lossy - nothing we can do about that. Ideally this library would provide a
+utility for guaranteeing uniqueness across a set of labels, but I haven't written it yet. Feel free to contribute!
 
 
 [transliterating]: https://unicode-org.github.io/icu/userguide/transforms/general/#script-transliteration
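
Reviewer note on the README's new "Uniqueness" section: the commit itself adds no helper for keeping a set of labels unique. The following is only a rough sketch of what one could look like, not code from this library. The function name `makeUniqueSketch` and the numeric-suffix strategy are assumptions made for illustration. It compares labels case-insensitively because PHP class names are case-insensitive; property names are case-sensitive, so a real helper might need to make that configurable.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): appends a numeric suffix
 * to any label that collides, case-insensitively, with an earlier one.
 *
 * @param list<string> $labels
 * @return list<string>
 */
function makeUniqueSketch(array $labels): array
{
    $seen   = [];
    $unique = [];

    foreach ($labels as $label) {
        $candidate = $label;
        $counter   = 1;

        // Keep bumping the suffix until the candidate has not been used yet.
        while (isset($seen[strtolower($candidate)])) {
            $candidate = $label . ++$counter;
        }

        $seen[strtolower($candidate)] = true;
        $unique[]                     = $candidate;
    }

    return $unique;
}

// Prints "PetShop, petShop2, PetShop3"
echo implode(', ', makeUniqueSketch(['PetShop', 'petShop', 'PetShop'])), "\n";
```

The order of the input is preserved, so the first occurrence of a label always keeps its original spelling.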

phpunit.xml.dist

Lines changed: 6 additions & 0 deletions
@@ -23,4 +23,10 @@
             <directory suffix=".php">src</directory>
         </include>
     </coverage>
+
+    <php>
+        <!-- Seems to be needed by CI's PHP8.2-RC1? Not needed in PHP8.2-dev locally! -->
+        <ini name="assert.exception" value="1" />
+        <ini name="assert.warning" value="0" />
+    </php>
 </phpunit>

src/AbstractNormalizer.php

Lines changed: 82 additions & 70 deletions
@@ -9,10 +9,11 @@
 use IntlCodePointBreakIterator;
 use Transliterator;
 
+use function array_filter;
 use function array_map;
 use function array_shift;
+use function array_slice;
 use function assert;
-use function count;
 use function explode;
 use function implode;
 use function in_array;
@@ -27,7 +28,6 @@
 use function strtolower;
 use function substr;
 use function trim;
-use function ucfirst;
 
 /**
  * Utility for generating valid PHP labels from UTF-8 strings
@@ -132,67 +132,66 @@ abstract class AbstractNormalizer implements NormalizerInterface
     ];
 
     private const ASCII_SPELLOUT = [
-        1 => 'StartOfHeader',
-        2 => 'StartOfText',
-        3 => 'EndOfText',
-        4 => 'EndOfTransmission',
+        1 => 'Start Of Header',
+        2 => 'Start Of Text',
+        3 => 'End Of Text',
+        4 => 'End Of Transmission',
         5 => 'Enquiry',
         6 => 'Acknowledgement',
         7 => 'Bell',
         8 => 'Backspace',
-        9 => 'HorizontalTab',
-        10 => 'LineFeed',
-        11 => 'VerticalTab',
-        12 => 'FormFeed',
-        13 => 'CarriageReturn',
-        14 => 'ShiftOut',
-        15 => 'ShiftIn',
-        16 => 'DataLinkEscape',
-        17 => 'DeviceControlOne',
-        18 => 'DeviceControlTwo',
-        19 => 'DeviceControlThree',
-        20 => 'DeviceControlFour',
-        21 => 'NegativeAcknowledgement',
-        22 => 'SynchronousIdle',
-        23 => 'EndOfTransmissionBlock',
+        9 => 'Horizontal Tab',
+        10 => 'Line Feed',
+        11 => 'Vertical Tab',
+        12 => 'Form Feed',
+        13 => 'Carriage Return',
+        14 => 'Shift Out',
+        15 => 'Shift In',
+        16 => 'Data Link Escape',
+        17 => 'Device Control One',
+        18 => 'Device Control Two',
+        19 => 'Device Control Three',
+        20 => 'Device Control Four',
+        21 => 'Negative Acknowledgement',
+        22 => 'Synchronous Idle',
+        23 => 'End Of Transmission Block',
         24 => 'Cancel',
-        25 => 'EndOfMedium',
+        25 => 'End Of Medium',
         26 => 'Substitute',
         27 => 'Escape',
-        28 => 'FileSeparator',
-        29 => 'GroupSeparator',
-        30 => 'RecordSeparator',
-        31 => 'UnitSeparator',
-        32 => 'Space',
+        28 => 'File Separator',
+        29 => 'Group Separator',
+        30 => 'Record Separator',
+        31 => 'Unit Separator',
         33 => 'Exclamation',
-        34 => 'DoubleQuote',
+        34 => 'Double Quote',
         35 => 'Number',
         36 => 'Dollar',
         37 => 'Percent',
         38 => 'Ampersand',
         39 => 'Quote',
-        40 => 'OpenBracket',
-        41 => 'CloseBracket',
+        40 => 'Open Bracket',
+        41 => 'Close Bracket',
         42 => 'Asterisk',
         43 => 'Plus',
         44 => 'Comma',
-        46 => 'FullStop',
+        46 => 'Full Stop',
         47 => 'Slash',
         58 => 'Colon',
         59 => 'Semicolon',
-        60 => 'LessThan',
+        60 => 'Less Than',
         61 => 'Equals',
-        62 => 'GreaterThan',
-        63 => 'QuestionMark',
+        62 => 'Greater Than',
+        63 => 'Question Mark',
         64 => 'At',
-        91 => 'OpenSquare',
+        91 => 'Open Square',
         92 => 'Backslash',
-        93 => 'CloseSquare',
+        93 => 'Close Square',
         94 => 'Caret',
         96 => 'Backtick',
-        123 => 'OpenCurly',
+        123 => 'Open Curly',
         124 => 'Pipe',
-        125 => 'CloseCurly',
+        125 => 'Close Curly',
         126 => 'Tilde',
         127 => 'Delete',
     ];
@@ -252,30 +251,36 @@ protected function toAscii(string $string): string
         return $this->spellOutNonAscii(implode(' ', $words));
     }
 
-    protected function separatorsToUnderscore(string $string): string
+    protected function separatorsToSpace(string $string): string
     {
-        return preg_replace('/[' . $this->separators . '\s]+/', '_', trim($string));
+        return preg_replace('/[' . $this->separators . '\s_]+/', ' ', trim($string));
     }
 
     protected function spellOutAscii(string $string): string
     {
-        $chunks = str_split($string);
-        $last = count($chunks) - 1;
-        foreach (str_split($string) as $i => $char) {
-            if (isset(self::ASCII_SPELLOUT[ord($char)])) {
-                $char = self::ASCII_SPELLOUT[ord($char)] . ($i < $last ? '_' : '');
+        $speltOut = [];
+        $current = '';
+
+        foreach (str_split($string) as $char) {
+            $ord = ord($char);
+            if (! isset(self::ASCII_SPELLOUT[$ord])) {
+                $current .= $char;
+                continue;
             }
-            $chunks[$i] = $char;
+
+            $speltOut[] = $current;
+            $speltOut[] = self::ASCII_SPELLOUT[$ord];
+            $current = '';
         }
+        $speltOut[] = $current;
 
-        return $this->spellOutLeadingDigits(implode('', $chunks));
+        return $this->spellOutLeadingDigits(implode(' ', $speltOut));
     }
 
     protected function toCase(string $string): string
     {
-        assert(in_array($this->case, self::VALID_CASES));
-
-        $parts = explode('_', $string);
+        /** @var list<string> $parts */
+        $parts = array_filter(explode(' ', $string));
         return match ($this->case) {
             self::CAMEL_CASE => $this->toCamelCase($parts),
             self::PASCAL_CASE => $this->toPascalCase($parts),
@@ -284,11 +289,11 @@ protected function toCase(string $string): string
         };
     }
 
-    protected function sanitizeReserved(string $string, array $reserved): string
+    protected function sanitizeReserved(string $string): string
     {
         assert($this->suffix !== null);
 
-        if (in_array(strtolower($string), $reserved, true)) {
+        if (in_array(strtolower($string), self::RESERVED, true)) {
             return $string . $this->suffix;
         }
         return $string;
@@ -297,10 +302,10 @@ protected function sanitizeReserved(string $string, array $reserved): string
     private function prepareSuffix(string|null $suffix, string $case): string|null
     {
         if ($suffix === null) {
-            return $suffix;
+            return null;
         }
 
-        if ($suffix === '' || ! preg_match('/^[a-zA-Z0-9_\x80-\xff]*$/', $suffix)) {
+        if (! preg_match('/^[a-zA-Z0-9_\x80-\xff]+$/', $suffix)) {
             throw NormalizerException::invalidSuffix($suffix);
         }
 
@@ -312,46 +317,53 @@ private function prepareSuffix(string|null $suffix, string $case): string|null
 
     private function spellOutNonAscii(string $string): string
     {
-        $speltOut = '';
+        $speltOut = [];
+        $current = '';
 
         $this->codePoints->setText($string);
         /** @var string $char */
         foreach ($this->codePoints->getPartsIterator() as $char) {
-            $ord       = IntlChar::ord($char);
-            $speltOut .= $ord < 256 ? $char : $this->spellOutNonAsciiChar($ord);
+            $ord = IntlChar::ord($char);
+            if ($ord < 256) {
+                $current .= $char;
+                continue;
+            }
+
+            $speltOut[] = $current;
+            $speltOut[] = $this->spellOutNonAsciiChar($ord);
+            $current = '';
         }
+        $speltOut[] = $current;
 
-        return $speltOut;
+        return implode(' ', $speltOut);
     }
 
     private function spellOutNonAsciiChar(int $ord): string
     {
         $speltOut = IntlChar::charName($ord);
 
-        // 'EURO SIGN' -> 'Euro'
-        return implode('', array_map(function (string $part): string {
-            return $part === 'SIGN' ? '' : ucfirst(strtolower($part));
-        }, explode(" ", $speltOut)));
+        // 'EURO SIGN' -> 'euro'
+        return implode(' ', array_map(function (string $part): string {
+            return $part === 'SIGN' ? '' : strtolower($part);
+        }, explode(' ', $speltOut)));
     }
 
     private function spellOutLeadingDigits(string $string): string
     {
-        $chunks = str_split($string);
+        $speltOut = [];
+        $chunks   = str_split($string);
         foreach ($chunks as $i => $char) {
-            if ($i > 1 && $char === '_') {
-                $chunks[$i] = '';
-                break;
-            }
-
             $ord = ord($char);
+
             if (! isset(self::DIGIT_SPELLOUT[$ord])) {
+                $speltOut[] = implode('', array_slice($chunks, $i));
                 break;
             }
 
-            $chunks[$i] = self::DIGIT_SPELLOUT[$ord] . '_';
+            $speltOut[] = self::DIGIT_SPELLOUT[$ord];
         }
 
-        return implode('', $chunks);
+        return implode(' ', $speltOut);
     }
 
     /**
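
Reviewer note: to make the new space-separated behaviour of `spellOutLeadingDigits()` easier to check by eye, here is a standalone approximation. It is a sketch, not the library's code: the `DIGIT_SPELLOUT` table is not shown in this diff, so `DIGIT_SPELLOUT_SKETCH` below is an assumed stand-in keyed by byte value, and the function name is hypothetical.

```php
<?php

declare(strict_types=1);

// Assumed shape of the library's DIGIT_SPELLOUT table; the real constant is not shown in the diff.
const DIGIT_SPELLOUT_SKETCH = [
    48 => 'Zero', 49 => 'One', 50 => 'Two', 51 => 'Three', 52 => 'Four',
    53 => 'Five', 54 => 'Six', 55 => 'Seven', 56 => 'Eight', 57 => 'Nine',
];

// Spells out leading digits as separate words and leaves the rest untouched.
function spellOutLeadingDigitsSketch(string $string): string
{
    $speltOut = [];
    $length   = strlen($string);

    for ($i = 0; $i < $length; $i++) {
        $ord = ord($string[$i]);

        if (! array_key_exists($ord, DIGIT_SPELLOUT_SKETCH)) {
            // First non-digit: keep the remainder as one chunk and stop.
            $speltOut[] = substr($string, $i);
            break;
        }

        $speltOut[] = DIGIT_SPELLOUT_SKETCH[$ord];
    }

    // Words are joined with spaces; a later casing step removes them.
    return implode(' ', $speltOut);
}

echo spellOutLeadingDigitsSketch('123foo'), "\n"; // One Two Three foo
echo spellOutLeadingDigitsSketch('foo123'), "\n"; // foo123
```

Emitting words separated by spaces, instead of appending `_` to each spelt-out digit as the old code did, means the casing step only has one separator to deal with.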

src/ClassNameNormalizer.php

Lines changed: 5 additions & 5 deletions
@@ -36,11 +36,11 @@ public function normalize(string $label): string
 
     private function normalizeLabel(string $label): string
     {
-        $ascii       = $this->toAscii($label);
-        $underscored = $this->separatorsToUnderscore($ascii);
-        $speltOut    = $this->spellOutAscii($underscored);
-        $cased       = $this->toCase($speltOut);
+        $ascii    = $this->toAscii($label);
+        $spaced   = $this->separatorsToSpace($ascii);
+        $speltOut = $this->spellOutAscii($spaced);
+        $cased    = $this->toCase($speltOut);
 
-        return $this->sanitizeReserved($cased, self::RESERVED);
+        return $this->sanitizeReserved($cased);
     }
 }

src/ConstantNameNormalizer.php

Lines changed: 5 additions & 5 deletions
@@ -22,11 +22,11 @@ public function __construct(
      */
     public function normalize(string $label): string
     {
-        $ascii       = $this->toAscii($label);
-        $underscored = $this->separatorsToUnderscore($ascii);
-        $speltOut    = $this->spellOutAscii($underscored);
-        $cased       = $this->toCase($speltOut);
+        $ascii    = $this->toAscii($label);
+        $spaced   = $this->separatorsToSpace($ascii);
+        $speltOut = $this->spellOutAscii($spaced);
+        $cased    = $this->toCase($speltOut);
 
-        return $this->sanitizeReserved($cased, self::RESERVED);
+        return $this->sanitizeReserved($cased);
     }
 }

src/PropertyNameNormalizer.php

Lines changed: 3 additions & 3 deletions
@@ -21,9 +21,9 @@ public function __construct(string $case = self::CAMEL_CASE, string $separators
      */
     public function normalize(string $label): string
     {
-        $ascii       = $this->toAscii($label);
-        $underscored = $this->separatorsToUnderscore($ascii);
-        $speltOut    = $this->spellOutAscii($underscored);
+        $ascii    = $this->toAscii($label);
+        $spaced   = $this->separatorsToSpace($ascii);
+        $speltOut = $this->spellOutAscii($spaced);
 
         return $this->toCase($speltOut);
     }
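
Reviewer note: the overall effect of the commit is that separators, whitespace and underscores now all become spaces before casing, so snake_case input splits into words. The sketch below approximates that pipeline in plain PHP, without `ext-intl`. Everything in it is illustrative: `SPELLOUT_SKETCH` is a tiny stand-in for `ASCII_SPELLOUT`, the `'-'` default separator is an assumption, and `toPascalCaseSketch()` is a guess at the casing step, since the library's `toPascalCase()` is not part of this diff.

```php
<?php

declare(strict_types=1);

// Hypothetical, trimmed-down stand-in for the library's ASCII_SPELLOUT table.
const SPELLOUT_SKETCH = [
    36 => 'Dollar',
    46 => 'Full Stop',
    64 => 'At',
];

// Separators, whitespace and underscores collapse into single spaces.
function separatorsToSpaceSketch(string $string, string $separators = '-'): string
{
    $pattern = '/[' . preg_quote($separators, '/') . '\s_]+/';

    return (string) preg_replace($pattern, ' ', trim($string));
}

// Each spelt-out character becomes its own space-separated word.
function spellOutAsciiSketch(string $string): string
{
    $speltOut = [];
    $current  = '';

    foreach (str_split($string) as $char) {
        $ord = ord($char);
        if (! array_key_exists($ord, SPELLOUT_SKETCH)) {
            $current .= $char;
            continue;
        }

        $speltOut[] = $current;
        $speltOut[] = SPELLOUT_SKETCH[$ord];
        $current    = '';
    }
    $speltOut[] = $current;

    return implode(' ', $speltOut);
}

// Assumed casing step: upper-case the first letter of every non-empty word.
function toPascalCaseSketch(string $string): string
{
    // Compare against '' explicitly so that a word consisting of "0" is kept.
    $parts = array_filter(explode(' ', $string), static fn (string $part): bool => $part !== '');

    return implode('', array_map('ucfirst', $parts));
}

function normalizeSketch(string $label): string
{
    return toPascalCaseSketch(spellOutAsciiSketch(separatorsToSpaceSketch($label)));
}

echo normalizeSketch('pet_shop'), "\n";          // PetShop
echo normalizeSketch('user@example-host'), "\n"; // UserAtExampleHost
```

One design point worth flagging: `array_filter()` without a callback removes every falsy value, including the string `"0"`, which is why the sketch filters on `!== ''` instead.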
