Using ZeroTrie for property parser #5576

robertbastian · 2024-09-23T16:33:59Z

components/properties/src/names.rs

robertbastian · 2024-09-24T12:35:15Z

Manishearth · 2024-09-24T17:17:04Z

As mentioned this morning, I'm not a huge fan of the recursion, and I'm a bit wary that it could cause stack overflows with malicious data. Perhaps we can apply some recursion limit, probably by limiting the length of the input string to something reasonable.
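A minimal sketch of the length-limit idea, not the code from this PR. The `recurse` function here is a toy stand-in that only ignores underscores, and `MAX_NAME_LEN`, its value of 64, and `get_loose_bounded` are hypothetical names chosen for illustration. The point is only that capping the input length also caps the recursion depth, since each call consumes at least one byte of either the input or the trusted stored name.

```rust
/// Hypothetical cap on the input length. The value 64 is an assumption,
/// not something taken from the PR.
const MAX_NAME_LEN: usize = 64;

/// Toy recursive matcher standing in for the real trie walk: compares
/// `name` against `target`, ignoring underscores on either side.
/// Every call consumes one byte of `name` or `target`, so the depth is
/// at most `name.len() + target.len()`.
fn recurse(name: &[u8], target: &[u8]) -> bool {
    match (name.split_first(), target.split_first()) {
        (None, None) => true,
        // Skip an underscore in the user's input.
        (Some((&b'_', rest)), _) => recurse(rest, target),
        // Skip an underscore in the stored name.
        (_, Some((&b'_', rest))) => recurse(name, rest),
        (Some((a, n)), Some((b, t))) if a == b => recurse(n, t),
        _ => false,
    }
}

/// Rejects over-long input before any recursion happens.
/// Returns `None` when the input exceeds the cap.
fn get_loose_bounded(name: &[u8], target: &[u8]) -> Option<bool> {
    if name.len() > MAX_NAME_LEN {
        return None;
    }
    Some(recurse(name, target))
}
```

With this shape, a pathological input such as a thousand underscores is rejected up front instead of producing a thousand nested calls.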

sffc · 2024-09-24T18:01:19Z

components/properties/src/names.rs

+            skip_cursor.step(skip);
+            if let Some(r) = recurse(skip_cursor, name) {
+                return Some(r);
+            }


Nit: It's probably faster to check for None in the return value of .step in order to avoid the extra recursion call. In most cases the step will be None.

.step does not have a return value, though. Moving the empty check in front of every recursive call complicates the logic, and I'm not convinced the extra call costs much.

Oh, it's only ZeroAsciiIgnoreCaseTrie that returns a value in .step. I would approve a two-line change to make ZeroTrieSimpleAsciiCursor also return a value in .step.

It could return Option<u8>, Option<()>, or bool.
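To illustrate the suggestion, here is a rough sketch using a toy cursor. `ToyCursor` and `lookup` are hypothetical and are not the zerotrie API; the cursor simply tracks the suffixes of the keys that still match. What it shows is the control flow being proposed: when `step` reports whether the byte was found, the caller can skip the recursive call in the common case where it was not.

```rust
/// Hypothetical stand-in for a trie cursor (not the zerotrie API).
/// Holds the remaining suffix of every key that matches the bytes
/// consumed so far.
#[derive(Clone)]
struct ToyCursor<'a> {
    keys: Vec<&'a [u8]>,
}

impl<'a> ToyCursor<'a> {
    fn new(keys: &[&'a [u8]]) -> Self {
        Self { keys: keys.to_vec() }
    }

    /// Consumes `byte`. Returns `Some(byte)` if at least one key
    /// continues with it; otherwise the cursor is empty and this
    /// returns `None`.
    fn step(&mut self, byte: u8) -> Option<u8> {
        self.keys = self
            .keys
            .iter()
            .copied()
            .filter_map(|k| match k.split_first() {
                Some((&b, rest)) if b == byte => Some(rest),
                _ => None,
            })
            .collect();
        if self.keys.is_empty() { None } else { Some(byte) }
    }

    /// True if some key has been fully consumed.
    fn is_match(&self) -> bool {
        self.keys.iter().any(|k| k.is_empty())
    }
}

/// Looks up `name`, allowing underscores in the stored keys to be skipped.
fn lookup(cursor: ToyCursor<'_>, name: &[u8]) -> bool {
    // Recurse on the skip path only if the underscore was actually found.
    let mut skip_cursor = cursor.clone();
    if skip_cursor.step(b'_').is_some() && lookup(skip_cursor, name) {
        return true;
    }
    match name.split_first() {
        None => cursor.is_match(),
        Some((&first, rest)) => {
            let mut next = cursor;
            next.step(first).is_some() && lookup(next, rest)
        }
    }
}
```

A bool return would work the same way here; `Option<u8>` is used only because it is one of the options listed above.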

sffc · 2024-09-24T18:01:42Z

components/properties/src/names.rs

+        }
+
+        // Skip whitespace, underscore, hyphen in trie.
+        for skip in [b'\t', b'\n', b'\x0C', b'\r', b' ', 0x0B, b'_', b'-'] {


Question: do characters like \t and \n actually occur in the trie? That seems unlikely.

Not in today's compiled data, but according to the spec this is what we have to do.

Checking the cursor for these characters is fairly cheap, and recursion only happens if a character is actually found.

Citation in the spec?

You say "fairly cheap", but this is sort of a hot path (for example, regex and Unicode set parsing), and every one of these checks requires a function call, and function calls are not free. My guess is that these extra function calls together make the function about 2x as slow. I could be wrong.

I also find it incredibly unlikely that the UCD would add a canonical property name containing characters like \t and \n. I understand skipping those in the user's string, but not in the trie string.
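The asymmetry described here can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the helper names are hypothetical, the comparison is against a single stored name rather than a trie, and it assumes stored names only ever contain underscore and hyphen as separators. Case handling is left out because it is not part of this thread.

```rust
/// Bytes ignored in the user's input: the same set as the
/// `for skip in [...]` loop in the diff above.
fn is_ignorable_in_input(b: u8) -> bool {
    matches!(b, b'\t' | b'\n' | 0x0B | 0x0C | b'\r' | b' ' | b'_' | b'-')
}

/// Assumption for this sketch: stored names contain only these
/// separators, so whitespace is never skipped on the stored side.
fn is_separator_in_stored(b: u8) -> bool {
    matches!(b, b'_' | b'-')
}

/// Compares the user's input with a stored name without recursion,
/// by filtering each side with its own skip set.
fn loose_eq(input: &[u8], stored: &[u8]) -> bool {
    let lhs = input.iter().copied().filter(|&b| !is_ignorable_in_input(b));
    let rhs = stored.iter().copied().filter(|&b| !is_separator_in_stored(b));
    lhs.eq(rhs)
}
```

Under that assumption the stored side needs two checks per position instead of eight, which is the cost difference being argued about.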

components/properties/src/names.rs

sffc

I have concerns about the recurse function, but they can be addressed in a follow-up. I think overall this is the right direction, and we can make it faster.

sffc · 2024-09-25T15:59:10Z

components/properties/src/names.rs


sffc reviewed Sep 23, 2024

components/properties/src/names.rs (outdated)

sffc mentioned this pull request Sep 24, 2024

Add ZeroAsciiIgnoreCaseTrie::get_strict #5585

Draft

robertbastian added 2 commits September 24, 2024 09:15

data

38e6949

zerotrie

0ddcc6d

robertbastian force-pushed the propapi branch from e9be7c2 to 0ddcc6d Compare September 24, 2024 07:16

loose

a6d593b

robertbastian marked this pull request as ready for review September 24, 2024 08:18

robertbastian requested review from Manishearth and a team as code owners September 24, 2024 08:18

test

2cd552e

robertbastian requested a review from sffc September 24, 2024 08:29

sffc requested changes Sep 24, 2024

rm unwrap

5be70f2

robertbastian requested a review from sffc September 25, 2024 08:11

sffc approved these changes Sep 25, 2024

robertbastian merged commit a89dc0a into unicode-org:main Sep 25, 2024
28 checks passed

sffc mentioned this pull request Sep 25, 2024

Make property name lookup get_loose more efficient and limit recursion depth #5599

Open

robertbastian deleted the propapi branch October 17, 2024 00:58
