Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using ZeroTrie for property parser #5576

Merged
merged 5 commits into from
Sep 25, 2024

Conversation

robertbastian
Copy link
Member

@robertbastian robertbastian commented Sep 23, 2024

Fixes #4861

This comment was marked as spam.

@robertbastian robertbastian marked this pull request as ready for review September 24, 2024 08:18
@robertbastian
Copy link
Member Author

#3573

@Manishearth
Copy link
Member

As mentioned this morning, not a huge fan of the recursion and a bit wary that it can cause stack overflows with malicious data, perhaps we can have some recursion limit applied, probably by limiting the length of the input string to something reasonable.

Comment on lines 273 to 276
skip_cursor.step(skip);
if let Some(r) = recurse(skip_cursor, name) {
return Some(r);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: It's probably faster to check for None in the return value of .step in order to avoid the extra recursion call. In most cases the step will be None.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.step does not have a return value though. Moving the empty check in front of every recursive call complicates the logic and I'm not convinced there is much of a cost.

Copy link
Member

@sffc sffc Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it's only ZeroAsciiIgnoreCaseTrie that returns a value in .step. I would approve a two-line change to make ZeroTrieSimpleAsciiCursor also return a value in .step.

It could return Option<u8>, Option<()>, or bool

}

// Skip whitespace, underscore, hyphen in trie.
for skip in [b'\t', b'\n', b'\x0C', b'\r', b' ', 0x0B, b'_', b'-'] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: do these characters like \t and \n actually occur in the trie? Seems unlikely?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in today's compiled data, but according to the spec this is what we have to do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking the cursor for these characters is fairly cheap, and recursion only happens if a character is actually found.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Citation in the spec?

You say "fairly cheap", but this is sort-of a hot path (for example, regex and unicode set parsing), and every one of these requires a function call, and function calls are not free. My guess is that these extra function calls together make the function about 2x as slow. I could be wrong.

I also find it incredibly unlikely that the UCD would add a canonical property name containing characters like \t and \n. I understand skipping those in the user's string, but not in the trie string.

components/properties/src/names.rs Outdated Show resolved Hide resolved
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have concerns in the recurse function, but they can be addressed in a follow-up. I think overall this is the right direction, and we can make it faster.

}

// Skip whitespace, underscore, hyphen in trie.
for skip in [b'\t', b'\n', b'\x0C', b'\r', b' ', 0x0B, b'_', b'-'] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Citation in the spec?

You say "fairly cheap", but this is sort-of a hot path (for example, regex and unicode set parsing), and every one of these requires a function call, and function calls are not free. My guess is that these extra function calls together make the function about 2x as slow. I could be wrong.

I also find it incredibly unlikely that the UCD would add a canonical property name containing characters like \t and \n. I understand skipping those in the user's string, but not in the trie string.

@robertbastian robertbastian merged commit a89dc0a into unicode-org:main Sep 25, 2024
28 checks passed
@robertbastian robertbastian deleted the propapi branch October 17, 2024 00:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Property name-to-enum mapper should use ZeroTrie internally
3 participants