Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address infinite loops in Transliterator runtime #3974

Open
skius opened this issue Aug 30, 2023 · 3 comments
Open

Address infinite loops in Transliterator runtime #3974

skius opened this issue Aug 30, 2023 · 3 comments
Labels
C-transliterator Component: transliterator C-unicode Component: Props, sets, tries

Comments

@skius
Copy link
Member

skius commented Aug 30, 2023

Even if we were able to throw away all non-terminating transliterators during datagen (we can't), corrupted/malicious data could still cause infinite loops in the runtime.

I know of these ways that could happen:

  • The individual conversion rules themselves define a non-terminating transliterator, e.g., a > a|a ;. ICU4C/J solves this by applying a maximum target length of 16x the input (see here).
  • A rule based transliterator could recursively call itself, i.e., Any-Example contains the rule :: Any-Example ; (or calls a nested transliterator that contains that rule). This might be more tricky to catch at runtime
    • Another way transliterators can call themselves is through function calls, i.e., (a) > &Any-Example($1); - this is not guaranteed to be an infinite loop, though, as the recursive call only happens if the LHS is matched.
  • VarTable lookup - corrupted data could contain a cyclic VarTable, i.e., $a = $b ; $b = $a if that were to parse without errors (I catch this during datagen).
@skius skius added the C-unicode Component: Props, sets, tries label Aug 30, 2023
@robertbastian
Copy link
Member

Are the first two cases checked at datagen time?

I guess not terminating is like panicking, because eventually we'll probably run out of memory and cause an allocator panic. So we have to guard against it even at runtime.

@skius
Copy link
Member Author

skius commented Aug 30, 2023

Are the first two cases checked at datagen time?

No. Here are my thoughts:

Regarding :: Any-Example ; - we could add special cases, but I just realized that with filters, i.e., :: [filter] Any-Example ;, it's again not always a guaranteed loop. We can special case easy to detect ones (like a cycle with absent filters every step of the way), but I think the cost-benefit tradeoff is not in our favor. Because of corrupted data this has to be (assuming we want to do something against it) implemented dynamically on the runtime side again anyway. (see my update below)

Regarding a > a|a ; - similar thoughts as above, but there are so many more special cases to consider that I am even more strongly of the opinion that we shouldn't do anything at datagen time about conversion rules. Compare with the Rust compiler: what infinite loops are checked? There's the special loop and while true {}; and that's it. while 0 == 0 {}; gives zero warning from the (otherwise informative) compiler.

@skius
Copy link
Member Author

skius commented Aug 31, 2023

Another thought:
We could (should?) check VarTable recursion at construction time, even if it might cost a bit. A linked list (like struct Link<'a>(Option<(char, &'a Link)>)) can be used to non-allocatingly keep track of the visited elements using the call stack, even though repeated lookups would be quadratic.

Thinking about this some more (in particular how to avoid infinite loops in recursive transliterators), I changed my mind and I do think we should just disallow any form of cyclic recursion with transliterators, even if a specific case might not cause an infinite loop. This allows us to not have to do recursion checking during transliteration, but during construction as with VarTable. That also plays nicely with my suggestion in #3946 (comment), so we have yet another reason to do the slightly more expensive thing during construction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-transliterator Component: transliterator C-unicode Component: Props, sets, tries
Projects
None yet
Development

No branches or pull requests

3 participants