Description
A recently released paper titled Trojan Source: Invisible Vulnerabilities demonstrates an attack against source code. It uses Unicode bi-directional overrides to disguise the meaning of code to a human reader. This can lead to seemingly harmless code introducing malicious behaviour.
Crystal demonstration
The following code demonstrates a stretched-string attack in Crystal:
access_level = "user"
if access_level != "user # Check if admin "
  puts "You are an admin!"
end
The following code demonstrates a commenting-out attack in Crystal:
access_level = "user"
if access_level != "none" # Check if admin" && access_level != "user
  puts "You are an admin!"
end
Both look mostly unsuspicious, and you wouldn't expect either to print anything. But both programs actually print You are an admin! despite access_level = "user".
The second line of each program's source code contains a number of Unicode control characters for bi-directional overrides (U+202E RIGHT-TO-LEFT OVERRIDE, U+2066 LEFT-TO-RIGHT ISOLATE and U+2069 POP DIRECTIONAL ISOLATE). This is what the parser reads:
# stretched-string attack
if access_level != "user\u202E \u2066# Check if admin\u2069 \u2066"
# commenting-out attack
if access_level != "none\u202E\u2066"# Check if admin\u2069\u2066" && access_level != "user
The only indicator that something might be off is the syntax highlighting, which should be pretty resistant to being fooled.
GitHub has already introduced a feature that shows a warning when bi-directional overrides are detected in a file: https://github.blog/changelog/2021-10-31-warning-about-bidirectional-unicode-text/
Mitigation
This vulnerability can easily be defended against by disallowing bi-directional control characters in source code.
In many locations, such control characters are already a syntax error. But they are currently valid in comments and string literals, which are the typical attack locations in most languages.
However, Crystal's parser currently even accepts Unicode control characters in identifiers, including bi-directional override characters. Restricting the allowed character set in general is a separate problem, tracked in #11216.
I propose to change the language specification and lexer rules such that valid Crystal source code must not contain any bi-directional control characters, regardless of location.
A more fine-grained approach would be possible as well, but it should be unnecessary, considering there are few to no legitimate use cases for bi-directional control characters in Crystal source code (apart from the specific exceptions mentioned in the following section).
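Until such a rule exists in the lexer, a similar check can be approximated outside the compiler. The following is a minimal sketch, not the proposed lexer change: `BIDI_CONTROLS` and `find_bidi_controls` are hypothetical names, and the character set is an assumption (the nine embedding, override and isolate characters U+202A to U+202E and U+2066 to U+2069).

```crystal
# Assumed set of bi-directional control characters to reject.
BIDI_CONTROLS = [
  '\u202A', '\u202B', '\u202C', '\u202D', '\u202E', # LRE, RLE, PDF, LRO, RLO
  '\u2066', '\u2067', '\u2068', '\u2069'            # LRI, RLI, FSI, PDI
]

# Hypothetical helper: returns a {line, column, char} tuple (both 1-based)
# for every bi-directional control character found in `source`.
def find_bidi_controls(source : String)
  hits = [] of Tuple(Int32, Int32, Char)
  source.lines.each_with_index do |line, line_index|
    line.each_char_with_index do |char, char_index|
      if BIDI_CONTROLS.includes?(char)
        hits << {line_index + 1, char_index + 1, char}
      end
    end
  end
  hits
end

# The offending line of the stretched-string attack, written with escapes.
source = "if access_level != \"user\u202E \u2066# Check if admin\u2069 \u2066\""
find_bidi_controls(source).each do |line, column, char|
  puts "line #{line}, column #{column}: U+#{char.ord.to_s(16).upcase}"
end
```

Run against a file's contents (for example via `File.read`), any non-empty result would mean the file is rejected under the proposed rule.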
Workarounds
Bi-directional override characters are legitimate content for string literals. Instead of embedding them directly in the source code, the proper workaround is to write them as escape sequences such as \u202E.
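As a small illustration of this workaround, the literals below contain only ASCII in the source file, yet the resulting strings carry the control characters at runtime. The variable names are made up for the example.

```crystal
# Both escape forms are part of Crystal's string literal syntax.
rlo = "\u202E"   # RIGHT-TO-LEFT OVERRIDE, four-digit form
pdf = "\u{202C}" # POP DIRECTIONAL FORMATTING, curly-brace form

# The override is only present in the runtime value, not in the source text.
message = "label: #{rlo}abc#{pdf}"

puts message.includes?('\u202E') # => true
puts rlo.size                    # => 1
```

Because the source text stays free of invisible characters, what a reviewer sees matches what the parser reads.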
Bi-directional overrides can technically be legitimate in comments if you want to mix languages with different directions in the comment text. That does not seem like a very important use case, though.
Still, as a further enhancement, bi-directional overrides could potentially be allowed in comments and possibly other locations such as string literals as long as they are fully enclosed inside the comment or literal.
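One way to read "fully enclosed" is that every override, embedding or isolate opened inside a comment or literal is also closed inside it. The sketch below checks that under a simplifying assumption of strict nesting, which is stricter than the Unicode bidirectional algorithm itself; `bidi_balanced?` and the constant names are hypothetical.

```crystal
EMBEDDING_OPENERS = ['\u202A', '\u202B', '\u202D', '\u202E'] # LRE, RLE, LRO, RLO
ISOLATE_OPENERS   = ['\u2066', '\u2067', '\u2068']           # LRI, RLI, FSI
PDF = '\u202C' # closes an embedding or override
PDI = '\u2069' # closes an isolate

# Hypothetical helper: true if every opener in `text` is closed by the
# matching terminator, in strictly nested order, before the text ends.
def bidi_balanced?(text : String) : Bool
  expected_closers = [] of Char
  text.each_char do |char|
    if EMBEDDING_OPENERS.includes?(char)
      expected_closers << PDF
    elsif ISOLATE_OPENERS.includes?(char)
      expected_closers << PDI
    elsif char == PDF || char == PDI
      # A closer with no opener, or of the wrong kind, is unbalanced.
      return false unless expected_closers.pop? == char
    end
  end
  expected_closers.empty?
end

puts bidi_balanced?("abc \u202Bdef\u202C ghi")                        # => true
puts bidi_balanced?("user\u202E \u2066# Check if admin\u2069 \u2066") # => false
```

The second call uses the string contents from the stretched-string attack: the override and the final isolate are never closed, so a rule like this would still reject it.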
The general vulnerability is tracked as CVE-2021-42574.