[php2cpg] Try multiple charsets when reading file content#5222
[php2cpg] Try multiple charsets when reading file content#5222max-leuthaeuser merged 3 commits intomasterfrom
Conversation
|
I think thats why we have https://github.com/ShiftLeftSecurity/codepropertygraph/blob/master/codepropertygraph/src/main/scala/io/shiftleft/utils/IOUtils.scala It does not try multiple encodings but does some replacing and BOM fixing. Also unpaired surrogates are a thing. Had all of this in JS customer code already. |
1b08f8c to
6ee5aa4
Compare
| """function foo() { | ||
| | // ⦝ | ||
| | $x = "🙂⨌🙂𐇐🙂🙂🙂🙂"; | ||
| | $x = "??⨌????????????"; |
There was a problem hiding this comment.
Using IOUtils.readEntireFile introduces a regression in the characters we can actually parse, but I suspect it's unlikely we'll encounter many such characters in actual code and is outweighed by the benefit of using a uniform approach across frontends
There was a problem hiding this comment.
Hm, stange. It works here: io.joern.jssrc2cpg.io.CodeDumperFromContentTests
There was a problem hiding this comment.
I get the same results when pasting that string into one of the jssrc tests:
...
x = 'foo[??⨌????????????]';
}" was not equal to "...ass Foo
{
x = 'foo[🙂⨌🙂𐇐🙂🙂🙂🙂]';
...
I did a hex dump of a file containing only 🙂 and it matches the UTF-8 encoding from https://www.compart.com/en/unicode/U+1F642
❯ hexdump /tmp/smile
0000000 9ff0 8299 000a
0000005
❯ cat /tmp/smile
🙂
nvim could've done some conversions on write or such though. If not, it looks like it should be a valid UTF-8 character
There was a problem hiding this comment.
Maybe its time to move IOUtils to x2cpg (which wasnt a thing back then) and fix the utf8 handling. I'll do a draft PR.
Hm, can't do this. semanticcpg also uses that and can't depend on x2cpg (cyclic dependencies).
There was a problem hiding this comment.
With this: ShiftLeftSecurity/codepropertygraph#1804 the test case works again.
I was not able to (re-) discover some code with actual unpaired surrogates (which was the reason I implemented it in first place). Not really sure on how to proceed there.
There was a problem hiding this comment.
does this table here help with crafting such a file? https://simonsapin.github.io/wtf-8/#surrogates-byte-sequences
I'm pretty sure we only added the surrogate replacement because of issues we had with a customer file.
There was a problem hiding this comment.
Interesting! We should add some tests for that. Just not sure how to craft it exactly and if it makes sense to feed such a thing into our parsers at the end.
cd2abfd to
3a57833
Compare
|
Upgrading to ShiftLeftSecurity/codepropertygraph#1804 fixed the test. Thanks @max-leuthaeuser! |
There's a customer issue where reading the file content was resulting in a
MalformedInputExceptionbeing thrown. With this PR, we just try all of the Java standard charsets before emitting a warning instead of crashing: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/charset/StandardCharsets.htmlI would be very surprised if we ever need the specific endianness UTF-16 variants, but since this only tries encodings until something works, I also don't think adding them to the list has much of a downside.