Skip to content

[php2cpg] Try multiple charsets when reading file content#5222

Merged
max-leuthaeuser merged 3 commits intomasterfrom
johannes/php-file-encoding
Jan 14, 2025
Merged

[php2cpg] Try multiple charsets when reading file content#5222
max-leuthaeuser merged 3 commits intomasterfrom
johannes/php-file-encoding

Conversation

@johannescoetzee
Copy link
Contributor

There's a customer issue where reading the file content was resulting in a MalformedInputException being thrown. With this PR, we just try all of the Java standard charsets before emitting a warning instead of crashing: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/charset/StandardCharsets.html

I would be very surprised if we ever need the specific endianness UTF-16 variants, but since this only tries encodings until something works, I also don't think adding them to the list has much of a downside.

@max-leuthaeuser
Copy link
Contributor

max-leuthaeuser commented Jan 13, 2025

I think thats why we have https://github.com/ShiftLeftSecurity/codepropertygraph/blob/master/codepropertygraph/src/main/scala/io/shiftleft/utils/IOUtils.scala

It does not try multiple encodings but does some replacing and BOM fixing. Also unpaired surrogates are a thing. Had all of this in JS customer code already.

@johannescoetzee johannescoetzee force-pushed the johannes/php-file-encoding branch from 1b08f8c to 6ee5aa4 Compare January 13, 2025 10:47
"""function foo() {
| // ⦝
| $x = "🙂⨌🙂𐇐🙂🙂🙂🙂";
| $x = "??⨌????????????";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using IOUtils.readEntireFile introduces a regression in the characters we can actually parse, but I suspect it's unlikely we'll encounter many such characters in actual code and is outweighed by the benefit of using a uniform approach across frontends

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, stange. It works here: io.joern.jssrc2cpg.io.CodeDumperFromContentTests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the same results when pasting that string into one of the jssrc tests:

...
  x = 'foo[??⨌????????????]';
}" was not equal to "...ass Foo
{
  x = 'foo[🙂⨌🙂𐇐🙂🙂🙂🙂]';
...

I did a hex dump of a file containing only 🙂 and it matches the UTF-8 encoding from https://www.compart.com/en/unicode/U+1F642

❯ hexdump /tmp/smile
0000000 9ff0 8299 000a                         
0000005
❯ cat /tmp/smile
🙂

nvim could've done some conversions on write or such though. If not, it looks like it should be a valid UTF-8 character

Copy link
Contributor

@max-leuthaeuser max-leuthaeuser Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe its time to move IOUtils to x2cpg (which wasnt a thing back then) and fix the utf8 handling. I'll do a draft PR.

Hm, can't do this. semanticcpg also uses that and can't depend on x2cpg (cyclic dependencies).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this: ShiftLeftSecurity/codepropertygraph#1804 the test case works again.
I was not able to (re-) discover some code with actual unpaired surrogates (which was the reason I implemented it in first place). Not really sure on how to proceed there.

Copy link
Contributor

@maltek maltek Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this table here help with crafting such a file? https://simonsapin.github.io/wtf-8/#surrogates-byte-sequences

I'm pretty sure we only added the surrogate replacement because of issues we had with a customer file.

Copy link
Contributor

@max-leuthaeuser max-leuthaeuser Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! We should add some tests for that. Just not sure how to craft it exactly and if it makes sense to feed such a thing into our parsers at the end.

@johannescoetzee johannescoetzee force-pushed the johannes/php-file-encoding branch from cd2abfd to 3a57833 Compare January 13, 2025 15:51
@johannescoetzee
Copy link
Contributor Author

Upgrading to ShiftLeftSecurity/codepropertygraph#1804 fixed the test. Thanks @max-leuthaeuser!

@max-leuthaeuser max-leuthaeuser merged commit a922084 into master Jan 14, 2025
5 checks passed
@max-leuthaeuser max-leuthaeuser deleted the johannes/php-file-encoding branch January 14, 2025 07:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants