Strengthen check for UTF-8 conformity in formatContent() #704

Merged: k00ni merged 1 commit into smalot:master from GreyWyvern:strict-utf8 on Apr 29, 2024

@GreyWyvern

Contributor

Type of pull request

  • Bug fix (involves code and configuration changes)

About

In some cases a binary string passes as valid UTF-8 when checked with mb_check_encoding(..., 'UTF-8'). Use the comprehensive regular expression published by the W3C instead, to be certain we aren't trying to parse binary content in formatContent(). In addition to the start of (string) literals, also check for the start of ID inline image data sections, which may contain binary as well. Resolves #668.

Reference: https://www.w3.org/International/questions/qa-forms-utf-8.en
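
To illustrate the idea, here is a minimal sketch rather than the code merged in this PR. The helper names `isStrictUtf8Text()` and `prefixBeforeBinary()` are hypothetical; the pattern is the one from the W3C page referenced above. It assumes that only the part of the content before the first `(` or `ID` marker needs to be tested. Control bytes such as NUL are well-formed UTF-8, so an encoding check alone accepts them, while the W3C pattern only allows tab, LF, CR, printable ASCII, and well-formed multi-byte sequences.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: true only if $bytes consists entirely of the
 * byte sequences allowed by the W3C "qa-forms-utf-8" pattern.
 */
function isStrictUtf8Text(string $bytes): bool
{
    // Single-quoted so the \xNN escapes reach PCRE untouched. No "u"
    // modifier is used, so the subject is matched byte by byte.
    $pattern = '/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII: tab, LF, CR, printable
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*\z/x';

    return 1 === preg_match($pattern, $bytes);
}

/**
 * Hypothetical helper: keep only what precedes the first "(" (start of a
 * string) or the first "ID" followed by whitespace (assumed here to mark
 * the start of inline image data). This is a simplification.
 */
function prefixBeforeBinary(string $content): string
{
    $parts = preg_split('/\(|ID\s/', $content, 2);

    return false === $parts ? '' : $parts[0];
}

// The string body may hold arbitrary bytes; only the prefix is tested.
$textStream = "BT /F1 12 Tf (Hello \xFF\xFE) Tj ET";
var_dump(isStrictUtf8Text(prefixBeforeBinary($textStream)));   // bool(true)

// Control bytes are rejected by the strict pattern.
$binaryStream = "\x00\x01\x02\x89PNG";
var_dump(isStrictUtf8Text(prefixBeforeBinary($binaryStream))); // bool(false)
```

On very large inputs, an anchored repeated group like this one may run into PCRE limits, in which case `preg_match()` returns `false` and the sketch treats the content as non-text.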

Checklist for code / configuration changes

In case you changed the code/configuration, please read each of the following checkboxes as they contain valuable information:

  • Please add at least one test case (unit test, system test, ...) to demonstrate that the change is working. If existing code was changed, your tests cover these code parts as well.
  • Please run PHP-CS-Fixer before committing, to conform to our coding style. See https://github.com/smalot/pdfparser/blob/master/.php-cs-fixer.php for more information about our coding style.
  • In case you fix an existing issue, please do one of the following:
    • Write in this text something like fixes #1234 to outline that you are providing a fix for the issue #1234.

@k00ni left a comment

Collaborator

@GreyWyvern I appreciate that you took the time to improve code readability by extending the comments and moving the preg_replace call out of the if-clause. Thanks!

@k00ni merged commit 14adf31 into smalot:master on Apr 29, 2024

Development

Successfully merging this pull request may close these issues.

preg_match(): compilation failed: regular expression is too large to offset 143690
