@bytestream
Contributor

When provided with a large HTML document (over a million characters), the Core.AggressivelyFixLt regex hits catastrophic backtracking and $html comes back as null. TL;DR: HTMLPurifier gives you back a null document...

I tried many times to produce a regex that did not suffer from catastrophic backtracking, but I think it ultimately comes back to the argument for why you should not use regex to parse HTML. The only other solutions I could come up with were:

  • Increase pcre.backtrack_limit to a higher value
  • Disable Core.AggressivelyFixLt but that's sub-optimal given the approach seems to work on documents of a reasonable size...
  • Handle the null return value from preg_replace_callback and return $html (disable armor logic if a regex error occurs)
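
For illustration, the first and third options could look roughly like the sketch below. This is not code from the PR: `replaceOrKeep` is a hypothetical helper name, the limit values are arbitrary, and the demo pattern is a generic backtracking-heavy regex rather than the actual Core.AggressivelyFixLt expression.

```php
<?php
// Hypothetical helper (not HTMLPurifier code): apply a regex callback, but
// keep the original input if PCRE fails instead of propagating null.
function replaceOrKeep(string $pattern, callable $callback, string $html): string
{
    $result = preg_replace_callback($pattern, $callback, $html);
    if ($result === null) {
        // preg_last_error() reports the cause, e.g. PREG_BACKTRACK_LIMIT_ERROR.
        return $html;
    }
    return $result;
}

// Option one would raise the limit instead (value chosen arbitrarily), e.g.:
// ini_set('pcre.backtrack_limit', '10000000');

// Demo: lower the limit so a backtracking-heavy pattern gives up on a tiny input.
$previousLimit = ini_get('pcre.backtrack_limit');
ini_set('pcre.backtrack_limit', '1000');

$input = 'foobar foobar foobar';
$output = replaceOrKeep(
    '/(?:\D+|<\d+>)*[!?]/',
    function (array $m) {
        return strtoupper($m[0]);
    },
    $input
);

// Restore the original setting so later regex calls are unaffected.
ini_set('pcre.backtrack_limit', $previousLimit);
```

Both approaches only paper over the problem: a larger limit still fails on a big enough document, and the fallback silently skips the comment armoring.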

The solution in this PR uses a small algorithm built only on standard string manipulation functions, so there is no regex backtracking and it runs very fast. The algorithm searches for HTML comments and allows a callback to be run on them.

I've not messed with the signatures of the callbackUndoCommentSubst and callbackArmorCommentEntities functions because they're public and might be used by other libraries.
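
As a rough sketch of the idea (not the merged implementation), a comment scanner built on `strpos` and `substr` might look like this. The function name `mapHtmlComments` and the shape of the array passed to the callback (full match, comment body, terminator, mimicking a regex `$matches` array) are assumptions made for this example.

```php
<?php
// Hypothetical sketch: find each HTML comment with plain string functions
// and replace it with whatever the callback returns.
function mapHtmlComments(string $html, callable $callback): string
{
    $out = '';
    $offset = 0;
    $length = strlen($html);

    while ($offset < $length) {
        $start = strpos($html, '<!--', $offset);
        if ($start === false) {
            break; // no more comments
        }

        // Copy the text between the previous position and this comment.
        $out .= substr($html, $offset, $start - $offset);

        $bodyStart = $start + 4;
        $end = strpos($html, '-->', $bodyStart);
        if ($end === false) {
            // Unterminated comment: treat the rest of the document as its body.
            $body = (string) substr($html, $bodyStart);
            $terminator = '';
            $offset = $length;
        } else {
            $body = substr($html, $bodyStart, $end - $bodyStart);
            $terminator = '-->';
            $offset = $end + 3;
        }

        // Mimic a regex $matches array so regex-style callbacks can be reused.
        $out .= $callback(array('<!--' . $body . $terminator, $body, $terminator));
    }

    // Append whatever follows the last comment.
    return $out . (string) substr($html, $offset);
}

// Usage: escape "<" inside comments only, leaving the rest untouched.
$escaped = mapHtmlComments(
    'a<b <!-- x<y --> c<d',
    function (array $m) {
        return '<!--' . str_replace('<', '&lt;', $m[1]) . $m[2];
    }
);
```

Each `strpos` call resumes from the current offset, so the document is scanned once from start to end regardless of how many comments it contains.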

@bytestream
Contributor Author

Any thoughts @ezyang ?

@bytestream
Contributor Author

@ezyang please could you review?

@ezyang
Owner

ezyang commented Jun 1, 2025

You caught me at an unlucky time, as I went on parental leave when you submitted this PR. It's in my queue.

@ezyang ezyang merged commit 418eeb7 into ezyang:master Jun 6, 2025
11 checks passed
github-actions bot pushed a commit that referenced this pull request Oct 17, 2025
# [4.19.0](v4.18.0...v4.19.0) (2025-10-17)

### Bug Fixes

* add warning for misleading option ([#433](#433)) ([b21a591](b21a591))
* catastrophic backtracking in Core.AggressivelyFixLt ([#440](#440)) ([418eeb7](418eeb7))
* Deprecated: preg_replace(): Passing null to parameter #3 ($subject) o… ([#421](#421)) ([5d154a2](5d154a2))
* non-substantive typos ([#434](#434)) ([c2bc354](c2bc354))

### Features

* Add CSS direction support ([#429](#429)) ([63e631e](63e631e))
* Add option for safe iframe hosts using array lookup ([#423](#423)) ([b5cbf0c](b5cbf0c))
* Allow more image widths by default ([#430](#430)) ([00a0748](00a0748))
* Define option URI.AllowedSymbols ([#447](#447)) ([77ebd08](77ebd08))
* PHP 8.4 support ([#441](#441)) ([ff005f6](ff005f6))
* Support PHP 8.5 versions ([#453](#453)) ([1eb05d9](1eb05d9))
@github-actions

🎉 This PR is included in version 4.19.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
