A Laravel package for cleaning and transforming HTML content. It provides a fluent interface to remove unwanted elements like CSS, scripts, and more, with options to preserve specific elements and even convert the cleaned HTML to Markdown.
- Remove CSS (inline styles and <style> blocks)
- Remove JavaScript (inline scripts and <script> blocks)
- Preserve allowed tags through a configurable list or helper methods
- Convert to Markdown for quick text transformations
- Custom Regex Patterns to remove specific parts of the HTML
- Whitespace Normalization with an option to preserve newlines
Install the package using Composer:
composer require cloudstudio/laravel-html-crawler
The package will automatically register itself in Laravel.
To publish the configuration file, run:
php artisan vendor:publish --provider="CloudStudio\HtmlCrawler\HtmlCrawlerServiceProvider"
By default, the package removes disallowed tags (for example, it will strip <div> tags and any tags not explicitly allowed):
use CloudStudio\HtmlCrawler\Facades\HtmlCrawler;
$html = '<div><p>Hello <strong>World</strong></p></div>';
$cleanHtml = HtmlCrawler::fromHtml($html)->clean();
// Expected output: "Hello World"
You can explicitly specify which tags to preserve:
use CloudStudio\HtmlCrawler\Facades\HtmlCrawler;
$html = '<div><p>Hello <a href="#">World</a></p></div>';
$cleanHtml = HtmlCrawler::fromHtml($html)
->setAllowedTags(['p', 'a'])
->clean();
// Expected output: '<p>Hello <a href="#">World</a></p>'
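To make the allow-list idea concrete, here is a minimal sketch that uses PHP's built-in strip_tags() instead of the package. It only illustrates the concept; the package's internal implementation may differ, and the helper name stripToAllowedTags is hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Illustrative only: keeps the tags named in $allowed and strips the rest.
 * Hypothetical helper, not part of the package.
 *
 * @param string[] $allowed Tag names without angle brackets, e.g. ['p', 'a'].
 */
function stripToAllowedTags(string $html, array $allowed = []): string
{
    // strip_tags() accepts an array of allowed tag names since PHP 7.4.
    return strip_tags($html, $allowed);
}

$html = '<div><p>Hello <a href="#">World</a></p></div>';

echo stripToAllowedTags($html), PHP_EOL;             // Hello World
echo stripToAllowedTags($html, ['p', 'a']), PHP_EOL; // <p>Hello <a href="#">World</a></p>
```

An empty allow-list strips every tag, which mirrors the default behaviour described earlier.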
The package offers helper methods to preserve groups of tags:
use CloudStudio\HtmlCrawler\Facades\HtmlCrawler;
$html = '<div><p>Hello <a href="#">World</a></p></div>';
$cleanHtml = HtmlCrawler::fromHtml($html)
->keepParagraphs() // Preserves <p> tags
->keepLinks() // Preserves <a> tags
->clean();
// Expected output: '<p>Hello <a href="#">World</a></p>'
By default, <script> blocks are removed:
use CloudStudio\HtmlCrawler\Facades\HtmlCrawler;
$html = '<div><script>alert("x")</script><p>Test</p></div>';
$cleanHtml = HtmlCrawler::fromHtml($html)->clean();
// Expected output: "Test"
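Removing a script block means dropping its contents as well as its tags. The sketch below shows the difference in plain PHP: strip_tags() alone leaves the JavaScript source behind as text. The helper removeScriptBlocks is hypothetical and is not how the package is necessarily implemented.

```php
<?php

declare(strict_types=1);

/**
 * Illustrative only: drops <script> elements including their contents.
 * Hypothetical helper, not part of the package.
 */
function removeScriptBlocks(string $html): string
{
    // "i" also matches <SCRIPT>; "s" lets "." span newlines inside the block.
    return preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html) ?? $html;
}

$html = '<div><script>alert("x")</script><p>Test</p></div>';

echo strip_tags($html), PHP_EOL;                     // alert("x")Test
echo strip_tags(removeScriptBlocks($html)), PHP_EOL; // Test
```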
If you wish to keep <script> blocks, use the keepScripts() method:
use CloudStudio\HtmlCrawler\Facades\HtmlCrawler;
$html = '<div><script>alert("x")</script><p>Test</p></div>';
$cleanHtml = HtmlCrawler::fromHtml($html)
->keepScripts()
->clean();
// Expected output: '<script>alert("x")</script><p>Test</p>'
By default, <style> blocks and CSS links are removed. To preserve them, use keepCss():
use CloudStudio\HtmlCrawler\Facades\HtmlCrawler;
$html = '<div><style>.text { color: red; }</style><p>Styled text</p></div>';
$cleanHtml = HtmlCrawler::fromHtml($html)
->keepCss()
->clean();
// Expected output: '<style>.text { color: red; }</style><p>Styled text</p>'
If you need to remove specific parts of the HTML using a regular expression:
use CloudStudio\HtmlCrawler\Facades\HtmlCrawler;
$html = '<div><span class="remove">Remove me</span><p>Keep me</p></div>';
$pattern = '/<span class="remove">.*?<\/span>/';
$cleanHtml = HtmlCrawler::fromHtml($html)
->useCustomPattern($pattern)
->clean();
// Expected output: '<p>Keep me</p>'
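One detail worth knowing when writing such patterns: in PCRE, "." does not match a newline unless the "s" modifier is set, so a pattern like the one above would miss an element whose content spans several lines. The sketch below demonstrates this with plain preg_replace(); it assumes the package applies the pattern with standard PCRE semantics, which is not confirmed here.

```php
<?php

declare(strict_types=1);

// The element to remove spans two lines.
$html = "<span class=\"remove\">Remove\nme</span><p>Keep me</p>";

$singleLine = '/<span class="remove">.*?<\/span>/';
$multiLine  = '/<span class="remove">.*?<\/span>/s';

// Without "s", "." stops at the newline, so nothing is removed.
echo preg_replace($singleLine, '', $html), PHP_EOL;

// With "s", the whole span is removed.
echo preg_replace($multiLine, '', $html), PHP_EOL; // <p>Keep me</p>
```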
You can convert the cleaned HTML to Markdown:
use CloudStudio\HtmlCrawler\Facades\HtmlCrawler;
$html = '<h1>Title</h1><p>Paragraph text</p>';
$markdown = HtmlCrawler::fromHtml($html)
->withMarkdown()
->clean();
Control how newlines are handled in the HTML:
use CloudStudio\HtmlCrawler\Facades\HtmlCrawler;
$html = "Line 1\nLine 2";
$cleanHtml = HtmlCrawler::fromHtml($html)
->preserveNewlines(false) // Set to false to replace newlines with spaces
->clean();
// Expected output: "Line 1 Line 2"
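As a rough illustration of the two modes, the sketch below normalizes whitespace in plain PHP. The helper normalizeWhitespace is hypothetical; the package's exact rules (for example, how it treats tabs or blank lines) may differ.

```php
<?php

declare(strict_types=1);

/**
 * Illustrative only: collapses whitespace, optionally keeping line breaks.
 * Hypothetical helper, not part of the package.
 */
function normalizeWhitespace(string $text, bool $preserveNewlines = true): string
{
    if ($preserveNewlines) {
        // Collapse runs of spaces and tabs, but leave line breaks in place.
        $text = preg_replace('/[ \t]+/', ' ', $text) ?? $text;
    } else {
        // Collapse every run of whitespace, newlines included, to one space.
        $text = preg_replace('/\s+/', ' ', $text) ?? $text;
    }

    return trim($text);
}

echo normalizeWhitespace("Line 1\nLine 2", false), PHP_EOL; // Line 1 Line 2
echo normalizeWhitespace("a   b\nc", true), PHP_EOL;        // "a b", then "c" on the next line
```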
You can also load HTML directly from a URL:
use CloudStudio\HtmlCrawler\Facades\HtmlCrawler;
$cleanHtml = HtmlCrawler::fromUrl('https://example.com')
->clean();
// Output: the cleaned HTML content retrieved from the URL.
The package includes a configuration file that allows you to define default options. After publishing the configuration file, you will find it at config/html-crawler.php:
return [
'preserve_newlines' => true,
'allowed_tags' => [],
'convert_to_markdown' => false,
'remove_scripts' => true,
'remove_styles' => true,
];
You can modify these values according to your needs.
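The values can also be read or overridden at runtime through Laravel's config() helper. The fragment below is a sketch that assumes a booted Laravel application and that the file is registered under the html-crawler key, which is what Laravel derives from the file name; whether the package re-reads these values on every call is not confirmed here.

```php
<?php

// Read a default, falling back to an empty list if the key is missing.
$allowedTags = config('html-crawler.allowed_tags', []);

// Override a default for the current process only,
// for example inside a test or a service provider.
config(['html-crawler.allowed_tags' => ['p', 'a']]);
```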
If you encounter the error:
BindingResolutionException: Target class [config] does not exist.
make sure your tests are running in a Laravel environment using orchestra/testbench. For package testing, install Testbench with:
composer require --dev orchestra/testbench
Then, set up your base test case to extend Testbench (see the package documentation for more details).
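A base test case along these lines is a common starting point. This is a sketch, not the package's own test setup: the Tests namespace and the HtmlCrawler alias are assumptions, while the provider and facade class names are taken from the examples above.

```php
<?php

namespace Tests;

use CloudStudio\HtmlCrawler\Facades\HtmlCrawler;
use CloudStudio\HtmlCrawler\HtmlCrawlerServiceProvider;
use Orchestra\Testbench\TestCase as Orchestra;

abstract class TestCase extends Orchestra
{
    /**
     * Register the package's service provider so the container
     * (including the "config" binding) is available in tests.
     */
    protected function getPackageProviders($app)
    {
        return [HtmlCrawlerServiceProvider::class];
    }

    /**
     * Register the facade alias (the alias name is an assumption).
     */
    protected function getPackageAliases($app)
    {
        return ['HtmlCrawler' => HtmlCrawler::class];
    }
}
```

With Pest, the base class is typically bound in tests/Pest.php via uses(Tests\TestCase::class)->in(__DIR__); with PHPUnit, test classes extend it directly.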
To run the tests, you can use:
./vendor/bin/pest
or if using PHPUnit:
./vendor/bin/phpunit
Please see the CHANGELOG for detailed information on recent changes.
Please refer to CONTRIBUTING for details on how to contribute to this package.
Please review our security policy on how to report security vulnerabilities.
This package is open-sourced software licensed under the MIT license.