This package provides a class to crawl links on a website. Under the hood Guzzle promises are used to crawl multiple urls concurrently.
Spatie is a webdesign agency in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.
You're free to use this package (it's MIT-licensed), but if it makes it to your production environment you are required to send us a postcard from your hometown, mentioning which of our package(s) you are using.
Our address is: Spatie, Samberstraat 69D, 2060 Antwerp, Belgium.
The best postcards will get published on the open source page on our website.
This package can be installed via Composer:
composer require spatie/crawler
The crawler can be instantiated like this:

Crawler::create()
    ->setCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
    ->startCrawling($url);
The argument passed to setCrawlObserver must be an object that implements the \Spatie\Crawler\CrawlObserver interface:
/**
* Called when the crawler will crawl the given url.
*
* @param \Spatie\Crawler\Url $url
*/
public function willCrawl(Url $url);
/**
* Called when the crawler has crawled the given url.
*
* @param \Spatie\Crawler\Url $url
* @param \Psr\Http\Message\ResponseInterface $response
* @param \Spatie\Crawler\Url $foundOn
*/
public function hasBeenCrawled(Url $url, ResponseInterface $response, Url $foundOn);
/**
* Called when the crawl has ended.
*/
public function finishedCrawling();
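For reference, a minimal observer might look like the sketch below. The class name and the echo-based logging are only an illustration, not part of the package, and it assumes the Url object can be cast to a string.

use Psr\Http\Message\ResponseInterface;
use Spatie\Crawler\CrawlObserver;
use Spatie\Crawler\Url;

class EchoCrawlObserver implements CrawlObserver
{
    // Called when the crawler will crawl the given url.
    public function willCrawl(Url $url)
    {
        // Assumes Url can be cast to a string.
        echo "About to crawl: " . (string) $url . PHP_EOL;
    }

    // Called when the crawler has crawled the given url.
    public function hasBeenCrawled(Url $url, ResponseInterface $response, Url $foundOn)
    {
        echo "Crawled " . (string) $url . " (status {$response->getStatusCode()}), found on " . (string) $foundOn . PHP_EOL;
    }

    // Called when the crawl has ended.
    public function finishedCrawling()
    {
        echo "Crawl finished" . PHP_EOL;
    }
}

Such an observer can then be passed to the crawler with ->setCrawlObserver(new EchoCrawlObserver()).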
You can tell the crawler not to visit certain urls by using the setCrawlProfile method. That method expects an object that implements the \Spatie\Crawler\CrawlProfile interface:
/*
* Determine if the given url should be crawled.
*/
public function shouldCrawl(Url $url): bool;
This package comes with two CrawlProfiles out of the box:

- CrawlAllUrls: this profile will crawl all urls on all pages, including urls to an external site.
- CrawlInternalUrls: this profile will only crawl the internal urls on the pages of a host.
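If neither of those fits your needs, you can write your own profile. Below is a minimal sketch; the class name and the /blog path rule are hypothetical, and it assumes the Url object can be cast to a string.

use Spatie\Crawler\CrawlProfile;
use Spatie\Crawler\Url;

class CrawlBlogSectionOnly implements CrawlProfile
{
    /*
     * Determine if the given url should be crawled.
     */
    public function shouldCrawl(Url $url): bool
    {
        // Cast the Url object to a string and inspect its path;
        // only urls whose path starts with /blog will be crawled.
        $path = parse_url((string) $url, PHP_URL_PATH) ?: '';

        return strpos($path, '/blog') === 0;
    }
}

Pass the profile to the crawler with ->setCrawlProfile(new CrawlBlogSectionOnly()) before calling startCrawling.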
To improve the speed of the crawl, the package concurrently crawls 10 urls by default. If you want to change that number, you can use the setConcurrency method.
Crawler::create()
    ->setCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
    ->setConcurrency(1) // now all urls will be crawled one by one
    ->startCrawling($url);
Please see CHANGELOG for more information on what has changed recently.
Please see CONTRIBUTING for details.
To run the tests, you'll have to start the included Node-based server first in a separate terminal window.
cd tests/server
./start_server.sh
With the server running, you can start testing.
vendor/bin/phpunit
If you discover any security related issues, please email freek@spatie.be instead of using the issue tracker.
The MIT License (MIT). Please see License File for more information.