Attention: This package is outdated and is no longer maintained. Use PHP Web Crawler instead.
This php class allows you to crawl recursively a given webpage (or a given html file) and collect some data from it. Simply define the url (or a html file) and a set of xpath expressions which should map with the output data object. The final representation will be a php array which can be easily converted into the json format for further processing.
user$ git clone git@github.com:bjoern-hempel/php-web-crawler.git .TODO...
user$ php examples/simple.php
{
"version": "1.0.0",
"title": "Test Title",
"paragraph": "Test Paragraph"
}2.1 Basic usage simple.php (simple html page)
<html>
<head>
<title>Test Page</title>
</head>
<body>
<h1>Test Title</h1>
<p>Test Paragraph</p>
</body>
</html><?php
include dirname(__FILE__).'/../autoload.php';
use Ixno\WebCrawler\Output\Field;
use Ixno\WebCrawler\Value\Text;
use Ixno\WebCrawler\Value\XpathTextnode;
use Ixno\WebCrawler\Source\File;
$file = dirname(__FILE__).'/html/basic.html';
$html = new File(
$file,
new Field('version', new Text('1.0.0')),
new Field('title', new XpathTextnode('//h1')),
new Field('paragraph', new XpathTextnode('//p'))
);
$data = json_encode($html->parse(), JSON_PRETTY_PRINT);
print_r($data);
echo "\n";It returns:
{
"version": "1.0.0",
"title": "Test Title",
"paragraph": "Test Paragraph"
}- examples/simple-wiki-page.php
- examples/group.php
- examples/section.php
- examples/sections.php
- examples/url.php
TODO...
TODO...
TODO...
user$ phpunit tests/Basic.php
PHPUnit 7.0.2 by Sebastian Bergmann and contributors.
.. 2 / 2 (100%)
Time: 126 ms, Memory: 8.00MB
OK (2 tests, 16 assertions)Using the autoloader function of the composer it is possible to use this classes without including the source files.
Make some changes to your composer.json:
"autoload": {
"psr-0": {
...
"Ixno\\WebCrawler\\":"vendor/ixno/webcrawler/",
...
}
},Add this project to your vendor directory:
user$ cd /path/to/root/of/project
user$ mkdir vendor/ixno/webcrawler && cd vendor/ixno/webcrawler
user$ git clone git@github.com:bjoern-hempel/php-web-crawler.git . && cd ../../..Call the composer to create the composer autoloading mappings:
user$ composer.phar dumpautoload -oCheck the result:
user$ grep -r Ixno vendor/composer/.You will something like the following lines:
vendor/composer/./autoload_classmap.php: 'Ixno\\WebCrawler\\Converter\\Converter' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Converter/Converter.php',
vendor/composer/./autoload_classmap.php: 'Ixno\\WebCrawler\\Converter\\DateParser' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Converter/DateParser.php',
vendor/composer/./autoload_classmap.php: 'Ixno\\WebCrawler\\CrawlRule' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',
vendor/composer/./autoload_classmap.php: 'Ixno\\WebCrawler\\Crawler' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',
vendor/composer/./autoload_classmap.php: 'Ixno\\WebCrawler\\Page' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',
vendor/composer/./autoload_classmap.php: 'Ixno\\WebCrawler\\PageGroup' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',
vendor/composer/./autoload_classmap.php: 'Ixno\\WebCrawler\\PageList' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',
...Now you can simply use all classes without including the source files.
- Björn Hempel - Initial work - Björn Hempel
This project is licensed under the MIT License - see the LICENSE file for details
Have fun! :)