Releases: CryShana/CryCrawler
Hotfix for URL matching bug
Global crawl delay and small improvements
You can now set a global crawl delay in seconds in the configuration file.
Small improvements include:
- Package version changed so `--version` shows the correct version now
- More logs in DEBUG mode to display why some URLs were skipped
- Seed URLs will be reloaded regardless of whether they were already crawled, as long as the backlog is empty. This should fix the annoyance of having to delete the cache every time a crawl fails or the backlog isn't properly saved.
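A sketch of what the global crawl delay might look like in the configuration file. The key name `crawlDelaySeconds` is a guess for illustration only, not necessarily the actual key CryCrawler uses:

```json
{
  "crawlDelaySeconds": 2
}
```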
Blacklisted URL patterns
You can now define a list of URL patterns to be blacklisted, similar to the file URL pattern matching introduced in v1.0.3.
Unlike file URL pattern matching, where only file URLs are compared to the patterns and accepted if they match any of them (effectively a whitelist), this blacklist applies to all URLs, not only file URLs. The pattern rules are the same.
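A sketch of how such a blacklist might look in the configuration file. The key name `urlBlacklist` and the patterns are illustrative assumptions, not CryCrawler's actual configuration schema:

```json
{
  "urlBlacklist": [
    "ads.example.com/*",
    "/tracking/*"
  ]
}
```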
File URL pattern matching
Implemented new file criteria - ability to filter out files based on file URL.
You can now define a list of URL patterns in the configuration file or using Web GUI.
Example URL patterns include:
`somedomain.com/image/*`
`/original/*`
`/page/*/anotherpage/*`
Beware that `/image/` will only match URLs ending with `/image/` and not `/image/somethingelse` or `/image`. This is why it is recommended to always add a `*` at the end.
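Under one plausible reading of these rules (`*` matches any run of characters, a pattern may start anywhere in the URL but must match through to its end), the matching could be sketched in Python; this is an illustration, not CryCrawler's actual implementation:

```python
import re

def matches_pattern(url: str, pattern: str) -> bool:
    """Check a URL against a '*'-wildcard pattern as described above."""
    # Translate each '*' into regex '.*' and escape the literal parts.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Anchor at the end of the URL: "/image/" then matches only URLs
    # ending with "/image/", while "/image/*" matches anything below it.
    return re.search(regex + "$", url) is not None
```

This explains the warning above: without a trailing `*`, the pattern must coincide with the very end of the URL.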
Blacklist fixed
This release contains a hotfix for the blacklist issue where blacklisted subdomains were not skipped if their main domain was whitelisted. The blacklist is now checked before the whitelist.
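The corrected check order can be sketched as follows. This is a simplified illustration using substring domain checks; `is_url_allowed` is a hypothetical helper, not CryCrawler's actual code:

```python
def is_url_allowed(url: str, blacklist: list[str], whitelist: list[str]) -> bool:
    """Decide whether a URL may be crawled, checking the blacklist first."""
    # Blacklist wins: a blacklisted subdomain is skipped even when its
    # main domain appears on the whitelist.
    if any(domain in url for domain in blacklist):
        return False
    # If a whitelist is configured, the URL must match it.
    if whitelist:
        return any(domain in url for domain in whitelist)
    return True
```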
Robots.txt functionality
- Added option to set the User-Agent from `config.js` and the WebGUI
- Added option to respect `robots.txt` on websites
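A minimal sketch of what respecting `robots.txt` means in practice: before fetching a URL, the crawler checks it against the site's rules under the configured User-Agent. CryCrawler itself is written in C#; Python's standard-library `robotparser` is used here only to illustrate the idea:

```python
from urllib import robotparser

# The User-Agent string is now configurable (from config.js or the WebGUI).
user_agent = "CryCrawler"

parser = robotparser.RobotFileParser()
# Example robots.txt content; a crawler would normally fetch this
# from https://<site>/robots.txt before crawling the site.
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(parser.can_fetch(user_agent, "https://example.com/private/page"))  # False
print(parser.can_fetch(user_agent, "https://example.com/public/page"))   # True
```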
Initial release
v1.0 function rename