
Releases: CryShana/CryCrawler

Hotfix for URL matching bug

11 Sep 17:02
  • Fixed a bug where files wouldn't be matched if no URL patterns were specified
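The fix presumably amounts to treating an empty pattern list as "no filter". A minimal Python sketch of that logic (the function name and fnmatch-based matching are illustrative assumptions, not CryCrawler's actual code):

```python
from fnmatch import fnmatch

def file_url_accepted(url: str, patterns: list[str]) -> bool:
    # An empty pattern list means "no filter": accept every file URL
    # instead of rejecting it for matching nothing (the old bug).
    if not patterns:
        return True
    # A leading * lets the pattern start anywhere in the URL.
    return any(fnmatch(url, "*" + p) for p in patterns)
```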

Global crawl delay and small improvements

16 Aug 08:38
dfef029

You can now set a global crawl delay in seconds in the configuration file.
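A global crawl delay behaves like a simple throttle between consecutive requests. A hedged Python sketch of the idea (illustration only, not the crawler's actual implementation):

```python
import time

class Throttle:
    """Enforce a minimum delay (in seconds) between consecutive requests."""

    def __init__(self, delay_seconds: float):
        self.delay = delay_seconds
        self.last_request = 0.0

    def wait(self):
        # Sleep only for the remainder of the delay since the last request.
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request = time.monotonic()
```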

Small improvements include:

  • Package version updated so that --version now reports the correct version
  • More logs in DEBUG mode to display why some URLs were skipped
  • Seed URLs will now be reloaded whenever the backlog is empty, regardless of whether they were already crawled - this should fix the annoyance of having to delete the cache every time a crawl fails or the backlog isn't properly saved

Blacklisted URL patterns

15 Aug 18:38
9fdcbca

You can now define a list of URL patterns to be blacklisted, similar to the file URL pattern matching introduced in v1.0.3.

Unlike file URL pattern matching, where only file URLs are compared to patterns and accepted if they match any (effectively a whitelist), this blacklist applies to all URLs, not only file URLs. The pattern rules are the same.
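The blacklist check is the inverse of the whitelist logic above: a URL is skipped if it matches any blacklisted pattern. A minimal sketch, again assuming fnmatch-style wildcards for illustration:

```python
from fnmatch import fnmatch

def is_blacklisted(url: str, blacklist: list[str]) -> bool:
    # Applies to ALL URLs, not only file URLs: matching ANY
    # blacklisted pattern means the URL is skipped.
    return any(fnmatch(url, "*" + p) for p in blacklist)
```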

File URL pattern matching

15 Aug 17:54
8ec30de

Implemented a new file criterion: the ability to filter out files based on their URL.

You can now define a list of URL patterns in the configuration file or using Web GUI.

Example URL patterns include:
somedomain.com/image/*
/original/*
/page/*/anotherpage/*

Beware that /image/ will only match URLs ending with /image/, not /image/somethingelse or /image. This is why it is recommended to always add a * at the end.
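The matching behavior described above can be demonstrated with Python's fnmatch, assuming the pattern may start anywhere in the URL but must cover it to the end unless it ends with a * (a sketch of the semantics, not CryCrawler's actual matcher):

```python
from fnmatch import fnmatch

def matches(url: str, pattern: str) -> bool:
    # A leading * lets the pattern start anywhere in the URL;
    # without a trailing *, the pattern must reach the URL's end.
    return fnmatch(url, "*" + pattern)

# "/image/" only matches URLs that END with /image/:
matches("https://example.com/image/", "/image/")         # True
matches("https://example.com/image/cat.jpg", "/image/")  # False
# A trailing * matches everything under /image/:
matches("https://example.com/image/cat.jpg", "/image/*")  # True
```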

Blacklist fixed

14 Aug 13:46

This release contains a hotfix for a blacklist issue where blacklisted subdomains were still crawled if their main domain was whitelisted. The blacklist is now checked before the whitelist.
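The corrected ordering can be sketched as follows (a hypothetical domain-based check for illustration; the actual matching rules are pattern-based):

```python
def url_allowed(url: str, whitelist: list[str], blacklist: list[str]) -> bool:
    # The blacklist is checked FIRST, so a blacklisted subdomain is
    # skipped even when its parent domain is whitelisted.
    domain = url.split("//")[-1].split("/")[0]
    if any(domain.endswith(d) for d in blacklist):
        return False
    if whitelist:
        return any(domain.endswith(d) for d in whitelist)
    return True
```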

Robots.txt functionality

01 Aug 20:22
  • added option to set the User-Agent from config.js and the Web GUI
  • added option to respect robots.txt on crawled websites
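The robots.txt rules being respected work like this standard-library sketch (Python's urllib.robotparser is used purely to illustrate the rule evaluation; it is not what CryCrawler uses):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body and check whether a given User-Agent
# may fetch specific URLs.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

rp.can_fetch("MyCrawler/1.0", "https://example.com/public/page")   # True
rp.can_fetch("MyCrawler/1.0", "https://example.com/private/data")  # False
```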

Initial release

31 Jul 14:02
v1.0

function rename