- ...
- ...
- ...
- Updated to Ruby 3.3 and updated production dependencies including Wgit (v0.11)
- Added `--js` and `--js-delay` flag options to the executable. This allows JS parsing to update a page's DOM before it gets crawled (see the sketch below).
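  A minimal invocation sketch; the URL is a placeholder and the delay unit is an assumption, so check the executable's help output for the exact semantics:

  ```sh
  # Parse the page's JS (waiting briefly for the DOM to update) before crawling.
  broken_link_finder crawl http://example.com --js --js-delay 2
  ```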
- ...
- ...
- Support for Ruby 3.
- Removed support for Ruby 2.5 (as it's too old).
- ...
- `BrokenLinkFinder::link_xpath` and `link_xpath=` methods so you can customise how links are extracted from each crawled page using the API.
- An `--xpath` (or just `-x`) command line flag so you can customise how links are extracted when using the command line.
- Changed the default way in which links are extracted from a page. Previously, any element with an `href` or `src` attribute was extracted and checked; now only those links inside the `<body>` are extracted and checked, ignoring the `<head>` section entirely. You can change this behaviour back with `BrokenLinkFinder::link_xpath = '//*/@href | //*/@src'` before you perform a crawl (see the sketch after this list). Alternatively, if using the command line, use the `--xpath '//*/@href | //*/@src'` option.
- Scheme relative bug by upgrading to `wgit` v0.10.0.
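  A minimal sketch of the API route, assuming the `Finder` usage shown elsewhere in this changelog (the URL is a placeholder):

  ```ruby
  require 'broken_link_finder'

  # Restore the old extraction behaviour: any element's href/src attribute,
  # <head> section included.
  BrokenLinkFinder::link_xpath = '//*/@href | //*/@src'

  finder = BrokenLinkFinder::Finder.new
  finder.crawl_site 'http://example.com' # Placeholder URL.
  ```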
- ...
- Updated the `wgit` gem to version 0.9.0, which contains improvements and bug fixes.
- ...
- Additional crawl statistics.
- Exit code handling to the executable: `0` for success, `1` for an error scenario.
- Updated the report formats slightly, bringing various improvements such as the total number of links crawled etc.
- Bug in the HTML report; the summary URL is now an `<a>` link.
- Bug in `Finder@broken_link_map` URLs and the `Finder#crawl_stats[:url]` URL during redirects.
- Bug causing an error when crawling unparsable/invalid URLs.
- A `--html` flag to the `crawl` executable command which produces an HTML report (instead of text).
- Added a 'retry' mechanism for any broken links found. This is essentially a verification step before generating a report.
- `Finder#crawl_stats` for info such as crawl duration, total links crawled etc. (see the sketch after this list).
- The API has changed somewhat. See the docs for the up-to-date code signatures if you're using `broken_link_finder` outside of its executable.
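  A sketch of reading the new crawl statistics; apart from `:url`, the hash keys below are assumptions based on the descriptions above:

  ```ruby
  require 'broken_link_finder'

  finder = BrokenLinkFinder::Finder.new
  finder.crawl_site 'http://example.com' # Placeholder URL.

  stats = finder.crawl_stats
  puts stats[:url]      # The crawled URL (kept accurate across redirects).
  puts stats[:duration] # Assumed key for the crawl duration.
  ```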
- ...
- ...
- Now using optimistic dependency versioning.
- Updated `wgit` to version 0.5.1, containing improvements and bug fixes.
- ...
- ...
- Updated the `wgit` gem to version 0.5.0, which contains improvements and bug fixes.
- ...
- ...
- ...
- A bug resulting in some servers dropping crawl requests from `broken_link_finder`.
- ...
- Updated the `wgit` gem to version 0.4.0, which brings a speed boost to crawls.
- ...
- `BrokenLinkFinder::Finder.crawl_site` alias: `crawl_r`. See the sketch after this list.
- Upgraded `wgit` to v0.2.0.
- Refactored the code base (no breaking changes).
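  For example, the two calls below are now equivalent (placeholder URL):

  ```ruby
  require 'broken_link_finder'

  finder = BrokenLinkFinder::Finder.new
  finder.crawl_site 'http://example.com' # Recursively crawl the whole site...
  finder.crawl_r    'http://example.com' # ...or the same via the new alias.
  ```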
- ...
- The `version` command to the executable.
- The `--threads` (aka `-t`) option to the executable's `crawl` command to control crawl speed vs. resource usage.
- Changed the default number of maximum threads for a recursive crawl from 30 to 100. Users will see a speed boost with increased resource usage as a result. This is configurable using the new `crawl` command option e.g. `--threads 30` (see the sketch after this list).
- Several bugs by updating the `wgit` dependency.
- A bug in the report logic causing an incorrect link count.
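  For example, to cap a recursive crawl at the old maximum (placeholder URL; any recursive-crawl flags are omitted):

  ```sh
  # Use at most 30 threads, trading crawl speed for lower resource usage.
  broken_link_finder crawl http://example.com --threads 30
  ```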
- ...
- ...
- Updated the `wgit` dependency, containing bug fixes.
- Logic to prevent re-crawling links for more efficiency.
- Updated the `wgit` gem, which fixes a bug in `crawl_site` and adds support for IRIs.
- Bug where an error from the executable wasn't being rescued.
- Added the `--verbose` flag to the executable for displaying all ignored links.
- Added the `--concise` flag to the executable for displaying the broken links in summary form.
- Added the `--sort-by-link` flag to the executable for displaying the broken links found and the pages containing that link (as opposed to sorting by page by default). See the sketch after this list.
- Changed the default sorting (format) for ignored links to be summarised (much more concise), reducing noise in the reports.
- Updated the `README.md` to reflect the new changes.
- Bug where the broken/ignored links weren't being ordered consistently between runs. Now, all links are reported alphabetically. This will change existing report formats.
- Bug where an anchor of `#` was being returned as broken when it shouldn't be.
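  Illustrative invocations of the new flags (placeholder URL):

  ```sh
  broken_link_finder crawl http://example.com --verbose      # Also list every ignored link.
  broken_link_finder crawl http://example.com --concise      # Summarise the broken links found.
  broken_link_finder crawl http://example.com --sort-by-link # Group the affected pages under each broken link.
  ```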
- Support for ignored links e.g. `mailto:`, `tel:` etc. The README has been updated.
- Only HTML files now have their links verified; JS files, for example, do not have their contents checked. This also boosts crawl speed.
- Links are now reported exactly as they appear in the HTML (for easier location after reading the reports).
- Links with anchors aren't regarded as separate pages during a crawl anymore, thus removing duplicate reports.
- Anchor support is now included, meaning the response HTML must include an element with an ID matching that of the anchor in the link's URL; otherwise, it's regarded as broken. Previously, there was no anchor support. See the illustrative markup after this list.
- The README now includes a How It Works section detailing what constitutes a broken link. See this for more information.
- Any element with an `href` or `src` attribute is now regarded as a link. Before, it was just `<a>` elements.
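  As an illustration of the anchor rule (hypothetical markup): a link to `about.html#team` only counts as working if the response HTML of `about.html` contains an element whose ID matches the anchor:

  ```html
  <!-- On the crawled page: -->
  <a href="about.html#team">Meet the team</a>

  <!-- about.html must contain a matching ID, otherwise the link is broken: -->
  <h2 id="team">Meet the team</h2>
  ```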
- ...