All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Update jquery version to 3.5.1.
- Update lodash version to 4.17.20.
- Update puppeteer version to 1.20.0.
- Update request version to 2.88.2.
- Update request-promise version to 4.2.6.
- Update @types/lodash version to 4.14.162.
- Update @types/puppeteer version to 1.20.0.
- Update @types/request-promise version to 4.1.46.
- Fix `crawler.response` returning `null` when connecting to specific chrome instance #354.
- Fix crawler failure to follow urls with `#` hashes in them #332.
- Fix crawler pending indefinitely when mixed content is present #260.
- Fix: 🔒 high-severity lodash vulnerability #339.
- Fix: 🔒 update jquery and lodash to fix Prototype Pollution vulnerability.
- Fix: 🔒 update puppeteer to fix Use After Free vulnerability #350.
- Fix: 🔒 update jQuery to fix XSS vulnerability.
- Set `previousUrl` to `onSuccess` argument.
- Set `options`, `depth`, `previousUrl` to errors.
- Support `customCrawl` for HCCrawler.connect() and HCCrawler.launch()'s options.
- Add Dockerfile and tips for using Docker.
- Drop `newpage` event.
- Update Puppeteer version to 1.5.0.
- Fix a bug of not marking skipped requests correctly.
- Fix `requestfinished` event's argument as described in the API reference.
- Support `cookies` for crawler.queue()'s options.
- Make `onSuccess` pass `cookies` in the response.
- Update Puppeteer version to 1.4.0.
- Support `viewport` and `skipRequestedRedirect` for crawler.queue()'s options.
- Emit `requestdisallowed` event.
- Make `onSuccess` pass `redirectChain` in the response.
- Bump Node.js version up to 8.10.0.
- Update Puppeteer version to 1.3.0.
- Move node_redis to the peer dependencies.
- Make crawler.queue() return a Promise.
- Fix a bug of silently failing to insert jQuery due to CSP.
- Support `waitFor` for crawler.queue()'s options.
- Support `slowMo` for HCCrawler.connect()'s options.
- Fix a bug of not being allowed to set the `timeout` option per request.
- Fix a bug of crawling twice if one URL has a trailing slash on the root folder and the other does not.
- Support `browserCache` for crawler.queue()'s options.
- Support `depthPriority` option again.
- Drop `depthPriority` for crawler.queue()'s options.
- Emit `newpage` event.
- Support `deniedDomains` and `depthPriority` for crawler.queue()'s options.
- Allow `allowedDomains` option to accept a list of regular expressions.
- Support `followSitemapXml` for crawler.queue()'s options.
- Fix a bug of not showing console message properly.
- Fix a bug of listing response properties as methods.
- Fix a bug of not obeying robots.txt.
- Add HCCrawler.defaultArgs() method.
- Emit `requestretried` event.
- Use `cache` option not only for remembering already requested URLs but also as the request queue for distributed environments.
- Move `onSuccess`, `onError` and `maxDepth` options from HCCrawler.connect() and HCCrawler.launch() to crawler.queue().
- Support `obeyRobotsTxt` for crawler.queue()'s options.
- Support `persist` for RedisCache's constructing options.
- Make `cache` required for HCCrawler.connect() and HCCrawler.launch()'s options.
- Provide `skipDuplicates` to remember and skip duplicate URLs, instead of passing `null` to the `cache` option.
- Modify `BaseCache` interface.
- Support CSV and JSON Lines formats for exporting results.
- Emit `requeststarted`, `requestskipped`, `requestfinished`, `requestfailed`, `maxdepthreached`, `maxrequestreached` and `disconnected` events.
- Improve debug logs by tracing public APIs and events.
- Allow `onSuccess` and `evaluatePage` options as `null`.
- Change `crawler.isPaused`, `crawler.queueSize`, `crawler.pendingQueueSize` and `crawler.requestedCount` from read-only properties to methods.
- Fix a bug of ignoring maxDepth option.
- Refactor by changing the style of requiring cache directory.
- Fix a bug of starting more crawlers than maxConcurrency when requests fail.
- Automatically collect and follow links found in the requested page.
- Support `maxDepth` for crawler.queue()'s options.
- Support `screenshot` for crawler.queue()'s options.
- Rename `ensureCacheClear` to `persistCache` for HCCrawler.connect() and HCCrawler.launch()'s options.
- Support `maxRequest` for HCCrawler.connect() and HCCrawler.launch()'s options.
- Support `allowedDomains` and `userAgent` for crawler.queue()'s options.
- Support pluggable cache such as SessionCache, RedisCache and BaseCache interface for customizing caches.
- Add crawler.setMaxRequest(), crawler.pause() and crawler.resume() methods.
- Add crawler.pendingQueueSize and crawler.requestedCount read-only properties.
- Add CHANGELOG.md based on Keep a Changelog.
- Add unit tests.
- Automatically dismiss dialogs.
- Improve performance by setting up pages in parallel.
- Support `extraHeaders` for crawler.queue()'s options.
- Add comment in JSDoc style.
- Public API to launch a browser has changed. Now you can launch browser by HCCrawler.launch().
- Rename `shouldRequest` to `preRequest` for crawler.queue()'s options.
- Refactor by separating `HCCrawler` and `Crawler` classes.
- Refactor handlers for options.
- Add test with mocha and power-assert.
- Add coverage with istanbul.
- Add setting for CircleCI.
- Add .editorconfig.
- Add debug log.
- Migrate from NPM to Yarn.
- Refactor helper to class static method style.