-
Notifications
You must be signed in to change notification settings - Fork 750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: stopping the crawlers gracefully with BasicCrawler.stop()
#2792
Merged
Merged
Changes from all commits
Commits
Show all changes
12 commits
Select commit
Hold shift + click to select a range
284274b
feat: stopping the crawlers gracefully with `BasicCrawler.stop()`
barjin 55210ec
feat: make `BasicCrawler.stop` synchronous (parity w/ Python)
barjin d5e469a
chore: formatting
barjin e538881
chore: type refactoring
barjin 42cd320
chore: copy in error messages
barjin e7388de
docs: mention `crawler.stop` in `keepAlive` docstring
barjin e62befc
chore: add e2e test
barjin d72105e
Merge branch 'master' into feat/crawler-stop
barjin 05b43b9
chore: add e2e test (cheerio-stop-resume)
barjin 1611602
Merge branch 'master' into feat/crawler-stop
barjin 8e9ea84
chore: move test to `-ts` folder to enable TypeScript
barjin 287580c
Merge branch 'master' into feat/crawler-stop
barjin File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
{ | ||
"actorSpecification": 1, | ||
"name": "test-cheerio-stop-resume-ts", | ||
"version": "0.0", | ||
"buildTag": "latest", | ||
"env": null | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
{ | ||
"root": true, | ||
"extends": "../../.eslintrc.json", | ||
"parserOptions": { | ||
"project": "./test/e2e/cheerio-stop-resume-ts/actor/tsconfig.json", | ||
"ecmaVersion": 2022 | ||
} | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
.idea | ||
.DS_Store | ||
node_modules | ||
package-lock.json | ||
apify_storage | ||
crawlee_storage | ||
storage | ||
main.d.ts | ||
main.d.ts.map | ||
main.js | ||
main.js.map |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# using multistage build, as we need dev deps to build the TS source code | ||
FROM apify/actor-node:20-beta AS builder | ||
|
||
# copy all files, install all dependencies (including dev deps) and build the project | ||
COPY . ./ | ||
RUN npm install --include=dev \ | ||
&& npm run build | ||
|
||
# create final image | ||
FROM apify/actor-node:20-beta | ||
# copy only necessary files | ||
COPY --from=builder /usr/src/app/packages ./packages | ||
COPY --from=builder /usr/src/app/package.json ./ | ||
COPY --from=builder /usr/src/app/main.js ./ | ||
|
||
# install only prod deps | ||
RUN npm --quiet set progress=false \ | ||
&& npm install --only=prod --no-optional --no-audit \ | ||
&& npm update --no-audit \ | ||
&& echo "Installed NPM packages:" \ | ||
&& (npm list --only=prod --no-optional --all || true) \ | ||
&& echo "Node.js version:" \ | ||
&& node --version \ | ||
&& echo "NPM version:" \ | ||
&& npm --version | ||
|
||
# run compiled code | ||
CMD npm run start:prod |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
import { CheerioCrawler, Dataset } from '@crawlee/cheerio'; | ||
import { Actor } from 'apify'; | ||
|
||
if (process.env.STORAGE_IMPLEMENTATION === 'LOCAL') { | ||
// @ts-ignore | ||
await Actor.init({ storage: new (await import('@apify/storage-local')).ApifyStorageLocal() }); | ||
} else { | ||
await Actor.init(); | ||
} | ||
|
||
let requestCount = 0; | ||
|
||
const crawler = new CheerioCrawler(); | ||
crawler.router.addDefaultHandler(async ({ $, enqueueLinks, request, log }) => { | ||
const { url } = request; | ||
await enqueueLinks({ | ||
globs: ['https://crawlee.dev/docs/**'], | ||
}); | ||
|
||
const pageTitle = $('title').first().text(); | ||
log.info(`URL: ${url} TITLE: ${pageTitle}`); | ||
await Dataset.pushData({ url, pageTitle }); | ||
|
||
if (requestCount++ > 10) crawler.stop(); | ||
}); | ||
|
||
await crawler.run(['https://crawlee.dev/docs/quick-start']); | ||
|
||
requestCount = 0; | ||
await crawler.run(['https://crawlee.dev/docs/quick-start'], { purgeRequestQueue: false }); | ||
await Actor.exit({ exit: Actor.isAtHome() }); |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
{ | ||
"name": "test-cheerio-stop-resume-ts", | ||
"version": "0.0.1", | ||
"description": "Crawler Stop-Resume Test - TypeScript", | ||
"dependencies": { | ||
"apify": "next", | ||
"@apify/storage-local": "^2.1.3", | ||
"@crawlee/basic": "file:./packages/basic-crawler", | ||
"@crawlee/browser-pool": "file:./packages/browser-pool", | ||
"@crawlee/http": "file:./packages/http-crawler", | ||
"@crawlee/cheerio": "file:./packages/cheerio-crawler", | ||
"@crawlee/core": "file:./packages/core", | ||
"@crawlee/memory-storage": "file:./packages/memory-storage", | ||
"@crawlee/types": "file:./packages/types", | ||
"@crawlee/utils": "file:./packages/utils" | ||
}, | ||
"overrides": { | ||
"apify": { | ||
"@crawlee/core": "file:./packages/core", | ||
"@crawlee/types": "file:./packages/types", | ||
"@crawlee/utils": "file:./packages/utils" | ||
} | ||
}, | ||
"devDependencies": { | ||
"@apify/tsconfig": "^0.1.0", | ||
"typescript": "^5.0.0" | ||
}, | ||
"scripts": { | ||
"start": "tsc && node main.js", | ||
"start:prod": "node main.js", | ||
"build": "tsc" | ||
}, | ||
"type": "module", | ||
"license": "ISC" | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
{ | ||
"extends": "@apify/tsconfig", | ||
"compilerOptions": { | ||
"module": "ES2022", | ||
"target": "ES2022", | ||
"lib": ["DOM"] | ||
}, | ||
"include": ["./**/*.ts"] | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
import { initialize, getActorTestDir, runActor, expect } from '../tools.mjs'; | ||
|
||
const testActorDirname = getActorTestDir(import.meta.url); | ||
await initialize(testActorDirname); | ||
|
||
const { stats, datasetItems } = await runActor(testActorDirname); | ||
|
||
/// Some extra requests are expected (at most 10 extra for each run). | ||
await expect(stats.requestsFinished < 40, 'crawler.stop() works'); | ||
|
||
const visitedUrls = new Set(datasetItems.map((x) => x.url)); | ||
await expect(visitedUrls.size === datasetItems.length, 'stateful crawler.run({ purgeRQ: false }) works'); |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the motivation for not making the method async and ignoring the promise instead? feels handy to be able to await it (and users can always decide themselves to ignore the promise, unlike with the current solution)
also I don't think you need the
async
in hereThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I planned it like this in the beginning, see 284274b
The reasons for moving to this was:
a. parity w/ Python version
b.
^ this looks like a completely harmless piece of code, but will cause deadlock.
requestHandler
is waiting forautoscaledPool.pause
, which is waiting for all tasks to finish (including the aforementionedrequestHandler
).You and me both, but
@typescript-eslint/promise-function-async
thinks otherwise.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, I'm definitely for making this
async
, but we need to address these somehow. Also, note that even with the implementation from 284274b ,crawler.stop
resolves onceautoscaledPool.abort()
has resolved, but that (afaik) doesn't mean thecrawler.run()
has resolved.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm honestly not sure that there is a good reason for an
async stop()
instead of non-async one. If you are expected to call it from a request handler, waiting for the crawler to stop there leads to a deadlock and the reasons are quite intuitive.Should someone need to wait for the crawler to finish outside of the request handler, they can always wait for the promise from
crawler.run()
to resolve.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm always a bit wary about floating promises, but as long as you guys are fine with it (and there is the catch clause to stop it from killing the whole process), let's see how it goes.
I'll still add the promised tests from #2803 , just to stay on the safer side.