fix: Fix `BasicCrawler` statistics persistence #1490

Pijukatel · 2025-10-15T12:23:23Z

Description

Ensure that BasicCrawler persists statistics by default.
Ensure that BasicCrawler recovers existing statistics by default if Configuration.purge_on_start is False.
Let BasicCrawler emit Event.PERSIST_STATE when it finishes.
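
The three behaviours listed above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not Crawlee's implementation: `ToyStore`, `ToyStatistics`, `ToyCrawler`, and the key `"STATS"` are hypothetical names, and a plain dict stands in for the key-value store. Only the `purge_on_start` flag and the idea of persisting state when the crawler finishes come from the description.

```python
from __future__ import annotations

import json


class ToyStore:
    """Hypothetical stand-in for a key-value store (a dict of JSON strings)."""

    def __init__(self) -> None:
        self.records: dict[str, str] = {}


class ToyStatistics:
    """Hypothetical statistics object that can persist and recover itself."""

    KEY = "STATS"  # illustrative key name, not the one used by Crawlee

    def __init__(self, store: ToyStore) -> None:
        self._store = store
        self.requests_finished = 0

    def recover(self) -> None:
        # Load a previously persisted counter, if one exists.
        raw = self._store.records.get(self.KEY)
        if raw is not None:
            self.requests_finished = json.loads(raw)["requests_finished"]

    def persist(self) -> None:
        self._store.records[self.KEY] = json.dumps(
            {"requests_finished": self.requests_finished}
        )


class ToyCrawler:
    """Hypothetical crawler that persists statistics by default."""

    def __init__(self, store: ToyStore, *, purge_on_start: bool) -> None:
        self._store = store
        self._purge_on_start = purge_on_start
        self.statistics = ToyStatistics(store)

    def run(self, urls: list[str]) -> int:
        if self._purge_on_start:
            # Start clean: drop whatever an earlier run left behind.
            self._store.records.clear()
        else:
            # Continue from the statistics of an earlier run.
            self.statistics.recover()
        for _ in urls:
            self.statistics.requests_finished += 1
        # Models emitting a persist-state event when the crawler finishes.
        self.statistics.persist()
        return self.statistics.requests_finished


store = ToyStore()
first = ToyCrawler(store, purge_on_start=False).run(["a", "b"])
second = ToyCrawler(store, purge_on_start=False).run(["c"])
third = ToyCrawler(store, purge_on_start=True).run(["d"])
```

With a shared store and `purge_on_start=False`, the second run continues from the count persisted by the first run. With `purge_on_start=True`, the count starts from zero again.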

Issues

Closes: Fix Crawler on migration not remembering statistics #1501

Testing

Added unit test
Tested on SDK level: Update test to included statistics before reboot apify-sdk-python#629

vdusek

I'm surprised we use the SDK_CRAWLER_STATISTICS_... key for state persistence. Why is the SDK prefix in Crawlee? Also, since this is internal, we use a double-underscore prefix (__STORAGE_ALIASES_MAPPING, __RQ_STATE_...) for other cases. Could we update the key name, please?

vdusek · 2025-10-17T12:43:53Z

tests/unit/test_configuration.py

-    crawler = HttpCrawler(
-        configuration=configuration,
-        storage_client=storage_client,
-    )
+    service_locator.set_configuration(configuration)
+    service_locator.set_storage_client(storage_client)
+
+    crawler = HttpCrawler()


This is because the RecoverableState of the statistics persists to and recovers from the global storage_client. Since statistics are now persisted by default, it will try to persist to the default global storage_client, which is FileSystem... regardless of the crawler-specific storage_client.

Mentioned here:
#1438 (comment)

I am open to discussion about this.
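
A minimal sketch of the mismatch described in this comment. It uses hypothetical stand-ins (`ToyStorageClient`, `ToyLocator`, `ToyCrawler`, `persist_statistics`) rather than Crawlee's real classes, and it assumes the behaviour stated above: the crawler receives its own storage client, but the statistics state is written through the global one.

```python
from __future__ import annotations


class ToyStorageClient:
    """Hypothetical stand-in for a storage client; records live in a dict."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: dict[str, dict[str, int]] = {}


class ToyLocator:
    """Hypothetical global registry that lazily falls back to a default client."""

    def __init__(self) -> None:
        self._storage_client: ToyStorageClient | None = None

    def set_storage_client(self, client: ToyStorageClient) -> None:
        self._storage_client = client

    def get_storage_client(self) -> ToyStorageClient:
        if self._storage_client is None:
            # Models the file-system default mentioned in the comment.
            self._storage_client = ToyStorageClient("file_system")
        return self._storage_client


class ToyCrawler:
    """Hypothetical crawler that may be given its own storage client."""

    def __init__(
        self,
        locator: ToyLocator,
        storage_client: ToyStorageClient | None = None,
    ) -> None:
        self._locator = locator
        if storage_client is None:
            storage_client = locator.get_storage_client()
        # Storage used for the crawler's own data.
        self.storage_client = storage_client

    def persist_statistics(self, stats: dict[str, int]) -> None:
        # Assumed behaviour: state goes through the global client,
        # regardless of the crawler-specific one.
        self._locator.get_storage_client().records["statistics"] = stats


# Case 1: only a crawler-specific client is passed in.
locator = ToyLocator()
memory = ToyStorageClient("memory")
crawler = ToyCrawler(locator, storage_client=memory)
crawler.persist_statistics({"requests_finished": 3})
mismatch = (crawler.storage_client.name, locator.get_storage_client().name)

# Case 2: the client is registered globally, as in the updated test.
locator2 = ToyLocator()
memory2 = ToyStorageClient("memory")
locator2.set_storage_client(memory2)
crawler2 = ToyCrawler(locator2)
crawler2.persist_statistics({"requests_finished": 3})
```

In the first case the statistics end up in the default global client rather than in the client handed to the crawler. In the second case both point at the same client, which is why the test registers it through the global locator.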

Couldn't we use the storage client passed to the crawler?

We could, but do we want to? I had an inconclusive discussion about this with @janbuchar
I am still not sure about this.

Honestly, I'm kinda disappointed with the amount of edge cases that arose from having a separate service locator for crawlers.

From a "common sense" perspective, the RecoverableState is owned by the crawler and it doesn't make much sense to put the serialized state in a different storage (the global one). Then again, there's a good chance that the crawler-wide storage client will be a memory storage, which is not a great fit for RecoverableState.

But, unless I'm missing something, it should be super rare that somebody will do this intentionally. In my opinion, we should pick one of these options and just show a warning if both the global and crawler-specific storage client are configured.
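
The warning suggested here could look roughly like the sketch below. `resolve_state_storage` is a hypothetical helper, not part of Crawlee, and preferring the crawler-specific client is an arbitrary choice made for illustration, since the thread leaves open which option to pick.

```python
from __future__ import annotations

import warnings


def resolve_state_storage(global_client: object | None, crawler_client: object | None) -> object | None:
    """Pick the client used for persisted state, warning on an ambiguous setup.

    Hypothetical helper: prefers the crawler-specific client when one is given.
    """
    if crawler_client is None:
        return global_client
    if global_client is not None and crawler_client is not global_client:
        warnings.warn(
            "Both a global and a crawler-specific storage client are configured; "
            "persisted state will use the crawler-specific one.",
            UserWarning,
            stacklevel=2,
        )
    return crawler_client


global_client = object()
crawler_client = object()

# Capture warnings so the ambiguous case can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chosen = resolve_state_storage(global_client, crawler_client)
```

Here `chosen` is the crawler-specific client and `caught` holds the single warning raised for the ambiguous configuration. Passing only one of the two clients returns it without a warning.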

src/crawlee/crawlers/_basic/_basic_crawler.py

TODO: Figure out reason for stats difference in request_total_finished_duration

src/crawlee/crawlers/_basic/_basic_crawler.py

vdusek

LGTM

janbuchar

LGTM. Please look at my comment about service locators and do whatever you deem appropriate.

Pijukatel added 3 commits October 2, 2025 15:38

WIp, fix failing tests

7dd0f66

Add comment for remembering where left

f3b9812

Start _crawler_state_rec_task when active contexts (to allow persis…

3a28d42

…tent state init)

github-actions bot assigned Pijukatel Oct 15, 2025

github-actions bot added this to the 125th sprint - Tooling team milestone Oct 15, 2025

github-actions bot added the t-tooling and tested labels Oct 15, 2025

Test Windows related issue

ebc350d

Pijukatel force-pushed the crawler-persistance branch from 55e7316 to ebc350d October 15, 2025 14:31

Pijukatel added 2 commits October 16, 2025 11:26

Fix test on windows

52daca0

Merge remote-tracking branch 'origin/master' into crawler-persistance

b2b4724

Pijukatel requested review from janbuchar and vdusek October 16, 2025 13:27

Pijukatel marked this pull request as ready for review October 16, 2025 13:27

vdusek requested changes Oct 17, 2025

Review comments

675dadf

TODO: Figure out reason for stats difference in request_total_finished_duration

vdusek reviewed Oct 20, 2025

src/crawlee/crawlers/_basic/_basic_crawler.py (outdated, resolved)

src/crawlee/crawlers/_basic/_basic_crawler.py (outdated, resolved)

Add _crawler_state_rec_task to other context managers

37f6473

Pijukatel requested a review from vdusek October 20, 2025 09:35

vdusek approved these changes Oct 22, 2025

janbuchar changed the title ~~fix: Fix BasicCrawler statistics persistance~~ fix: Fix BasicCrawler statistics persistence Oct 22, 2025

janbuchar approved these changes Oct 22, 2025

Persist Crawler statistics to Crawler KVS

00e65ec

Pijukatel force-pushed the crawler-persistance branch from 7ba7190 to 00e65ec October 23, 2025 09:11

Pijukatel merged commit 1eb1c19 into master Oct 23, 2025
19 checks passed

Pijukatel deleted the crawler-persistance branch October 23, 2025 10:55

4 participants
