Skip to content

Conversation

@Pijukatel
Copy link
Collaborator

@Pijukatel Pijukatel commented Oct 15, 2025

Description

  • Ensure that BasicCrawler is persisting statistics by default.
  • Ensure that BasicCrawler is recovering existing statistics by default if Configuration.purge_on_start is False.
  • Let the BasicCrawler emit Event.PERSIST_STATE when finishing.

Issues

Testing

@github-actions github-actions bot added this to the 125th sprint - Tooling team milestone Oct 15, 2025
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Oct 15, 2025
@Pijukatel Pijukatel force-pushed the crawler-persistance branch from 55e7316 to ebc350d Compare October 15, 2025 14:31
@Pijukatel Pijukatel requested review from janbuchar and vdusek October 16, 2025 13:27
@Pijukatel Pijukatel marked this pull request as ready for review October 16, 2025 13:27
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised we use the SDK_CRAWLER_STATISTICS_... key for state persistence. Why is the SDK prefix in Crawlee? Also, since this is internal, we use a double-underscore prefix (__STORAGE_ALIASES_MAPPING, __RQ_STATE_...) for other cases. Could we update the key name, please?

Comment on lines 44 to 47
crawler = HttpCrawler(
configuration=configuration,
storage_client=storage_client,
)
service_locator.set_configuration(configuration)
service_locator.set_storage_client(storage_client)

crawler = HttpCrawler()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because RecoverableState of statistics persists to/recovers from global storage_client. And since statistics is persisted by default now, it will try to persist to default global service_client, which is FileSystem... regardless of the crawler-specific storage_client

Mentioned here:
#1438 (comment)

I am open to discussion about this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we use the storage client passed to the crawler?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but do we want to? I had an inconclusive discussion about this with @janbuchar
I am still not sure about this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I'm kinda disappointed with the amount of edge cases that arose from having a separate service locator for crawlers.

From a "common sense" perspective, the RecoverableState is owned by the crawler and it doesn't make much sense to put the serialized state in a different storage (the global one). Then again, there's a good chance that the crawler-wide storage client will be a memory storage, which is not a great fit for RecoverableState.

But, unless I'm missing something, it should be super rare that somebody will do this intentionally. In my opinion, we should pick one of these options and just show a warning if both the global and crawler-specific storage client are configured.

TODO: Figure out reason for stats difference in request_total_finished_duration
@Pijukatel Pijukatel requested a review from vdusek October 20, 2025 09:35
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@janbuchar janbuchar changed the title fix: Fix BasicCrawler statistics persistance fix: Fix BasicCrawler statistics persistence Oct 22, 2025
Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please look at my comment about service locators and do whatever you deem appropriate.

@Pijukatel Pijukatel force-pushed the crawler-persistance branch from 7ba7190 to 00e65ec Compare October 23, 2025 09:11
@Pijukatel Pijukatel merged commit 1eb1c19 into master Oct 23, 2025
19 checks passed
@Pijukatel Pijukatel deleted the crawler-persistance branch October 23, 2025 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix Crawler on migration not remembering statistics

4 participants