-
Couldn't load subscription status.
- Fork 509
fix: Fix BasicCrawler statistics persistence
#1490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
55e7316 to
ebc350d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised we use the SDK_CRAWLER_STATISTICS_... key for state persistence. Why is the SDK prefix in Crawlee? Also, since this is internal, we use a double-underscore prefix (__STORAGE_ALIASES_MAPPING, __RQ_STATE_...) for other cases. Could we update the key name, please?
tests/unit/test_configuration.py
Outdated
| crawler = HttpCrawler( | ||
| configuration=configuration, | ||
| storage_client=storage_client, | ||
| ) | ||
| service_locator.set_configuration(configuration) | ||
| service_locator.set_storage_client(storage_client) | ||
|
|
||
| crawler = HttpCrawler() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because RecoverableState of statistics persists to/recovers from global storage_client. And since statistics is persisted by default now, it will try to persist to default global service_client, which is FileSystem... regardless of the crawler-specific storage_client
Mentioned here:
#1438 (comment)
I am open to discussion about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't we use the storage client passed to the crawler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but do we want to? I had an inconclusive discussion about this with @janbuchar
I am still not sure about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I'm kinda disappointed with the amount of edge cases that arose from having a separate service locator for crawlers.
From a "common sense" perspective, the RecoverableState is owned by the crawler and it doesn't make much sense to put the serialized state in a different storage (the global one). Then again, there's a good chance that the crawler-wide storage client will be a memory storage, which is not a great fit for RecoverableState.
But, unless I'm missing something, it should be super rare that somebody will do this intentionally. In my opinion, we should pick one of these options and just show a warning if both the global and crawler-specific storage client are configured.
TODO: Figure out reason for stats difference in request_total_finished_duration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
BasicCrawler statistics persistanceBasicCrawler statistics persistence
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please look at my comment about service locators and do whatever you deem appropriate.
7ba7190 to
00e65ec
Compare
Description
BasicCrawleris persisting statistics by default.BasicCrawleris recovering existing statistics by default ifConfiguration.purge_on_startis False.BasicCrawleremitEvent.PERSIST_STATEwhen finishing.Issues
Testing