Description
This is a collection of issues and improvements that were discovered during the re-implementation of AdaptivePlaywrightCrawler in Python.
- Ensure isolation of contexts for static / client-only browsing (see sketch 1 below).
  Example situation: the rendering type predictor decides that both crawling methods should be used, which means the user handler runs twice. The user can modify the context in the handler, for example its "user_data". This can lead to a situation where the second handler works on an already modified context.
- The default result comparator checks only dataset changes. Maybe also add a comparison of added links (see sketch 2 below). This, on the other hand, has to be done carefully, as some sites generate additional query parameters when crawled with a browser. Example of the "same" link:
  Static: https://sdk.apify.com/docs/guides/getting-started
  Browser: https://sdk.apify.com/docs/guides/getting-started?__hsfp=1136113150&__hssc=7591405.1.1735494277124&__hstc=7591405.e2b9302ed00c5bfaee3a870166792181.1735494277124.1735494277124.1735494277124.1
- Document a possible edge case of undesired mutation of global state (see sketch 3 below).
  When static crawling fails, browser crawling is used as a backup. If the `context.use_state` method was already used during static crawling, the global state may already have been modified.
- Consider passing the 'Request' object to the RenderingTypePredictor instead of just 'url' and 'label'. The predictor can then decide what data it will use and does not have to be limited in advance (see sketch 4 below).
- Add support for pre-navigation hooks.
- Tiny mistake in `calculateUrlSimilarity`: the weighted average over the path length is missing a -1 in the denominator. (The first element of the path is excluded from the metric calculation, so it should also be excluded when calculating the weighted average; see sketch 5 below.)
  https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/utils/rendering-type-prediction.ts#L24
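Sketch 1 (context isolation): a minimal sketch of the isolation idea, assuming the adaptive crawler runs the user handler once per crawling method. The helper and parameter names below are hypothetical, not the actual crawlee API.

```python
from copy import deepcopy

# Hypothetical helper, not the actual crawlee API: run the user handler once
# per crawling method, each time against its own copy of the mutable user data,
# so the second run never sees mutations made by the first one.
async def run_both_isolated(request, run_handler):
    static_user_data = deepcopy(request.user_data)
    browser_user_data = deepcopy(request.user_data)

    static_result = await run_handler(request, user_data=static_user_data, rendering_type='static')
    browser_result = await run_handler(request, user_data=browser_user_data, rendering_type='client only')

    # Only the changes made by the run whose results are actually kept should
    # be committed back to the shared request / storages.
    return static_result, browser_result
```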
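Sketch 2 (comparing added links): if the comparator also looks at enqueued links, the URLs likely need to be normalized first, e.g. by dropping known tracking query parameters such as the HubSpot ones from the example above. The ignored-parameter list and function names are illustrative only.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Example list of tracking parameters to ignore (taken from the URLs above);
# a real comparator would probably make this configurable.
IGNORED_QUERY_PREFIXES = ('__hsfp', '__hssc', '__hstc')

def normalize_link(url: str) -> str:
    """Strip ignored query parameters so that 'same' links compare as equal."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(IGNORED_QUERY_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment))

def same_links(static_links: set[str], browser_links: set[str]) -> bool:
    return {normalize_link(url) for url in static_links} == {normalize_link(url) for url in browser_links}
```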
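Sketch 3 (use_state edge case): an illustration of one possible mitigation, letting each attempt work on a copy of the state and committing it only for the attempt whose results are kept. The `run_static`, `run_browser`, `load_state` and `save_state` callables stand in for the real sub-crawler calls and key-value store access; they are not actual crawlee functions.

```python
from copy import deepcopy

async def crawl_with_fallback(request, run_static, run_browser, load_state, save_state):
    committed_state = await load_state()

    # The static attempt works on an isolated copy of the committed state.
    static_state = deepcopy(committed_state)
    try:
        result = await run_static(request, state=static_state)
    except Exception:
        # Static crawling failed. Without this isolation, any `use_state`
        # mutations it made would already be visible to the browser attempt.
        browser_state = deepcopy(committed_state)
        result = await run_browser(request, state=browser_state)
        await save_state(browser_state)
    else:
        await save_state(static_state)

    return result
```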
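Sketch 4 (predictor input): roughly what the changed interface could look like if the whole request is passed in. The method names and the `RenderingType` literal are an approximation, not the exact crawlee interface.

```python
from typing import Literal, Protocol

from crawlee import Request

RenderingType = Literal['static', 'client only']

class RenderingTypePredictor(Protocol):
    """Approximate sketch of the proposed interface."""

    def predict(self, request: Request) -> RenderingType:
        # With the whole Request available, the predictor is free to use
        # request.url, request.label, headers, user_data, ... as it sees fit.
        ...

    def store_result(self, request: Request, rendering_type: RenderingType) -> None:
        ...
```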
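Sketch 5 (the `calculateUrlSimilarity` off-by-one): a minimal Python illustration of the denominator fix, not the actual TypeScript implementation.

```python
# `scores` holds one per-component similarity score per path element;
# the element at index 0 is excluded from the metric.
def average_over_path(scores: list[float]) -> float:
    compared = scores[1:]  # the first path element is excluded
    if not compared:
        return 1.0
    # Buggy:   sum(compared) / len(scores)        # divides by the full length
    # Correct: sum(compared) / (len(scores) - 1)  # excludes the skipped element
    return sum(compared) / (len(scores) - 1)
```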
TBD ... more will be added during migration