Skip to content

Conversation

@barjin
Copy link
Member

@barjin barjin commented Dec 17, 2025

Adds unique crawler id to the BasicCrawler class. Prints a warning on multiple crawlers sharing the state on useState().

There might be more resources shared between different BasicCrawler (-subclass) instances - this needs further investigation.

Closes #3024

@barjin barjin self-assigned this Dec 17, 2025
@barjin barjin requested review from B4nan and janbuchar December 17, 2025 16:51
@barjin
Copy link
Member Author

barjin commented Dec 17, 2025

Looking at the current changes, this is actually not breaking and we could merge it to v3 now.

Thoughts @janbuchar @B4nan ? The id field name is up for discussion.

@janbuchar
Copy link
Contributor

Looking at the current changes, this is actually not breaking and we could merge it to v3 now.

Thoughts @janbuchar @B4nan ? The id field name is up for discussion.

I slightly prefer going for v3. id sounds a bit too generic to me. Maybe we could just pass in the whole state key.

@B4nan
Copy link
Member

B4nan commented Dec 17, 2025

I guess we should handle the crawler statistics this way too, and I would rather keep that for v4. And saying that, I would actually prefer some generic name not bound to the useState case.

@barjin
Copy link
Member Author

barjin commented Dec 18, 2025

The last commits are passing the crawler id to the Statistics constructor - based on my (not too extensive) research, that's all the resources that could have been shared / swapped by mistake before.

Regarding the identifier naming, I really don't have many ideas. How about (crawler|instance)Id?

@janbuchar
Copy link
Contributor

janbuchar commented Dec 18, 2025

SessionPool also comes to mind. Not sure if it's a problem to share it?

@barjin
Copy link
Member Author

barjin commented Dec 18, 2025

I believe there are valid reasons to share the Session instances (e.g. getting cookies using PlaywrightCrawler and then processing the data using HttpCrawler using these cookies). Currently the SessionPool instances all use the same KVS key to store their state.

But this is a good point nonetheless - we should consider the implications of this in the UserPool initiative.

@barjin barjin merged commit 91a7b4c into v4 Dec 18, 2025
6 checks passed
@barjin barjin deleted the fix/shared-use-state branch December 18, 2025 12:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants