Open
Labels
- debt: Code quality improvement or decrease of technical debt.
- solutioning: The issue is not being implemented but only analyzed and planned.
- t-tooling: Issues with this label are in the ownership of the tooling team.
Description
Background
Currently, our approach to HTTP fingerprinting is fragmented across different components. This leads to potential inconsistencies where, for example, HTTP headers might not align with TLS fingerprints or device characteristics, making our scrapers easier to detect. Furthermore, tracking down the code responsible for various parts of the fingerprinting functionality is difficult.
Objective
Create a unified approach to HTTP fingerprinting across all Crawlee components to produce more realistic and consistent scraper behavior. The result will be ported to the JavaScript version of Crawlee as part of v4.
Proposed Solution
- Create a `FingerprintProfile` data structure that encapsulates:
  - HTTP headers collection
  - Browser type and version (for TLS impersonation)
  - Device characteristics (viewport, screen resolution, etc.)
  - Proxy configuration that aligns with the fingerprint's locale/behavior
  - potentially any other stuff I forgot about or that will be added later on
- Integrate this structure across Crawlee components:
  - components responsible for fingerprinting should accept a `FingerprintProfile` instance in the API responsible for handling individual requests
  - HTTP clients should apply appropriate headers and proxy settings
  - Browser Pool should select browsers with matching TLS fingerprints and inject appropriate DOM properties (viewport, locale, ...)
  - the `FingerprintProfile` should probably be included in the `Session` objects
  - the way the `FingerprintProfile` is generated should be configurable, ideally in a way that allows adding custom code
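To make the proposal above more concrete, here is a minimal sketch of what such a structure could look like in Python. Everything in it is an assumption for discussion, not existing Crawlee API: the field names, the `chrome_profile` generator, the stand-in `Session` class, and the `http_request_options` / `browser_launch_options` helpers are all hypothetical. The point is only that headers, TLS impersonation target, device properties and proxy are derived from one object, so they cannot drift apart.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FingerprintProfile:
    """Hypothetical container for everything that shapes a request's fingerprint."""

    headers: dict[str, str]
    browser: str  # browser family, used to pick a TLS impersonation target
    browser_version: str
    viewport: tuple[int, int]  # (width, height)
    screen: tuple[int, int]  # (width, height)
    locale: str
    proxy_url: Optional[str] = None


# Any zero-argument callable can act as a generator, which leaves room for custom code.
ProfileGenerator = Callable[[], FingerprintProfile]


def chrome_profile(
    version: str = "124",
    locale: str = "en-US",
    proxy_url: Optional[str] = None,
) -> FingerprintProfile:
    """Example generator (an assumption): headers are derived from the same
    version and locale that the other components will see."""
    language = locale.split("-")[0]
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            f"(KHTML, like Gecko) Chrome/{version}.0.0.0 Safari/537.36"
        ),
        "Accept-Language": f"{locale},{language};q=0.9",
    }
    return FingerprintProfile(
        headers=headers,
        browser="chrome",
        browser_version=version,
        viewport=(1366, 768),
        screen=(1920, 1080),
        locale=locale,
        proxy_url=proxy_url,
    )


@dataclass
class Session:
    """Stand-in for a session object; not Crawlee's real Session class."""

    id: str
    fingerprint: FingerprintProfile


def new_session(session_id: str, generator: ProfileGenerator = chrome_profile) -> Session:
    """Create a session whose fingerprint comes from a pluggable generator."""
    return Session(id=session_id, fingerprint=generator())


def http_request_options(session: Session) -> dict[str, object]:
    """What a hypothetical HTTP client would read from the profile."""
    profile = session.fingerprint
    return {
        "headers": dict(profile.headers),
        "proxy": profile.proxy_url,
        "impersonate": f"{profile.browser}{profile.browser_version}",
    }


def browser_launch_options(session: Session) -> dict[str, object]:
    """What a hypothetical browser pool would read from the same profile."""
    profile = session.fingerprint
    width, height = profile.viewport
    return {
        "browser": profile.browser,
        "viewport": {"width": width, "height": height},
        "locale": profile.locale,
    }


# Usage: plug in a custom generator, then let both consumers read the same profile.
session = new_session(
    "s1",
    generator=lambda: chrome_profile(locale="de-DE", proxy_url="http://proxy.example:8000"),
)
print(http_request_options(session))
print(browser_launch_options(session))
```

Keeping the profile immutable and attaching it to the session means every request made within that session presents the same identity. Whether the generator should be a plain callable or a richer interface is left open here.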