Skip to content

javapuppteernodejs/curl_cffi

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 

Repository files navigation

How to Solve TLS/JA3 Fingerprinting in Web Scraping with curl_cffi

Struggling with anti-bot measures while scraping websites? curl_cffi, an advanced Python library wrapping around the cURL tool, can help you bypass these barriers effectively. By mimicking browser behavior and leveraging cURL's features, curl_cffi enhances your scraper’s ability to avoid detection and perform smoothly. In this guide, we’ll explore how curl_cffi works, how to use it for various tasks, and address its limitations. We’ll also discuss potential solutions to overcome these limitations.

What is curl_cffi?

curl_cffi is a Python library designed for network requests, similar to libraries like requests and httpx. However, unlike these libraries, curl_cffi can simulate browser TLS/JA3 and HTTP/2 fingerprints. curl-impersonate is a command-line tool that can simulate four major browsers and perform TLS and HTTP handshakes just like a real browser. curl_cffi wraps curl-impersonate into a Python library using cffi.

Struggling with the repeated failure to completely solve the irritating captcha?

Discover seamless automatic captcha solving with CapSolver AI-powered Auto Web Unblock technology!

Claim Your Bonus Code for top captcha solutions; CapSolver: WEBS. After redeeming it, you will get an extra 5% bonus after each recharge, Unlimited

What is TLS/JA3 Fingerprinting?

Today, most websites use HTTPS. To establish an HTTPS connection, a TLS handshake occurs between the server and client, exchanging information such as supported TLS versions and encryption algorithms. Different clients have varying characteristics, and these details are often stable, allowing servers to identify requests as coming from typical user browsers or automated scripts. JA3 is a common algorithm used to generate TLS fingerprints. It works by concatenating these characteristics and computing an MD5 hash.

Using curl_cffi

The usage of curl_cffi is quite similar to requests. Here’s how you can use requests to get the JA3 fingerprint:

import requests

url = "https://tls.browserleaks.com/json"
r = requests.get(url)
print(r.json())

You might get results like this:

{
    "user_agent": "python-requests/2.32.3",
    "ja3_hash": "8d9f7747675e24454cd9b7ed35c58707",
    "ja3_text": "771,4866-4867-4865-49196-49200-49195-49199-52393-52392-159-158-52394-49327-49325-49326-49324-49188-49192-49187-49191-49162-49172-49161-49171-49315-49311-49314-49310-107-103-57-51-157-156-49313-49309-49312-49308-61-60-53-47-255,0-11-10-16-22-23-49-13-43-45-51-21,29-23-30-25-24,0-1-2",
    "ja3n_hash": "a790a1e311289ac1543f411f6ffceddf",
    "ja3n_text": "771,4866-4867-4865-49196-49200-49195-49199-52393-52392-159-158-52394-49327-49325-49326-49324-49188-49192-49187-49191-49162-49172-49161-49171-49315-49311-49314-49310-107-103-57-51-157-156-49313-49309-49312-49308-61-60-53-47-255,0-10-11-13-16-21-22-23-43-45-49-51,29-23-30-25-24,0-1-2",
    "akamai_hash": "",
    "akamai_text": ""
}

If you make repeated requests, you’ll find that your JA3 hash remains the same. However, starting from Chrome version 110, the TLS ClientHello extension order is randomized, making it easier for website developers to block libraries like requests based on JA3 fingerprints. If your requests consistently show the same JA3 fingerprint, they might be identified as coming from a single user, increasing the likelihood of being flagged as a bot.

Here’s how to use curl_cffi to simulate a real JA3 fingerprint:

from curl_cffi import requests

url = "https://tls.browserleaks.com/json"
r = requests.get(url, impersonate="chrome124")
print(r.json())

The impersonate parameter allows you to specify the browser and version to simulate. Supported browsers include Chrome, Chrome Android, Edge, and Safari, with versions continually updated. For detailed information, refer to the curl_cffi GitHub repository. With curl_cffi, the JA3 fingerprint will align with that of a real browser, and starting from Chrome 110, the JA3 fingerprint will change with each request:

{
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "ja3_hash": "c97c8dac4ca1de968fe230de54f3e0f3",
    "ja3_text": "771,4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,16-10-27-18-5-51-23-17513-45-35-43-13-65281-0-11-65037,25497-29-23-24,0",
    "ja3n_hash": "4c9ce26028c11d7544da00d3f7e4f45c",
    "ja3n_text": "771,4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,0-5-10-11-13-16-18-23-27-35-43-45-51-17513-65037-65281,25497-29-23-24,0",
    "akamai_hash": "52d84b11737d980aef856699f885ca86",
    "akamai_text": "1:65536;2:0;4:6291456;6:262144|15663105|0|m,a,s,p"
}

Addressing curl_cffi Limitations

While curl_cffi can simulate real JA3 fingerprints and potentially avoid bot challenges and blocks, it may not always be sufficient. Many sites implement advanced bot protection mechanisms such as hCaptcha, reCaptcha, Geetest, Cloudflare Turnstile, DataDome, and AWS WAF. These systems use complex images and difficult-to-read JavaScript challenges to differentiate between humans and bots. Sometimes, even with a genuine and randomized JA3 fingerprint, bypassing these challenges is unavoidable.

If you encounter CAPTCHA challenges, regardless of the request library you use, they may be unavoidable. However, there’s no need to worry. CapSolver provides a solution to these problems. CapSolver uses AI-based automated web unlocking technology to solve various bot challenges in seconds. Whether dealing with images or complex problems, CapSolver can handle it efficiently. If the solution fails, you won’t incur any costs.

CapSolver also offers a browser extension that automatically resolves CAPTCHAs during data scraping with Selenium. Additionally, an API solution is available for resolving CAPTCHAs and obtaining tokens in frameworks like Scrapy. Everything can be accomplished in just a few seconds. For more details, refer to the CapSolver documentation.

Conclusion

By integrating curl_cffi into your web scraping setup, you can effectively mimic real browser behavior to overcome TLS/JA3 fingerprinting challenges. While curl_cffi provides robust tools for handling these challenges, advanced CAPTCHA and bot detection systems still pose significant obstacles. CapSolver offers a complementary solution to tackle these CAPTCHA challenges seamlessly, ensuring your scraping activities run smoothly.

For further insights and resources, check out the CapSolver website and explore the curl_cffi GitHub repository.