How does the reader bypass Cloudflare's protection? #66
Comments
I am still getting stuck with Cloudflare on certain sites. This site, for example, still gets detected and stops the Reader from accessing the content: https://seleniumbase.io/help_docs/uc_mode/#uc-mode

Does the Reader use something similar to this? It would be nice if it could avoid detection on sites like the example I provided, which is still getting caught.
Looking into the code I see that |
I'm not convinced that the reader bypasses Cloudflare's protection with this plugin alone; the plugin's last commit was at least a year ago. I just developed a very similar reader with all the evasions from SeleniumBase seems like a good solution, but I prefer a solution for
That might be the simple salvage function, which queries the Google web cache.
reader/backend/functions/src/services/puppeteer.ts, lines 467 to 488 in 7c57123
It's not guaranteed to work, though. Alternative approaches may also include querying the Web Archive. Setting the UA to that of a well-known bot, such as Slackbot, GPTBot, or even GoogleSpider, sometimes also works because the site owner accepts them, but in other cases it causes the site to block access outright.
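To illustrate the Web Archive alternative mentioned above, here is a minimal sketch, not the Reader's actual code. It assumes the Wayback Machine availability endpoint (`https://archive.org/wayback/available`) and a runtime with a global `fetch` (for example Node 18+). The names `buildAvailabilityUrl`, `pickSnapshotUrl`, and `salvageFromWebArchive` are hypothetical and do not appear in the repository.

```typescript
// Shape of the JSON returned by the Wayback Machine availability endpoint.
interface WaybackResponse {
  archived_snapshots?: {
    closest?: { available: boolean; url: string; timestamp: string; status: string };
  };
}

// Build the availability query URL for a target page (hypothetical helper).
function buildAvailabilityUrl(target: string): string {
  return `https://archive.org/wayback/available?url=${encodeURIComponent(target)}`;
}

// Return the closest usable snapshot URL, or null when none is listed.
function pickSnapshotUrl(body: WaybackResponse): string | null {
  const closest = body.archived_snapshots?.closest;
  if (!closest || !closest.available || closest.status !== "200") return null;
  return closest.url;
}

// Hypothetical fallback, intended to run only after the direct fetch was blocked.
async function salvageFromWebArchive(target: string): Promise<string | null> {
  const res = await fetch(buildAvailabilityUrl(target));
  if (!res.ok) return null;
  const snapshotUrl = pickSnapshotUrl((await res.json()) as WaybackResponse);
  if (!snapshotUrl) return null;
  const page = await fetch(snapshotUrl);
  return page.ok ? await page.text() : null;
}
```

As with the cache-based salvage, this is best-effort: the archived snapshot may be stale or missing, and the archive applies its own rate limits.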
I found the issue: I'm trying to deploy it to edge compute, but it seems too many people are sending requests to Google, so it hit the rate limit. The salvage itself works fine. Thanks for your reply :D
What are you two, @nomagick and @backrunner, talking about? Google cache was shut down long ago and will not work anymore. Am I missing something?
@erikdemarco It was available previously; it is now closed.
I noticed that the reader can read pages on https://openai.com, which is heavily protected by Cloudflare. Using only 'puppeteer-extra-plugin-stealth' is not enough to bypass Cloudflare's protection. In the source code, there is nothing that solves the captcha automatically, and nothing else related to bypassing the protection.
What I'd like to ask is whether you have made other, more under-the-hood changes to puppeteer that keep the reader from being detected by Cloudflare.
We're trying to privately deploy a similar service, but are having trouble getting a close approximation in terms of accessing page content, mainly because there's no way to get around the protection.