
How does the reader bypass Cloudflare's protection #66

Closed
backrunner opened this issue May 22, 2024 · 7 comments

Comments

@backrunner

I noticed that the reader can read content from pages on https://openai.com, which is heavily protected by Cloudflare. Using only 'puppeteer-extra-plugin-stealth' is not enough to bypass Cloudflare's protection.

In the source code, there's nothing that solves the captcha automatically, and nothing else related to bypassing the protection.

What I'd like to ask is whether you have other, more under-the-hood changes to puppeteer that keep the reader from being detected by Cloudflare.

We're trying to deploy a similar service privately, but we're having trouble getting close in terms of accessing page content, mainly because we have no way to get around the protection.
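
For reference, this is roughly the setup we tried; a minimal sketch assuming the published puppeteer-extra and puppeteer-extra-plugin-stealth npm packages (readPage is just an illustrative name, not the Reader's actual code):

import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

// Register all of plugin-stealth's evasions before launching the browser.
puppeteer.use(StealthPlugin());

async function readPage(url: string): Promise<string> {
    const browser = await puppeteer.launch({ headless: true });
    try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'networkidle2', timeout: 30_000 });
        // Return the rendered HTML; as noted above, stealth alone often
        // fails against Cloudflare-protected pages.
        return await page.content();
    } finally {
        await browser.close();
    }
}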

@nashdean

I am still getting stuck with Cloudflare on certain sites. This site, for example, still gets detected and stops the Reader from accessing the content: https://www.podbean.com/site/search/index?v={SEARCH+QUERY+HERE}. Before I found Jina-AI, I was able to bypass this by using the seleniumbase BaseCase class. If you are trying a custom solution to avoid detection, I suggest checking out seleniumbase. The creator has good documentation and many YouTube tutorials on its use (it's a wrapper of Selenium).

https://seleniumbase.io/help_docs/uc_mode/#uc-mode

Does the Reader use something similar to this? It would be nice if it could avoid detection on sites like the example I provided, which is still getting caught.

@nashdean

Looking into the code, I see that puppeteer-extra-plugin-stealth is being used by the Reader. Is the team planning to expand it to also add an option for SeleniumBase? It would be nice to have a uc-mode option, which is what works for me right now with Cloudflare.

@backrunner
Author

backrunner commented May 24, 2024

puppeteer-extra-plugin-stealth

puppeteer-extra-plugin-stealth can be detected by Cloudflare. References: https://github.com/berstend/puppeteer-extra/issues?q=is%3Aissue+is%3Aopen+cloudflare

I'm not convinced that the reader bypasses Cloudflare's protection with this plugin alone; the last commit to this plugin was at least a year ago.

I just developed a very similar reader with all the evasions from plugin-stealth, and it doesn't work.

SeleniumBase seems like a good solution, but for now I'd prefer a puppeteer-based one, so I can deploy it to the edge with my TypeScript stack.

@nomagick
Member

That might be the simple salvage function, which queries the Google web cache.

async salvage(url: string, page: Page) {
    this.logger.info(`Salvaging ${url}`);
    // Build the Google web cache URL for the target page.
    const googleArchiveUrl = `https://webcache.googleusercontent.com/search?q=cache:${encodeURIComponent(url)}`;
    // Probe the cache first with a bot UA; only the status matters, so discard the body.
    const resp = await fetch(googleArchiveUrl, {
        headers: {
            'User-Agent': `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)`
        }
    });
    resp.body?.cancel().catch(() => void 0);
    if (!resp.ok) {
        this.logger.warn(`No salvation found for url: ${url}`, { status: resp.status, url });
        return null;
    }
    // Navigate the existing page to the cached copy; navigation errors are tolerated.
    await page.goto(googleArchiveUrl, { waitUntil: ['load', 'domcontentloaded', 'networkidle0'], timeout: 15_000 }).catch((err) => {
        this.logger.warn(`Page salvation did not fully succeed.`, { err: marshalErrorLike(err) });
    });
    this.logger.info(`Salvation completed.`);
    return true;
}

It's not guaranteed to work, though.

Alternative approaches may also include querying from the Web Archive.

However, puppeteer-extra-plugin-stealth somehow doesn't work well with the Web Archive at some level.
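
For illustration, the Web Archive route could go through the public Wayback availability API (https://archive.org/wayback/available); salvageFromWayback is a hypothetical helper, not something in the Reader:

async function salvageFromWayback(url: string): Promise<string | null> {
    // Ask the Wayback Machine whether it holds a snapshot of this URL.
    const apiUrl = `https://archive.org/wayback/available?url=${encodeURIComponent(url)}`;
    const resp = await fetch(apiUrl);
    if (!resp.ok) {
        return null;
    }
    const data = await resp.json() as {
        archived_snapshots?: { closest?: { available: boolean; url: string } };
    };
    const closest = data.archived_snapshots?.closest;
    // Return the snapshot URL when a usable copy exists; the page can then
    // be navigated to it the same way salvage() uses the Google cache URL.
    return closest?.available ? closest.url : null;
}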

Setting the UA to one of the famous bots, like Slackbot, GPTBot, or even GoogleSpider, sometimes works as well, because the site owner accepts them; in other cases, it triggers the site to block access outright.
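
In puppeteer that trick is just a UA override before navigation; a minimal sketch (gotoAsBot is an illustrative name):

import type { Page } from 'puppeteer';

// Best-effort only: some sites allowlist well-known crawler UAs, while
// others block them outright, as described above.
async function gotoAsBot(page: Page, url: string) {
    await page.setUserAgent(
        'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)'
    );
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 15_000 });
}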

@backrunner
Author

That might be the simple salvage function, which queries the Google web cache.

I found the issue. I'm trying to deploy it to edge compute, but it seems too many people are requesting Google, so it hit the rate limit. The salvage itself works fine.

Thanks for your reply :D

@erikdemarco

What are you both, @nomagick @backrunner, talking about? Google cache has long been shut down, and it will not work anymore. Am I missing something?

@nomagick
Member

nomagick commented Nov 8, 2024

@erikdemarco Previously it was available; it's now closed.
