Replies: 5 comments
-
I agree with this perspective. I think the first 3 categories (search engine, copyright, and site monitoring, plus archive bots) are permissible and worth allowing. We can always add more restrictions later if needed.
-
I'd vote for throttling them, but I don't hold strong opinions on any of the options. I don't think it's realistic that these bots are going to use the API (with or without an API key).
-
Forgot to share a related issue: #3900. Unfortunately, some of these bots do not seem to honor robots.txt.
-
Agreed. Block them based on user agent as we find them.
We can try to block these using user agent string matches, but if they are ignoring robots.txt then I'd wager they don't care about providing any intelligible heuristic for servers to identify them by (i.e., they intentionally obscure themselves so they cannot be blocked; in other words, they are bad actors on the open web). Throttling, managed challenges, etc. will work in those cases. All of those can be worked around by someone persistent enough, but we'll never fully solve that problem, and it's a waste of our time and brain power to try.

If we stop the most common cases (regular crawlers we don't want) and stick to the easiest methods (robots.txt, and basic UA matching when that's ignored), then I think we'll be fine. We need to be okay with accepting that some bad actors do not care to respect our resources, and there's only so much worth doing to try to prevent them specifically. Anything we do for general bot protection will still cover the ones that aren't stopped by specific rules.
-
Let's plan to block in Cloudflare with UA matching for known-but-aggressive crawlers we don't want to provide access to. Thanks for the feedback, everyone.
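For reference, a minimal sketch of what such a rule could look like in Cloudflare's custom rule expression language, assuming we match on a handful of well-known AI crawler user agents (the specific strings below are illustrative, not a final list):

```
(http.user_agent contains "GPTBot")
or (http.user_agent contains "CCBot")
or (http.user_agent contains "Bytespider")
```

with the rule action set to Block (or Managed Challenge, if we'd rather slow a crawler down than refuse it outright).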
-
The modern web is full of automated traffic from search engines and scrapers. The proliferation of AI has exacerbated this issue significantly.
Cloudflare has some useful docs describing different bots, although they generously refer to many things as "good bots"[^1].
At Openverse, we need to think about how to handle these. Which do we block? Which do we attempt to throttle, with `Crawl-Delay` in robots.txt?

My personal opinion is that we go ahead and aggressively block any AI and machine learning bots with firewall rules in Cloudflare. With the existence of the Openverse API, I think it's extremely reasonable that most programmatic access of our data happens there and not through frontend scraping.
[^1]: https://www.cloudflare.com/learning/bots/how-to-manage-good-bots/