Replies: 5 comments
-
I agree with this perspective. I think the first 3 categories (search engine, copyright, and site monitoring, plus archive bots) are permissible and worth allowing. We can always add more restrictions later if needed.
-
I'd vote for throttling them, but I don't hold strong opinions on any of the options. I don't think it's realistic that these bots are going to use the API (with or without an API key).
-
Forgot to share a related issue: #3900. Unfortunately, some of these bots do not seem to honor robots.txt.
-
Agreed. Block them based on user agent as we find them.
We can try to block these using user agent string matches, but if they are ignoring robots.txt then I'd wager they don't care about providing any intelligible heuristic for servers to identify them by (i.e., they intentionally obscure themselves so they cannot be blocked; in other words, they are bad actors on the open web). Throttling, managed challenges, etc. will work in those cases. All of those can be worked around by someone persistent enough, but we'll never fully solve that problem, and it's a waste of our time and brain power to try.

If we stop the most common cases (regular crawlers we don't want) and stick to the easiest methods (robots.txt, and basic UA matching when that's ignored), then I think we'll be fine. We need to be okay with accepting that some bad actors do not care to respect our resources, and there's only so much worth doing to try to prevent them specifically. Anything we do for general bot protection will still cover the ones that aren't stopped by specific rules.
-
Let's plan to block in Cloudflare with UA matching for known-but-aggressive crawlers we don't want to provide access to. Thanks for the feedback, everyone.
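For reference, a minimal sketch of what such a rule could look like in Cloudflare's custom rule expression language, assuming we match on a handful of well-known AI crawler user agents (the specific strings below are illustrative, not a final list):

```
(http.user_agent contains "GPTBot")
or (http.user_agent contains "CCBot")
or (http.user_agent contains "Bytespider")
```

with the rule action set to Block (or Managed Challenge, if we'd rather slow a crawler down than refuse it outright).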
-
The modern web is full of automated traffic from search engines and scrapers. The proliferation of AI has exacerbated this issue significantly.
Cloudflare has some useful docs describing different bots, although they generously refer to many things as "good bots"[^1].
At Openverse, we need to think about how to handle these. Which do we block? Which do we attempt to throttle, with `Crawl-Delay` in robots.txt?

My personal opinion is that we go ahead and aggressively block any AI and machine learning bots with firewall rules in Cloudflare. With the existence of the Openverse API, I think it's extremely reasonable that most programmatic access of our data happens there and not through frontend scraping.
[^1]: https://www.cloudflare.com/learning/bots/how-to-manage-good-bots/