Add Blacklight-specific content to default robots.txt #3785

@maxkadel

Description

Many Blacklight applications have been struggling with bot traffic. The Rails generator creates a default robots.txt whose only content is the comment # See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file.

We should add some Blacklight-specific configuration to make it easier for new Blacklight apps to manage bot traffic.

@tpendragon and @jrochkind found configuration that helps stop bots from trying to crawl every facet combination, which hammers Solr with complex, nonsensical queries.

Suggestion:

# See https://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
# User-agent: *
# Disallow: /
User-agent: *
# Disallow crawling of facet filter URLs - it just slows down discovery of useful resources; better that bots page through results.
Disallow: /catalog*f[
Disallow: /catalog*f%5B
# Bots can't log in.
Disallow: /users
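To make the two catalog rules concrete: a Blacklight facet filter shows up in the query string as a nested f parameter (e.g. f[format][]=Book), and crawlers may request the brackets either raw or percent-encoded, which is why both Disallow: /catalog*f[ and Disallow: /catalog*f%5B are suggested. The helper below is a hypothetical sketch (not Blacklight code) that mirrors those two rules using the common wildcard interpretation of * in robots.txt:

```ruby
require "uri"

# Hypothetical regexes mirroring the two suggested Disallow rules.
# In robots.txt, "*" matches any run of characters, so "/catalog*f["
# corresponds to the first pattern below.
DISALLOWED = [%r{\A/catalog.*f\[}, %r{\A/catalog.*f%5B}].freeze

def disallowed?(path)
  DISALLOWED.any? { |rule| rule.match?(path) }
end

raw_path     = "/catalog?f[format][]=Book"
encoded_path = "/catalog?" + URI.encode_www_form("f[format][]" => "Book")
plain_path   = "/catalog?q=cats"

disallowed?(raw_path)     # blocked by the raw-bracket rule
disallowed?(encoded_path) # blocked by the percent-encoded rule
disallowed?(plain_path)   # a plain search stays crawlable
```

Note that wildcard support in robots.txt is a de facto extension honored by major crawlers (it is not part of the original 1994 convention), so misbehaving bots may ignore these rules entirely.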

Acceptance criteria

  • The Blacklight install generator updates the default public/robots.txt so that newly generated applications disallow paths that generally should not be crawled by bots.
