Wagtail Robots

CAUTION!

This is a fork of the original wagtail-robots package. The original package maintainer hasn't accepted PRs for a long time.

Keep the master branch up to date with the maintainers' master branch. If their branch has work that is not on our master branch, merge it into ours as needed.

The current latest TAG: https://github.com/torchbox-forks/wagtail-robots/releases/tag/1.2.2%2Btbx

This tag supports Wagtail versions less than 5.2. If you are using Wagtail 5.2 or above, you should use the latest tag shown in releases.

Use the current latest TAG in your project requirements.

For poetry users:

[tool.poetry.dependencies]
wagtail-robots = { git = "https://github.com/torchbox-forks/wagtail-robots", tag="1.2.2+tbx" }

Development

Developing new work/fixes/upgrades should be based on the latest master branch and merged back to the master branch.

If you consider that new work needs a new release, create a new branch from the master branch once your work has been merged to the master branch. Name the new branch using the convention 'stable/[version]', where version is the next version number you want to use (any branch using the stable prefix is automatically protected). We treat each stable branch as a snapshot of the codebase at the time of the release, and we don't merge any further work into it.

Then create the new release and tag it with the new version number and add a suffix of '+tbx' to the version number.

We don't publish new releases to PyPI. We only use the package from the git repository.

END CAUTION

This is a basic Django application for Wagtail to manage robots.txt files following the robots exclusion protocol, complementing the Django Sitemap contrib app.

This started as a fork of Django Robots, but because of the differences between the Django Admin and the Wagtail Admin, and other project requirements, the git history has not been retained.

For installation and configuration instructions, keep reading.

Contents:

.. toctree::
   :maxdepth: 1

   screenshots

Installation

Use your favorite Python installer to install it from PyPI:

pip install wagtail-robots

Or get the source from the application site at:

http://github.com/adrian-turjak/wagtail-robots/

Then follow these steps:

  1. Add 'wagtail_modeladmin' and 'robots' to your INSTALLED_APPS setting.
  2. Run the migrate management command
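As a rough sketch of step 1, the relevant part of a settings file might look like the following. The comment stands in for whatever apps your project already lists; only the two app labels come from the steps above.

```python
# settings.py (fragment): a minimal sketch, not a complete settings file.
INSTALLED_APPS = [
    # ... your existing Django and Wagtail apps go here ...
    "wagtail_modeladmin",  # admin integration required by this package
    "robots",              # the wagtail-robots app itself
]
```

After this, running the migrate management command (step 2) creates the tables for the app.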

You may additionally want to set up the Wagtail sitemap generator.

If you install, or already happen to be using, CondensedInlinePanel, this library will automatically use it in place of InlinePanel for the Rule create and edit pages.

Initialization

To activate robots.txt generation on your Wagtail site, add this line to your URLconf:

re_path(r'^robots\.txt', include('robots.urls')),

This tells Django to build a robots.txt when a robot accesses /robots.txt. Then, please migrate your database to create the necessary tables and create Rule objects in the admin interface or via the shell.

Rules

Rule - defines an abstract rule which is used to respond to crawling web robots, using the robots exclusion protocol, a.k.a. robots.txt.

You can link multiple URL patterns to a rule to allow or disallow the robot identified by its user agent access to the given URLs.

The crawl delay field is supported by some search engines and defines the delay between successive crawler accesses in seconds. If the crawler rate is a problem for your server, you can set the delay up to 5 or 10 or a comfortable value for your server, but it's suggested to start with small values (0.5-1), and increase as needed to an acceptable value for your server. Larger delay values add more delay between successive crawl accesses and decrease the maximum crawl rate to your web server.

Wagtail sites are used to enable multiple robots.txt files per Wagtail instance. If no rule exists, every web robot is automatically allowed access to every URL except Wagtail's admin path (/admin).
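To see how a crawler might read such a file, the sketch below feeds a hand-written robots.txt to Python's standard-library urllib.robotparser. The file contents, the ExampleBot user agent and the crawl delay of 5 are assumptions made for illustration, not output captured from this package.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body: one rule for every robot, with a
# crawl delay and the admin path disallowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /admin
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up user agent; it falls back to the "*" rule.
blog_allowed = parser.can_fetch("ExampleBot", "/blog/")
admin_allowed = parser.can_fetch("ExampleBot", "/admin/pages/")

# urllib.robotparser only records whole-number delays, so a fractional
# value such as 0.5 would be ignored by this particular parser.
delay = parser.crawl_delay("ExampleBot")

print(blog_allowed, admin_allowed, delay)
```

Other crawlers may interpret Crawl-delay differently or ignore it entirely, so treat this only as a way to sanity-check a rule set locally.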

Please have a look at the database of web robots for a full list of existing web robots user agent strings.

URLs

Url - defines a case-sensitive, exact URL pattern which is used to allow or disallow access for web robots.

A pattern without a trailing slash also matches files whose names start with the given pattern, e.g., '/admin' matches /admin.html too.

Some major search engines allow an asterisk (*) as a wildcard to match any sequence of characters and a dollar sign ($) to match the end of the URL, e.g., '/*.jpg$' can be used to match all jpeg files.
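The matching behaviour described above can be approximated in a few lines of Python. This is an illustrative sketch of how a wildcard-aware crawler might interpret a pattern, not code from this package; the helper names robots_pattern_to_regex and pattern_matches are hypothetical.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any sequence of characters and a trailing '$' anchors
    the pattern to the end of the URL. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def pattern_matches(pattern, path):
    """Return True if the path matches the pattern from its start."""
    # re.match anchors at the start only, which gives prefix matching
    # unless the pattern ends with '$'. Matching is case-sensitive.
    return robots_pattern_to_regex(pattern).match(path) is not None


print(pattern_matches("/admin", "/admin.html"))
print(pattern_matches("/*.jpg$", "/images/cat.jpg"))
print(pattern_matches("/*.jpg$", "/images/cat.jpg.bak"))
```

Because support for these wildcards varies between crawlers, a pattern that relies on them may be treated as a plain prefix by robots that do not implement the extension.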

Caching

You can optionally cache the generation of the robots.txt. Add or change the ROBOTS_CACHE_TIMEOUT setting with a value in seconds in your Django settings file:

ROBOTS_CACHE_TIMEOUT = 60*60*24

This tells Django to cache the robots.txt for 24 hours (86400 seconds). The default value is None (no caching).

If you need to, you can also specify exactly which cache to use:

ROBOTS_CACHE_ALIAS="robots"

Unless specified otherwise it will use the default cache.
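Put together, the two settings might sit next to a cache definition like the one below. This is a sketch under assumptions: the "robots" alias and the choice of Django's local-memory backend are examples only, and any backend configured under the alias you name should work the same way.

```python
# settings.py (fragment): a minimal sketch of a dedicated cache.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
    },
    # Example alias used only for the generated robots.txt.
    "robots": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
        "LOCATION": "robots-txt",
    },
}

ROBOTS_CACHE_TIMEOUT = 60 * 60 * 24  # 24 hours, in seconds
ROBOTS_CACHE_ALIAS = "robots"        # must be a key of CACHES
```

Keeping robots.txt in its own alias means it can be cleared or tuned without touching the default cache.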

Sitemaps

By default a Sitemap statement is automatically added to the resulting robots.txt by reverse matching the URL of the installed Wagtail Sitemap app. This is especially useful if you allow every robot to access your whole site, since it then gets URLs explicitly instead of searching every link.

To change the default behaviour to omit the inclusion of a sitemap link, change the ROBOTS_USE_SITEMAP setting in your Django settings file to:

ROBOTS_USE_SITEMAP = False

In case you want to use specific sitemap URLs instead of the one that is automatically discovered, change the ROBOTS_SITEMAP_URLS setting to:

ROBOTS_SITEMAP_URLS = [
    'http://www.example.com/sitemap.xml',
]

If the sitemap view is wrapped in a decorator, the dotted-path reverse used to discover the sitemap URL does not work. To overcome this, give the sitemap view a name in urls.py:

urlpatterns = [
    ...
    re_path(r'^sitemap\.xml$', cache_page(60)(sitemap_view), {'sitemaps': [...]}, name='cached-sitemap'),
    ...
]

and inform wagtail-robots about the view name by adding the following setting:

ROBOTS_SITEMAP_VIEW_NAME = 'cached-sitemap'

Also use ROBOTS_SITEMAP_VIEW_NAME if you use custom sitemap views.

Host directive

By default a Host statement is automatically added to the resulting robots.txt to avoid mirrors and select the main website properly.

To change the default behaviour and omit the host directive, change the ROBOTS_USE_HOST setting in your Django settings file to:

ROBOTS_USE_HOST = False

If you want to prefix the domain with the current request protocol (http or https, as in Host: https://www.mysite.com), add this setting:

ROBOTS_USE_SCHEME_IN_HOST = True

Development/Staging Override

Sometimes, when you have duplicate database content in both a production and a staging website, it can be useful to override any and all database entries for this application and explicitly disallow all.

To do that add this setting:

ROBOTS_DISALLOW_ALL = True

The resulting robots.txt will look as follows:

User-agent: *
Disallow: /
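One way to avoid hard-coding the override is to derive it from the environment, so that only production serves the database-driven rules. The sketch below assumes a hypothetical DEPLOY_ENV environment variable and a hypothetical helper, should_disallow_all; neither is part of this package.

```python
import os


def should_disallow_all(environ):
    """Return True unless the environment is explicitly production.

    DEPLOY_ENV is a hypothetical variable name; use whatever your
    deployment already sets. Defaulting to "disallow" means a
    misconfigured staging site stays hidden from crawlers.
    """
    return environ.get("DEPLOY_ENV", "").lower() != "production"


# settings.py (fragment): the only name the package reads is this one.
ROBOTS_DISALLOW_ALL = should_disallow_all(os.environ)
```

With this in place, a staging or development site answers with the disallow-all robots.txt unless DEPLOY_ENV is set to production.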

Bugs and feature requests

As always your mileage may vary, so please don't hesitate to send feature requests and bug reports:

https://github.com/adrian-turjak/wagtail-robots/issues
