Commit 36028e2

intro sections
1 parent 39549be commit 36028e2

1 file changed: datatxt-spec.txt (32 additions, 61 deletions)

@@ -49,92 +49,63 @@ Table of Contents
 1. Abstract

    This memo defines a method for administrators of sites on the World-
-   Wide Web to give instructions to visiting Web robots, most
-   importantly what areas of the site are to be avoided.
-
-   This document provides a more rigid specification of the Standard
-   for Robots Exclusion [1], which is currently in wide-spread use by
-   the Web community since 1994.
-
+   Wide Web to allow automated tools to easily discover paths to
+   datasets and describe their contents.

 2. Introduction

-   Web Robots (also called "Wanderers" or "Spiders") are Web client
-   programs that automatically traverse the Web's hypertext structure
-   by retrieving a document, and recursively retrieving all documents
-   that are referenced.
-
-   Note that "recursively" here doesn't limit the definition to any
-   specific traversal algorithm; even if a robot applies some heuristic
-   to the selection and order of documents to visit and spaces out
-   requests over a long space of time, it qualifies to be called a
-   robot.
-
-   Robots are often used for maintenance and indexing purposes, by
-   people other than the administrators of the site being visited. In
-   some cases such visits may have undesirable effects which the
-   administrators would like to prevent, such as indexing of an
-   unannounced site, traversal of parts of the site which require vast
-   resources of the server, recursive traversal of an infinite URL
-   space, etc.
+   Data Catalogs have been embraced by government agencies, nonprofit
+   organizations, and for-profit organizations that aim to provide
+   open access to data and transparency. A data catalog often
+   comprises a web interface to search and download datasets;
+   however, it is hard to automate the discovery and extraction of
+   datasets, since each catalog implements a different set of
+   endpoints and protocols.

    The technique specified in this memo allows Web site administrators
-   to indicate to visiting robots which parts of the site should be
-   avoided. It is solely up to the visiting robot to consult this
-   information and act accordingly. Blocking parts of the Web site
-   regardless of a robot's compliance with this method are outside
-   the scope of this memo.
-
+   to indicate to visiting humans and robots where datasets are
+   located within their site.
+
+   It is solely up to the visitor to consult this information and act
+   accordingly: by searching the index and rendering the metadata
+   associated with each dataset, a visitor can download datasets
+   without having to rely on complex methods to infer whether a given
+   page within the site contains datasets.

 3. The Specification

-   This memo specifies a format for encoding instructions to visiting
-   robots, and specifies an access method to retrieve these
-   instructions. Robots must retrieve these instructions before visiting
-   other URLs on the site, and use the instructions to determine if
-   other URLs on the site can be accessed.
+   This memo specifies a format for exposing datasets to Web site
+   visitors, and specifies an access method to retrieve these
+   instructions. Visitors can then choose to provide custom data
+   catalogs that support one or many Web sites, through a user
+   interface or automated methods.

 3.1 Access method

    The instructions must be accessible via HTTP [2] from the site that
    the instructions are to be applied to, as a resource of Internet
    Media Type [3] "text/plain" under a standard relative path on the
-   server: "/robots.txt".
+   server: "/data.txt".

-   For convenience we will refer to this resource as the "/robots.txt
+   For convenience we will refer to this resource as the "/data.txt
    file", though the resource need in fact not originate from a file-
    system.

    Some examples of URLs [4] for sites and URLs for corresponding
-   "/robots.txt" sites:
+   "/data.txt" files:

-       http://www.foo.com/welcome.html     http://www.foo.com/robots.txt
+       https://datatxt.org/                https://datatxt.org/data.txt

-       http://www.bar.com:8001/            http://www.bar.com:8001/robots.txt
+       http://www.bar.com:8001/            http://www.bar.com:8001/data.txt

    If the server response indicates Success (HTTP 2xx Status Code),
-   the robot must read the content, parse it, and follow any
-   instructions applicable to that robot.
+   the visitor can read the content, parse it, and present an
+   appropriate user interface or automated method to process the
+   Web site as a source for a given data catalog.

    If the server response indicates the resource does not exist (HTTP
-   Status Code 404), the robot can assume no instructions are
-   available, and that access to the site is not restricted by
-   /robots.txt.
-
-   Specific behaviors for other server responses are not required by
-   this specification, though the following behaviours are recommended:
-
-   - On server response indicating access restrictions (HTTP Status
-     Code 401 or 403) a robot should regard access to the site
-     completely restricted.
-
-   - On the request attempt resulted in temporary failure a robot
-     should defer visits to the site until such time as the resource
-     can be retrieved.
-
-   - On server response indicating Redirection (HTTP Status Code 3XX)
-     a robot should follow the redirects until a resource can be
-     found.
+   Status Code 404), the visitor should assume the Web site might
+   contain datasets that need to be discovered by other methods.

 3.2 File Format Description
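
To make the access method in section 3.1 concrete, here is a minimal
client sketch, assuming Python 3 and only the standard library; the
names data_txt_url and fetch_data_txt are illustrative and not part of
the specification. It derives the standard "/data.txt" URL for a site,
reads the body on an HTTP 2xx response, and treats a 404 as "no index
published", in which case datasets might still be discoverable by
other methods.

    # Illustrative /data.txt client; assumes Python 3, stdlib only.
    from typing import Optional
    from urllib.error import HTTPError
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def data_txt_url(site_url: str) -> str:
        # Resolve the standard relative path against the site root,
        # e.g. https://datatxt.org/ -> https://datatxt.org/data.txt
        return urljoin(site_url, "/data.txt")

    def fetch_data_txt(site_url: str) -> Optional[str]:
        # Return the text/plain body on HTTP 2xx, or None on 404
        # (per section 3.1 the site might still contain datasets
        # that have to be discovered by other methods).
        try:
            with urlopen(data_txt_url(site_url)) as response:
                return response.read().decode("utf-8", errors="replace")
        except HTTPError as err:
            if err.code == 404:
                return None
            raise  # behaviour for other status codes is out of scope

    if __name__ == "__main__":
        body = fetch_data_txt("https://datatxt.org/")
        if body is not None:
            print(body)  # parse per section 3.2 and render a catalog

Section 3.1 says nothing about status codes other than 2xx and 404, so
the sketch simply propagates them; urlopen already follows redirects,
which is a reasonable default for a well-known path like this.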