@@ -49,92 +49,63 @@ Table of Contents
1. Abstract

This memo defines a method for administrators of sites on the World-
- Wide Web to give instructions to visiting Web robots, most
- importantly what areas of the site are to be avoided.
-
- This document provides a more rigid specification of the Standard
- for Robots Exclusion [1], which is currently in wide-spread use by
- the Web community since 1994.
-
+ Wide Web to allow automated tools to easily discover paths to
+ datasets and describe their contents.

2. Introduction

- Web Robots (also called "Wanderers" or "Spiders") are Web client
- programs that automatically traverse the Web's hypertext structure
- by retrieving a document, and recursively retrieving all documents
- that are referenced.
-
- Note that "recursively" here doesn't limit the definition to any
- specific traversal algorithm; even if a robot applies some heuristic
- to the selection and order of documents to visit and spaces out
- requests over a long space of time, it qualifies to be called a
- robot.
-
- Robots are often used for maintenance and indexing purposes, by
- people other than the administrators of the site being visited. In
- some cases such visits may have undesirable effects which the
- administrators would like to prevent, such as indexing of an
- unannounced site, traversal of parts of the site which require vast
- resources of the server, recursive traversal of an infinite URL
- space, etc.
+ Data catalogs have been embraced by government agencies, nonprofit
+ organizations, and for-profit organizations that aim to provide
+ open access to data and transparency. A data catalog often
+ consists of a web interface to search and download datasets;
+ however, it is hard to automate the discovery and extraction of
+ datasets since each catalog implements a different set of endpoints
+ and protocols.
The technique specified in this memo allows Web site administrators
- to indicate to visiting robots which parts of the site should be
- avoided. It is solely up to the visiting robot to consult this
- information and act accordingly. Blocking parts of the Web site
- regardless of a robot's compliance with this method is outside
- the scope of this memo.
-
+ to indicate to visiting humans and robots where datasets are
+ located within their site.
+
+ It is solely up to the visitor to consult this information and
+ act accordingly: by searching the index, visitors can render the
+ metadata associated with each dataset and download it without
+ having to rely on complex methods to infer whether a given page
+ within the site contains datasets.

3. The Specification

- This memo specifies a format for encoding instructions to visiting
- robots, and specifies an access method to retrieve these
- instructions. Robots must retrieve these instructions before visiting
- other URLs on the site, and use the instructions to determine if
- other URLs on the site can be accessed.
+ This memo specifies a format for exposing datasets to Web site
+ visitors, and specifies an access method to retrieve these
+ instructions. Visitors can then choose to provide custom data
+ catalogs that support one or many Web sites through user
+ interfaces or automated methods.

3.1 Access method

The instructions must be accessible via HTTP [2] from the site that
the instructions are to be applied to, as a resource of Internet
Media Type [3] "text/plain" under a standard relative path on the
- server: "/robots.txt".
+ server: "/data.txt".

- For convenience we will refer to this resource as the "/robots.txt
+ For convenience we will refer to this resource as the "/data.txt
file", though the resource need in fact not originate from a file-
system.
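
As an illustration of this point, the resource can be generated
entirely in memory. The following is a minimal Python sketch (not
part of the specification), assuming the standard library's
http.server; the CATALOG body is a hypothetical placeholder:

    # Minimal sketch: serve "/data.txt" as "text/plain" without any
    # file-system resource behind it.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    CATALOG = "# hypothetical placeholder data.txt content\n"

    class DataTxtHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/data.txt":
                body = CATALOG.encode("utf-8")
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

    if __name__ == "__main__":
        HTTPServer(("", 8001), DataTxtHandler).serve_forever()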
Some examples of URLs [4] for sites and URLs for corresponding
- "/robots.txt" sites:
+ "/data.txt" sites:

- http://www.foo.com/welcome.html http://www.foo.com/robots.txt
+ https://datatxt.org/ https://datatxt.org/data.txt

- http://www.bar.com:8001/ http://www.bar.com:8001/robots.txt
+ http://www.bar.com:8001/ http://www.bar.com:8001/data.txt
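
The mapping from any page on a site to that site's "/data.txt" URL
can be computed mechanically. A short Python sketch (illustrative
only; data_txt_url is a hypothetical helper name) that keeps the
scheme and authority and substitutes the standard relative path:

    # Sketch: derive the "/data.txt" URL for the site a page is on.
    from urllib.parse import urlsplit, urlunsplit

    def data_txt_url(page_url: str) -> str:
        parts = urlsplit(page_url)
        # Keep scheme and authority (host plus optional port) and
        # replace the path with the standard relative path.
        return urlunsplit((parts.scheme, parts.netloc,
                           "/data.txt", "", ""))

    assert data_txt_url("http://www.bar.com:8001/") == \
        "http://www.bar.com:8001/data.txt"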
If the server response indicates Success (HTTP 2xx Status Code),
- the robot must read the content, parse it, and follow any
- instructions applicable to that robot.
+ the visitor can read the content, parse it, and present an
+ appropriate user interface or automated method to process the
+ Web site as a source for a given data catalog.
If the server response indicates the resource does not exist (HTTP
- Status Code 404), the robot can assume no instructions are
- available, and that access to the site is not restricted by
- /robots.txt.
-
- Specific behaviors for other server responses are not required by
- this specification, though the following behaviours are recommended:
-
- - On server response indicating access restrictions (HTTP Status
- Code 401 or 403) a robot should regard access to the site
- completely restricted.
-
- - On the request attempt resulted in temporary failure a robot
- should defer visits to the site until such time as the resource
- can be retrieved.
-
- - On server response indicating Redirection (HTTP Status Code 3XX)
- a robot should follow the redirects until a resource can be
- found.
+ Status Code 404), the visitor should assume the Web site might
+ contain datasets which must be discovered by other methods.
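
A visitor's handling of these responses could look like the
following Python sketch (illustrative only; fetch_data_txt is a
hypothetical name, only the standard library is used, and discovery
by other methods is left to the caller):

    # Sketch: retrieve "/data.txt" and apply the response rules above.
    from typing import Optional
    from urllib.error import HTTPError
    from urllib.request import urlopen

    def fetch_data_txt(site: str) -> Optional[str]:
        """Return the catalog text on HTTP 2xx; return None on 404,
        meaning datasets may still exist on the site but must be
        discovered by other methods."""
        try:
            with urlopen(site.rstrip("/") + "/data.txt") as resp:
                return resp.read().decode("utf-8")
        except HTTPError as err:
            if err.code == 404:
                return None
            raise  # behavior for other responses is not specified

    catalog = fetch_data_txt("https://datatxt.org")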
3.2 File Format Description