Commit ddc2468

committed
Release #597
2 parents 04f2c21 + afacb3a commit ddc2468

4 files changed: +116 -43 lines changed

.env.example

Lines changed: 7 additions & 1 deletion
@@ -23,7 +23,13 @@ MAIL_SENDER='some-email-account@example.com'
2323
# MAIL_SMTP_PASSWORD='XXX'
2424
# MAIL_SMTP_TLS='true'
2525

26-
# Controls URLs that won't be downloaded and re-hosted when importing versions
26+
# URLs that won't be downloaded and re-hosted when importing versions.
27+
# When new page or version data is imported (e.g. via `POST /api/v0/imports`),
28+
# the `uri` field points to a location where the raw HTTP response body is
29+
# stored. If the `uri` host does *not* match one of the values in
30+
# `ALLOWED_ARCHIVE_HOSTS`, the application downloads the data from `uri` and
31+
# stores it (see `lib/archiver` for more). That way, we can ensure data is
32+
# always available to API users from a reliable public location.
2733
ALLOWED_ARCHIVE_HOSTS='https://edgi-web-monitoring-db.s3.amazonaws.com/ https://edgi-wm-versionista.s3.amazonaws.com/ https://edgi-wm-versionista.s3-us-west-2.amazonaws.com/ https://s3-us-west-2.amazonaws.com/edgi-wm-versionista/ https://edgi-versionista-archive.s3.amazonaws.com/ https://edgi-versionista-archive.s3-us-west-2.amazonaws.com/ https://s3-us-west-2.amazonaws.com/edgi-versionista-archive/'
2834

2935
# OPTIONAL: Uncomment & fill in to use S3 for storage instead of your local

Gemfile

Lines changed: 5 additions & 5 deletions
@@ -7,25 +7,25 @@ end
77

88
ruby '2.6.3'
99

10-
gem 'aws-sdk-s3', '~> 1.46'
10+
gem 'aws-sdk-s3', '~> 1.48'
1111
gem 'devise'
1212
gem 'httparty'
1313
gem 'jwt', '~> 2.2'
1414
gem 'rails', '~> 5.2.3'
1515
gem 'pg', '~> 1.1'
16-
gem 'puma', '~> 4.0'
16+
gem 'puma', '~> 4.1'
1717
gem 'rack-cors', :require => 'rack/cors'
1818
gem 'resque'
1919
gem 'resque-heroku-signals'
2020
gem 'sassc-rails', '~> 2.1.2'
2121
gem 'uglifier', '>= 1.3.0'
22-
gem 'oj', '~> 3.8'
22+
gem 'oj', '~> 3.9'
2323
gem 'pundit'
2424
gem 'sentry-raven'
2525
gem 'readthis'
2626
gem 'hiredis'
2727
gem 'google-api-client'
28-
gem 'addressable', '~> 2.6'
28+
gem 'addressable', '~> 2.7'
2929

3030
# See https://github.com/rails/execjs#readme for more supported runtimes
3131
# gem 'therubyracer', platforms: :ruby
@@ -64,7 +64,7 @@ end
6464
group :test do
6565
gem 'capybara'
6666
gem 'capybara-email'
67-
gem 'webmock', '~> 3.6'
67+
gem 'webmock', '~> 3.7'
6868
end
6969

7070
group :production do
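
The Gemfile changes above all bump `~>` (pessimistic) version constraints. As a minimal sketch of what such a constraint accepts, the snippet below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The helper name `allowed?` is made up for illustration and is not part of this repository.

```ruby
require 'rubygems'

# Hypothetical helper: true when `version` satisfies a Gemfile-style constraint.
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# '~> 4.1' means ">= 4.1 and < 5.0".
puts allowed?('~> 4.1', '4.1.0')   # true
puts allowed?('~> 4.1', '4.0.1')   # false (older than 4.1)
puts allowed?('~> 4.1', '5.0.0')   # false (next major version)

# '~> 5.2.3' is tighter: ">= 5.2.3 and < 5.3".
puts allowed?('~> 5.2.3', '5.2.8') # true
puts allowed?('~> 5.2.3', '5.3.0') # false
```

So moving `puma` from `~> 4.0` to `~> 4.1` only raises the minimum accepted version; the upper bound (below 5.0) stays the same.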

Gemfile.lock

Lines changed: 30 additions & 30 deletions
@@ -42,33 +42,33 @@ GEM
4242
i18n (>= 0.7, < 2)
4343
minitest (~> 5.1)
4444
tzinfo (~> 1.1)
45-
addressable (2.6.0)
46-
public_suffix (>= 2.0.2, < 4.0)
45+
addressable (2.7.0)
46+
public_suffix (>= 2.0.2, < 5.0)
4747
arel (9.0.0)
4848
ast (2.4.0)
4949
aws-eventstream (1.0.3)
50-
aws-partitions (1.195.0)
51-
aws-sdk-core (3.61.2)
50+
aws-partitions (1.207.0)
51+
aws-sdk-core (3.65.1)
5252
aws-eventstream (~> 1.0, >= 1.0.2)
5353
aws-partitions (~> 1.0)
5454
aws-sigv4 (~> 1.1)
5555
jmespath (~> 1.0)
5656
aws-sdk-kms (1.24.0)
5757
aws-sdk-core (~> 3, >= 3.61.1)
5858
aws-sigv4 (~> 1.1)
59-
aws-sdk-s3 (1.46.0)
59+
aws-sdk-s3 (1.48.0)
6060
aws-sdk-core (~> 3, >= 3.61.1)
6161
aws-sdk-kms (~> 1)
6262
aws-sigv4 (~> 1.1)
6363
aws-sigv4 (1.1.0)
6464
aws-eventstream (~> 1.0, >= 1.0.2)
65-
bcrypt (3.1.12)
65+
bcrypt (3.1.13)
6666
bindex (0.5.0)
67-
bootsnap (1.4.4)
67+
bootsnap (1.4.5)
6868
msgpack (~> 1.0)
6969
builder (3.2.3)
7070
byebug (11.0.1)
71-
capybara (3.27.0)
71+
capybara (3.28.0)
7272
addressable
7373
mini_mime (>= 0.1.3)
7474
nokogiri (~> 1.8)
@@ -87,15 +87,15 @@ GEM
8787
crass (1.0.4)
8888
declarative (0.0.10)
8989
declarative-option (0.1.0)
90-
devise (4.6.2)
90+
devise (4.7.1)
9191
bcrypt (~> 3.0)
9292
orm_adapter (~> 0.1)
93-
railties (>= 4.1.0, < 6.0)
93+
railties (>= 4.1.0)
9494
responders
9595
warden (~> 1.2.3)
96-
dotenv (2.7.4)
97-
dotenv-rails (2.7.4)
98-
dotenv (= 2.7.4)
96+
dotenv (2.7.5)
97+
dotenv-rails (2.7.5)
98+
dotenv (= 2.7.5)
9999
railties (>= 3.2, < 6.1)
100100
erubi (1.8.0)
101101
execjs (2.7.0)
@@ -152,15 +152,15 @@ GEM
152152
mini_portile2 (2.4.0)
153153
minitest (5.11.3)
154154
mono_logger (1.1.0)
155-
msgpack (1.2.10)
155+
msgpack (1.3.1)
156156
multi_json (1.13.1)
157157
multi_xml (0.6.0)
158158
multipart-post (2.1.1)
159159
mustermann (1.0.3)
160-
nio4r (2.4.0)
160+
nio4r (2.5.1)
161161
nokogiri (1.10.4)
162162
mini_portile2 (~> 2.4.0)
163-
oj (3.8.1)
163+
oj (3.9.1)
164164
orm_adapter (0.5.0)
165165
os (1.0.1)
166166
parallel (1.17.0)
@@ -177,10 +177,10 @@ GEM
177177
method_source (~> 0.9.0)
178178
pry-rails (0.3.9)
179179
pry (>= 0.10.4)
180-
public_suffix (3.1.1)
181-
puma (4.0.1)
180+
public_suffix (4.0.1)
181+
puma (4.1.0)
182182
nio4r (~> 2.0)
183-
pundit (2.0.1)
183+
pundit (2.1.0)
184184
activesupport (>= 3.0.0)
185185
rack (2.0.7)
186186
rack-cors (1.0.3)
@@ -204,7 +204,7 @@ GEM
204204
rails-dom-testing (2.0.3)
205205
activesupport (>= 4.2.0)
206206
nokogiri (>= 1.6)
207-
rails-html-sanitizer (1.0.4)
207+
rails-html-sanitizer (1.2.0)
208208
loofah (~> 2.2, >= 2.2.2)
209209
railties (5.2.3)
210210
actionpack (= 5.2.3)
@@ -213,7 +213,7 @@ GEM
213213
rake (>= 0.8.7)
214214
thor (>= 0.19.0, < 2.0)
215215
rainbow (3.0.0)
216-
rake (12.3.2)
216+
rake (12.3.3)
217217
rb-fsevent (0.10.3)
218218
rb-inotify (0.10.0)
219219
ffi (~> 1.0)
@@ -228,9 +228,9 @@ GEM
228228
declarative (< 0.1.0)
229229
declarative-option (< 0.2.0)
230230
uber (< 0.2.0)
231-
responders (2.4.1)
232-
actionpack (>= 4.2.0, < 6.0)
233-
railties (>= 4.2.0, < 6.0)
231+
responders (3.0.0)
232+
actionpack (>= 5.0)
233+
railties (>= 5.0)
234234
resque (2.0.0)
235235
mono_logger (~> 1.0)
236236
multi_json (~> 1.0)
@@ -305,7 +305,7 @@ GEM
305305
activemodel (>= 5.0)
306306
bindex (>= 0.4.0)
307307
railties (>= 5.0)
308-
webmock (3.6.2)
308+
webmock (3.7.0)
309309
addressable (>= 2.3.6)
310310
crack (>= 0.3.2)
311311
hashdiff (>= 0.4.0, < 2.0.0)
@@ -319,8 +319,8 @@ PLATFORMS
319319
ruby
320320

321321
DEPENDENCIES
322-
addressable (~> 2.6)
323-
aws-sdk-s3 (~> 1.46)
322+
addressable (~> 2.7)
323+
aws-sdk-s3 (~> 1.48)
324324
bootsnap (>= 1.3.1)
325325
byebug
326326
capybara
@@ -332,11 +332,11 @@ DEPENDENCIES
332332
httparty
333333
jwt (~> 2.2)
334334
listen (~> 3.1)
335-
oj (~> 3.8)
335+
oj (~> 3.9)
336336
pg (~> 1.1)
337337
postmark-rails
338338
pry-rails
339-
puma (~> 4.0)
339+
puma (~> 4.1)
340340
pundit
341341
rack-cors
342342
rails (~> 5.2.3)
@@ -353,7 +353,7 @@ DEPENDENCIES
353353
tzinfo-data
354354
uglifier (>= 1.3.0)
355355
web-console (>= 3.3.0)
356-
webmock (~> 3.6)
356+
webmock (~> 3.7)
357357

358358
RUBY VERSION
359359
ruby 2.6.3p62

README.md

Lines changed: 74 additions & 7 deletions
@@ -2,13 +2,22 @@
22

33
# web-monitoring-db
44

5-
This repository is the database and API underlying the EDGI [Web Monitoring Project](https://github.com/edgi-govdata-archiving/web-monitoring).
5+
This repository is the database and API underlying the EDGI [Web Monitoring Project](https://github.com/edgi-govdata-archiving/web-monitoring). It’s a Rails app that:
66

7-
It’s a Rails app that:
7+
- Acts as a database of monitored pages and captured versions of those pages over time.
88

9-
- Acts as a database of monitored pages and revisions that have been made to them
10-
- Allows other services to add new tracked pages/versions (we are currently focused on Versionista, but this database will soon host data from other sources, such as the Internet Archive)
11-
- Provides an API to get that version data and allow analysts or other automated tools to annotate those versions with metadata
9+
*(The application does not record new versions itself, but relies on importing data from external services, like [the Internet Archive](https://archive.org) or [Versionista](https://versionista.com). See [“How Data Gets Loaded”](#how-data-gets-loaded) below for more.)*
10+
11+
- Provides an API to get that page and version data, and to allow analysts or other automated tools to annotate those versions with metadata about what has changed from version to version.
12+
13+
For more about how data is modeled in this project, see [“Data Model”](#data-model) below.
14+
15+
API documentation is available from the homepage of the application, e.g. by pointing your browser to http://localhost:3000/ or https://api.monitoring.envirodatagov.org. It’s generated from our OpenAPI docs in [`swagger.yml`](./swagger.yml).
16+
17+
We maintain a publicly available *staging server* at https://api-staging.monitoring.envirodatagov.org that you can test against. It runs the latest code and has non-production data — it’s safe to modify or post new versions or annotations to, but you should not rely on that data sticking around; it may get reset at any time. **For access, ask for an account on Slack or use the public user credentials:**
18+
19+
- Username: `public.access@envirodatagov.org`
20+
- Password: `PUBLIC_ACCESS`
1221

1322

1423
## Installation
@@ -175,7 +184,7 @@ It’s a Rails app that:
175184
- `analysis`: Auto-analyze changes between versions and create annotations with the results.
176185

177186

178-
## Manual Postgres Setup
187+
### Manual Postgres Setup
179188

180189
If you don’t want to populate your DB with seed data, want to manage creation of the database yourself, or otherwise manually do database setup, run any of the following commands as desired instead of `rake db:setup`:
181190

@@ -197,7 +206,7 @@ User.create(
197206
```
198207

199208

200-
## Docker
209+
### Docker
201210

202211
The Dockerfile runs the rails server on port 3000 in the container. To build
203212
and run:
@@ -212,6 +221,64 @@ docker run -p 6379:6379 envirodgi/db-import-worker -e <ENVIRONMENT VARIABLES> .
212221
Point your browser or ``curl`` at ``http://localhost:3000``.
213222

214223

224+
## Data Model
225+
226+
The database models three main types of data:
227+
228+
- **Pages**, which represent a page on the internet. Pages are identified by a unique ID rather than their URL because pages can move or be available from multiple URLs. *(Note: we don't actually model that yet, though! See [#492](https://github.com/edgi-govdata-archiving/web-monitoring-db/issues/492) for more.)*
229+
230+
- **Versions**, which represent a particular page at a particular point in time. We use the term “version” instead of others more common in the archival space because we attempt to only represent *different* versions. That is, if a page changed on Wednesday and we captured copies of it on Monday, Tuesday, and Wednesday, we only make version records for Monday and Wednesday (because Tuesday was the same as Monday).
231+
232+
*(Note: because of technical issues around imported data, we often store more versions than we should according to the above definition [e.g. we might still have a record for Tuesday]. Versions have a `different` field that indicates whether a version is different from the previous one, and the API only returns versions that are `different` unless you explicitly request otherwise.)*
233+
234+
- **Annotations**, which represent an analysis of what’s changed between any two *versions* of a *page*. Annotations have specialized `priority` and `significance` fields, which are numbers between 0 and 1; an `author`, indicating who made the analysis (it could be a bot account); and an `annotation` field, which is a JSON object with no specified structure (inside this field, annotations can include any data desired).
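
To make the `different` field on versions more concrete, here is a minimal sketch of how such a flag could be derived from a series of captures. It assumes that "different" means the response body's hash differs from the previous capture's hash. The `Capture` struct and `mark_different` method are hypothetical and are not the application's actual model or code.

```ruby
require 'digest'

# Hypothetical stand-in for a version record; not the app's real model.
Capture = Struct.new(:captured_on, :body, :different)

# Flags each capture as `different` when its body hash differs from the
# previous capture's hash. The first capture always counts as different.
def mark_different(captures)
  previous_hash = nil
  captures.each do |capture|
    hash = Digest::SHA256.hexdigest(capture.body)
    capture.different = (hash != previous_hash)
    previous_hash = hash
  end
end

captures = [
  Capture.new('Monday', '<p>old</p>'),
  Capture.new('Tuesday', '<p>old</p>'),
  Capture.new('Wednesday', '<p>new</p>')
]
mark_different(captures)

# Only the captures that introduced a change remain.
puts captures.select(&:different).map(&:captured_on).inspect
# ["Monday", "Wednesday"]
```

This mirrors the Monday/Tuesday/Wednesday example: Tuesday's record may still exist, but it is flagged as not different and would be filtered out by default.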
235+
236+
There are several other kinds of objects, but they are subservient to the ones above:
237+
238+
- **Changes**, which serve to connect any two *versions* of a *page*. *Annotations* are actually connected to *changes*, rather than directly to two *versions*. You can also generate diffs for a given *change*.
239+
240+
- **Tags**, which can be applied to pages. They help sort and categorize things. Most tags are manually applied, but the application auto-generates a few:
241+
- `domain:<domain name>`, e.g. `domain:www.epa.gov` for a page at `https://www.epa.gov/citizen-science`
242+
- `2l-domain:<second-level domain name>` e.g. `2l-domain:epa.gov` for a page at `https://www.epa.gov/citizen-science`
243+
244+
- **Maintainers**, which can be applied to pages. They represent organizations that maintain a given page. For example, the page at `https://www.epa.gov/citizen-science` is maintained by `EPA`.
245+
246+
- **Imports** model requests to import new data and the results of the import operation.
247+
248+
- **Users** model people (both humans and bots) who can view, import, and annotate data. You currently have to have a user account to do anything in the application, though we hope accounts will not be needed to view public data in the future.
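
The auto-generated `domain:` and `2l-domain:` tags described above can be sketched as follows. This is an illustration only: the method name `auto_tags` is hypothetical, and the second-level domain is approximated as the last two labels of the host, which is too naive for hosts such as `example.co.uk`. The application may well use a public suffix list instead.

```ruby
require 'uri'

# Hypothetical helper: builds the two auto-generated tag names for a page URL.
# Assumption: the second-level domain is the last two dot-separated labels.
def auto_tags(page_url)
  host = URI.parse(page_url).host.downcase
  second_level = host.split('.').last(2).join('.')
  ["domain:#{host}", "2l-domain:#{second_level}"]
end

puts auto_tags('https://www.epa.gov/citizen-science').inspect
# ["domain:www.epa.gov", "2l-domain:epa.gov"]
```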
249+
250+
The actual database schema for each of these tables is listed in [`db/schema.rb`](./db/schema.rb).
251+
252+
253+
### How Data Gets Loaded
254+
255+
The web-monitoring-db project does not actually monitor or scrape pages on the web. Instead, we rely on importing data from other services, like [the Internet Archive](https://archive.org). Each day, a script queries other services for historical snapshots and sends the results to the `/api/v0/imports` endpoint.
256+
257+
Most of the data sent to `/api/v0/imports` matches up directly with the structure of the [`Version` model](./db/schema.rb). However, the `uri` field in an import is treated specially.
258+
259+
When new page or version data is imported, the `uri` field points to a location where the raw HTTP response body can be retrieved. If the `uri` host matches one of the values in the [`ALLOWED_ARCHIVE_HOSTS` environment variable](./.env.example), the version record that gets added to the database will simply point to that external location as a source of raw response data. Otherwise, the application downloads the data from `uri` and stores it in its `FileStorage`.
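
The decision described above can be sketched roughly as follows. This assumes that "matching" means the `uri` starts with one of the space-separated values in `ALLOWED_ARCHIVE_HOSTS` (the example values in `.env.example` are URL prefixes, some with a path). The method names are hypothetical; the real logic lives in `lib/archiver` and may differ.

```ruby
# Example value, shortened from `.env.example`. In the application this would
# come from the ALLOWED_ARCHIVE_HOSTS environment variable.
SETTING = 'https://edgi-web-monitoring-db.s3.amazonaws.com/ ' \
          'https://s3-us-west-2.amazonaws.com/edgi-wm-versionista/'

# Splits the space-separated setting into a list of URL prefixes.
def allowed_prefixes(setting)
  setting.to_s.split(/\s+/).reject(&:empty?)
end

# Hypothetical check: true when the data at `uri` should be downloaded and
# stored by the application, i.e. no allowed prefix matches.
def rehost?(uri, prefixes)
  prefixes.none? { |prefix| uri.start_with?(prefix) }
end

prefixes = allowed_prefixes(SETTING)

# Already at an allowed location: the version record just points to it.
puts rehost?('https://edgi-web-monitoring-db.s3.amazonaws.com/abc123', prefixes)
# false

# Anywhere else: download from `uri` and store the body.
puts rehost?('https://example.com/snapshots/abc123', prefixes)
# true
```

Prefix matching is used here rather than a bare hostname comparison so that path-style S3 URLs only match their own bucket path.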
260+
261+
The intent is to make sure data winds up at a reliably available location, ensuring that anyone who can access the API can also access the raw response body for any version. Hosts should be listed in `ALLOWED_ARCHIVE_HOSTS` if they meet this criterion better than the application’s own file storage. The application’s storage area can be the local disk or it can be S3, depending on configuration. The storage component is pluggable, so we can support other storage types or locations in the future.
262+
263+
You can see more about this process in:
264+
- The overview repo’s [“architecture” document](https://github.com/edgi-govdata-archiving/web-monitoring/blob/master/ARCHITECTURE.md#web-page-snapshottingcapturing-workflow)
265+
- The [import job code](./app/jobs/import_versions_job.rb), where imports are processed.
266+
- The [`Archiver` module code](./lib/archiver/archiver.rb), where raw HTTP response data is saved.
267+
268+
269+
### File Storage
270+
271+
The application needs to store files for several different purposes (storing raw import data, archiving HTTP response bodies as described in the previous section, specialized logs, etc.). To do this, it uses the [`FileStorage`](https://github.com/edgi-govdata-archiving/web-monitoring-db/tree/master/lib/file_storage) module, which has different implementations for different types of storage, such as [the local disk](https://github.com/edgi-govdata-archiving/web-monitoring-db/blob/master/lib/file_storage/local_file.rb) or [Amazon S3](https://github.com/edgi-govdata-archiving/web-monitoring-db/blob/master/lib/file_storage/s3.rb).
272+
273+
Currently, the application creates two `FileStorage` instances:
274+
275+
1. “Archival storage” is used to store raw HTTP response bodies for each version of a page. See the [“how data gets loaded” section](#how-data-gets-loaded) for more details. Under a default configuration, this is your local disk in development and S3 in production. You can configure the S3 bucket used for it with the `AWS_ARCHIVE_BUCKET` environment variable. **Everything in this storage area is publicly available.**
276+
277+
2. “Working storage” is used to store internal data, such as raw import data and import logs. Under a default configuration, this is your local disk in development and S3 in production. You can configure the S3 bucket used for it with the `AWS_WORKING_BUCKET` environment variable. **Everything in this storage area should be considered private and you should not expose it to the public web.**
278+
279+
3. For historical reasons, EDGI’s deployment includes a third S3 bucket that is not directly accessed by the application. It’s where we store HTTP response bodies collected from [Versionista](https://versionista.com), a service we previously used for scraping government web pages. You can see it listed in [the example settings for `ALLOWED_ARCHIVE_HOSTS`](https://github.com/edgi-govdata-archiving/web-monitoring-db/blob/master/.env.example).
280+
281+
215282
## Code of Conduct
216283
217284
This repository falls under EDGI's [Code of Conduct](https://github.com/edgi-govdata-archiving/overview/blob/master/CONDUCT.md).
