Skip to content

Commit afa4988

Browse files
authored
Add PageUrl model to describe pages with multiple URLs (#793)
This finally implements something we've wanted to do for a while: separate the concept of a "page" from that of a URL. Pages now have many URLs, which live in the `page_urls` table (or the `PageUrl` model). Each PageUrl can also have notes and a timeframe during which it is valid. Both are filled in manually — these are anecdotal info for and by analysts, not something calculated by the system. This PR also includes functionality to merge pages, since we know we already have several duplicate pages that only exist because we needed to track a page at different URLs over time. There will be a lot of duplicate pages to remediate after we deploy this. Finally, this makes a small but major change to the relationship between Pages and Versions: Versions no longer *have* belong to a Page (although they will continue to do so in the vast majority of circumstances). As we move into a multiple-URLs-per-page world, we know we have versions of a particular URL that no longer belong to a page (for example, because the page was moved, and the old URL now produces 404 responses even though the page is available elsewhere) and, in order to produce a coherent timeline, should be removed from the page (and don't have another page they should really belong to). This change makes that possible. Conceptually, we should really stop thinking of Versions as "versions of a page" and more like "snapshots of a URL," which is a way of thinking we've been slowly moving towards for a while.
1 parent ae86f9b commit afa4988

21 files changed

+1102
-49
lines changed

app/controllers/api/v0/pages_controller.rb

Lines changed: 13 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -71,7 +71,16 @@ def index
7171
end
7272

7373
def show
74-
page = Page.find(params[:id])
74+
begin
75+
page = Page.find(params[:id])
76+
rescue ActiveRecord::RecordNotFound
77+
merge = MergedPage.find(params[:id])
78+
redirect_to(
79+
api_v0_page_url(merge.target_uuid),
80+
status: :permanent_redirect
81+
) and return
82+
end
83+
7584
data = page.as_json(include: [:maintainers, :tags])
7685
if should_allow_versions
7786
data['versions'] = page.versions.where(different: true).as_json
@@ -155,11 +164,12 @@ def page_collection
155164

156165
if params[:url]
157166
query = params[:url]
167+
collection = collection.joins(:urls)
158168
if query.include? '*'
159169
query = query.gsub('%', '\%').gsub('_', '\_').tr('*', '%')
160-
collection = collection.where('url LIKE ?', query)
170+
collection = collection.where('page_urls.url LIKE ?', query)
161171
else
162-
collection = collection.where(url: query)
172+
collection = collection.where('page_urls.url = ?', query)
163173
end
164174
end
165175

Lines changed: 76 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,76 @@
1+
class Api::V0::UrlsController < Api::V0::ApiController
2+
def index
3+
urls = page.urls.order('page_urls.to_time DESC')
4+
5+
render json: {
6+
links: { page: api_v0_page_url(page) },
7+
data: urls
8+
}
9+
end
10+
11+
def show
12+
@page_url ||= page.urls.find(params[:id])
13+
render json: {
14+
links: {
15+
page: api_v0_page_url(page),
16+
page_urls: api_v0_page_urls_url(page)
17+
},
18+
data: @page_url
19+
}
20+
end
21+
22+
def create
23+
@page_url = page.urls.create!(url_params)
24+
show
25+
rescue ActiveRecord::RecordNotUnique
26+
raise Api::ResourceExistsError, 'This page already has the given URL and timeframe'
27+
end
28+
29+
def update
30+
updates = url_params
31+
if updates.key?(:url)
32+
raise Api::UnprocessableError, 'You cannot change a URL\'s `url`'
33+
end
34+
35+
@page_url ||= page.urls.find(params[:id])
36+
@page_url.update(url_params)
37+
show
38+
end
39+
40+
def destroy
41+
@page_url ||= page.urls.find(params[:id])
42+
# You cannot delete the canonical URL.
43+
if @page_url.url == page.url
44+
raise Api::UnprocessableError, 'You cannot remove the page\'s canonical URL'
45+
else
46+
@page_url.destroy
47+
redirect_to(api_v0_page_urls_url(page))
48+
end
49+
end
50+
51+
protected
52+
53+
def page
54+
@page ||= Page.find(params[:page_id])
55+
end
56+
57+
def url_params
58+
result = params
59+
.require(:page_url)
60+
.permit(:url, :from_time, :to_time, :notes)
61+
62+
result.slice('from_time', 'to_time').each do |key, value|
63+
result[key] = parse_time(key, value)
64+
end
65+
66+
result
67+
end
68+
69+
def parse_time(field, time_input)
70+
return if time_input.nil?
71+
72+
Time.parse(time_input)
73+
rescue ArgumentError
74+
raise Api::UnprocessableError, "`#{field}` was not a valid time or `null`"
75+
end
76+
end

app/jobs/import_versions_job.rb

Lines changed: 18 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -173,10 +173,13 @@ def page_for_record(record, create: true, row:)
173173
validate_kind!([String], record, 'page_url')
174174
validate_kind!([Array, NilClass], record, 'page_maintainers')
175175
validate_kind!([Array, NilClass], record, 'page_tags')
176+
validate_present!(record, 'capture_time')
177+
validate_kind!([String], record, 'capture_time')
176178

177179
url = record['page_url']
178180

179-
existing_page = Page.find_by_url(url)
181+
capture_time = Time.parse(record['capture_time'])
182+
existing_page = Page.find_by_url(url, at_time: capture_time)
180183
page = if existing_page
181184
log(object: existing_page, operation: :found, row: row)
182185
existing_page
@@ -191,6 +194,20 @@ def page_for_record(record, create: true, row:)
191194
(record['page_maintainers'] || []).each {|name| page.add_maintainer(name)}
192195
(record['page_tags'] || []).each {|name| page.add_tag(name)}
193196

197+
# If the page was not an *exact* URL match, add the URL to the page.
198+
# (`page.find_by_url` used above will match by `url_key`, too.)
199+
page.urls.find_or_create_by(url: url)
200+
# TODO: Add URLs from redirects automatically. The main blocker for
201+
# this at the moment is the following situation:
202+
#
203+
# Two pages, A=https://example.com/about
204+
# B=https://example.com/about/locations
205+
# Page B is removed, but instead of returning a 404 or 403 status
206+
# code, it starts redirecting to Page A.
207+
#
208+
# This is unfortunately not uncommon, so we need some heuristics to
209+
# account for it, e.g. URL does not already belong to another page.
210+
194211
page
195212
end
196213

app/models/concerns/taggable.rb

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2,8 +2,8 @@ module Taggable
22
extend ActiveSupport::Concern
33

44
included do
5-
has_many :taggings, as: :taggable, foreign_key: 'taggable_uuid'
6-
has_many :tags, through: :taggings
5+
has_many :taggings, as: :taggable, foreign_key: 'taggable_uuid', dependent: :delete_all
6+
has_many :tags, through: :taggings, dependent: nil
77
end
88

99
def add_tag(tag)

app/models/merged_page.rb

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
# MergedPage keeps track of pages that were merged into others so we can
2+
# support old links by redirecting to the page they were merged into.
3+
# - The primary key is the ID of the page that was merged and removed
4+
# - `target_uuid` is the ID of the page it was merged into
5+
# - `audit_data` is any useful JSON data about the page (usually a frozen
6+
# copy of its attributes).
7+
class MergedPage < ApplicationRecord
8+
include UuidPrimaryKey
9+
10+
belongs_to :target,
11+
class_name: 'Page',
12+
foreign_key: :target_uuid,
13+
required: true
14+
end

app/models/page.rb

Lines changed: 148 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,11 @@ class Page < ApplicationRecord
1515
has_many :versions,
1616
-> { order(capture_time: :desc) },
1717
foreign_key: 'page_uuid',
18-
inverse_of: :page
18+
inverse_of: :page,
19+
# You must explcitly dissociate or move versions before destroying.
20+
# It's OK for a version to be orphaned from all pages, but we want
21+
# to make sure that's an intentional action and not accidental.
22+
dependent: :restrict_with_error
1923
has_one :earliest,
2024
(lambda do
2125
# DISTINCT ON requires the first ORDER to be the distinct column(s)
@@ -42,9 +46,18 @@ class Page < ApplicationRecord
4246
class_name: 'Version'
4347
# This needs a funky name because `changes` is a an activerecord method
4448
has_many :tracked_changes, through: :versions
49+
has_many :urls,
50+
class_name: 'PageUrl',
51+
foreign_key: 'page_uuid',
52+
inverse_of: :page,
53+
dependent: :delete_all
54+
has_many :current_urls,
55+
-> { current },
56+
class_name: 'PageUrl',
57+
foreign_key: 'page_uuid'
4558

46-
has_many :maintainerships, foreign_key: :page_uuid
47-
has_many :maintainers, through: :maintainerships
59+
has_many :maintainerships, foreign_key: :page_uuid, dependent: :delete_all
60+
has_many :maintainers, through: :maintainerships, dependent: nil
4861

4962
scope(:needing_status_update, lambda do
5063
# NOTE: pages.status can be NULL, so use DISTINCT FROM instead of <>/!= to compare.
@@ -56,14 +69,30 @@ class Page < ApplicationRecord
5669
before_create :ensure_url_key
5770
after_create :ensure_domain_and_news_tags
5871
before_save :normalize_url
72+
after_save :ensure_page_urls
5973
validate :url_must_have_domain
6074
validates :status,
6175
allow_nil: true,
6276
inclusion: { in: 100...600, message: 'is not between 100 and 599' }
6377

64-
def self.find_by_url(raw_url)
78+
def self.find_by_url(raw_url, at_time: nil)
6579
url = normalize_url(raw_url)
66-
Page.find_by(url: url) || Page.find_by(url_key: create_url_key(url))
80+
81+
current = PageUrl.eager_load(:page).current(at_time)
82+
found = current.find_by(url: url)
83+
return found.page if found
84+
85+
key = PageUrl.create_url_key(url)
86+
found = current.find_by(url_key: key)
87+
return found.page if found
88+
89+
with_pages = PageUrl.eager_load(:page).order(to_time: :desc)
90+
found = with_pages.find_by(url: url) ||
91+
with_pages.find_by(url_key: key)
92+
return found.page if found
93+
94+
# TODO: remove this fallback when data is migrated over to Page.urls.
95+
Page.find_by(url: url) || Page.find_by(url_key: key)
6796
end
6897

6998
def self.normalize_url(url)
@@ -76,10 +105,6 @@ def self.normalize_url(url)
76105
end
77106
end
78107

79-
def self.create_url_key(url)
80-
Surt.surt(url)
81-
end
82-
83108
def add_maintainer(maintainer)
84109
unless maintainer.is_a?(Maintainer)
85110
maintainer = Maintainer.find_or_create_by(name: maintainer)
@@ -132,7 +157,7 @@ def as_json(options = {})
132157
end
133158

134159
def update_url_key
135-
update(url_key: Page.create_url_key(url))
160+
update(url_key: PageUrl.create_url_key(url))
136161
end
137162

138163
def ensure_domain_and_news_tags
@@ -141,20 +166,89 @@ def ensure_domain_and_news_tags
141166
self.add_tag('news') if news?
142167
end
143168

169+
# Keep page creation relatively simple by automatically creating a PageUrl
170+
# for the page's current URL when creating a page. (Page#url is the current
171+
# canonical Url of the page, the true list of URLs associated with the page
172+
# should always be the list of PageUrls in Page#urls).
173+
def ensure_page_urls
174+
urls.find_or_create_by!(url: url) if saved_change_to_attribute?('url')
175+
end
176+
144177
def update_status
145178
new_status = calculate_status
146179
self.update(status: new_status) unless new_status.zero?
147180
self.status
148181
end
149182

183+
def merge(*other_pages)
184+
Page.transaction do
185+
first_version_time = nil
186+
other_pages.each do |other|
187+
audit_data = other.create_audit_record
188+
189+
# Move versions from other page.
190+
other.versions.to_a.each do |version|
191+
self.versions << version
192+
if first_version_time.nil? || first_version_time > version.capture_time
193+
first_version_time = version.capture_time
194+
end
195+
end
196+
# The above doesn't update the source page's `versions`, so reload.
197+
other.versions.reload
198+
199+
# Copy other attributes from other page.
200+
other.tags.each {|tag| add_tag(tag)}
201+
other.maintainers.each {|maintainer| add_maintainer(maintainer)}
202+
other.urls.each do |page_url|
203+
# TODO: it would be slightly nicer to collapse/merge PageUrls with
204+
# overlapping or intersecting time ranges here.
205+
# NOTE: we have to be careful not to trip the DB's uniqueness
206+
# constraints here or we hose the transaction.
207+
just_created = false
208+
merged_url = urls.find_or_create_by(
209+
url: page_url.url,
210+
from_time: page_url.from_time,
211+
to_time: page_url.to_time
212+
) do |new_url|
213+
new_url.notes = page_url.notes
214+
just_created = true
215+
end
216+
217+
unless just_created
218+
new_notes = [merged_url.notes, page_url.notes].compact.join(' ')
219+
merged_url.update(notes: new_notes)
220+
end
221+
end
222+
223+
# Keep a record so we can redirect requests for the merged page.
224+
# Delete the actual page record rather than keep it around so we don't
225+
# have to worry about messy partial indexes and querying around URLs.
226+
MergedPage.create!(uuid: other.uuid, target: self, audit_data: audit_data)
227+
# If the page we're removing was previously a merge target, update
228+
# its references.
229+
MergedPage.where(target_uuid: other.uuid).update_all(target_uuid: self.uuid)
230+
# And finally drop the merged page.
231+
other.destroy!
232+
233+
audit_json = Oj.dump(audit_data, mode: :rails)
234+
Rails.logger.info("Merged page #{other.uuid} into #{uuid}. Old data: #{audit_json}")
235+
end
236+
237+
# Recalculate denormalized attributes
238+
update_page_title(first_version_time)
239+
update_versions_different(first_version_time)
240+
# TODO: it might be neat to clean up overlapping URL timeframes
241+
end
242+
end
243+
150244
protected
151245

152246
def news?
153247
url.include?('/news') || url.include?('/blog') || url.include?('/press')
154248
end
155249

156250
def ensure_url_key
157-
self.url_key ||= Page.create_url_key(url)
251+
self.url_key ||= PageUrl.create_url_key(url)
158252
end
159253

160254
def normalize_url
@@ -211,4 +305,47 @@ def calculate_status(relative_to: nil)
211305
success_rate = 1 - (error_time.to_f / total_time)
212306
success_rate < STATUS_SUCCESS_THRESHOLD ? latest_error : 200
213307
end
308+
309+
def update_page_title(from_time = nil)
310+
candidates = versions.reorder(capture_time: :desc)
311+
candidates = candidates.where('capture_time >= ?', from_time) if from_time
312+
candidates.each do |version|
313+
break if version.sync_page_title
314+
end
315+
end
316+
317+
# TODO: figure out whether there's a reasonable way to merge this logic with
318+
# `Version#update_different_attribute`.
319+
def update_versions_different(from_time)
320+
previous_hash = nil
321+
candidates = versions
322+
.where('capture_time >= ?', from_time)
323+
.reorder(capture_time: :asc)
324+
325+
candidates.each do |version|
326+
if previous_hash.nil?
327+
previous_hash = version.previous(different: false).try(:version_hash)
328+
end
329+
330+
version.update(different: version.version_hash != previous_hash)
331+
previous_hash = version.version_hash
332+
end
333+
end
334+
335+
# Creates a hash representing the current state of a page. Used for logging
336+
# and other audit related purposes when deleting/merging pages.
337+
def create_audit_record
338+
# URLs are entirely unique to the page, so have to be recorded
339+
# completely rather than referenced by ID.
340+
urls_data = urls.collect do |page_url|
341+
page_url.attributes.slice('url', 'from_time', 'to_time', 'notes')
342+
end
343+
344+
attributes.merge({
345+
'tags' => tags.collect(&:name),
346+
'maintainers' => maintainers.collect(&:name),
347+
'versions' => versions.collect(&:uuid),
348+
'urls' => urls_data
349+
})
350+
end
214351
end

0 commit comments

Comments
 (0)