Skip to content

Add PageUrl model to describe pages with multiple URLs #793

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Dec 13, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
f455f5b
Add hack to fix ActiveRecord schema.rb bug
Mr0grog Dec 5, 2020
a7697fa
Add migration and new `PageUrl` model
Mr0grog Dec 5, 2020
baeb01c
Add tests for/clean up logic in `Page.find_by_url`
Mr0grog Dec 5, 2020
4201ba6
Update `Page.urls` *any* time `Page.url` changes
Mr0grog Dec 7, 2020
6e4d4f8
Fall back to old behavior for `Page.find_by_url`
Mr0grog Dec 7, 2020
90d63a3
Add data migration for populating `page.urls`
Mr0grog Dec 7, 2020
1284a39
Add `PageUrl` tests, make `PageUrl#url` immutable
Mr0grog Dec 7, 2020
a54f5ba
OBEY RUBOCOP
Mr0grog Dec 7, 2020
8712e8f
Add page merging
Mr0grog Dec 9, 2020
210f249
OBEY RUBOCOP
Mr0grog Dec 9, 2020
2e81808
Fix maddening value casting bug in ActiveRecord
Mr0grog Dec 10, 2020
710bd0b
Fail harder when ensuring we have PageUrl records
Mr0grog Dec 10, 2020
da25c6d
Fix tests on CI
Mr0grog Dec 10, 2020
4e4276c
Try fixing CI tests again
Mr0grog Dec 10, 2020
0fa85ed
Add rake task to merge pages
Mr0grog Dec 10, 2020
5df107a
Add new page URLs during import
Mr0grog Dec 10, 2020
7f23115
Wrap `Page#merge` in transaction
Mr0grog Dec 11, 2020
1a7530f
Allow versions to be orphaned from pages
Mr0grog Dec 11, 2020
b45909b
Delete merged pages, list in a table for redirects
Mr0grog Dec 11, 2020
248649d
Redirect API requests for merged page to new page
Mr0grog Dec 11, 2020
9dd8f43
OBEY RUBOCOP
Mr0grog Dec 11, 2020
0b1bcf4
Expose PageUrls in API at `/api/v0/pages/:id/urls`
Mr0grog Dec 11, 2020
d1582cd
Query URLs in API from page_urls instead of page
Mr0grog Dec 11, 2020
60a23f5
When importing prefer URLs valid at *capture* time
Mr0grog Dec 11, 2020
b98c19a
Fix foreign key bug when merging, log data as JSON
Mr0grog Dec 13, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 13 additions & 3 deletions app/controllers/api/v0/pages_controller.rb
Original file line number Diff line number Diff line change
Expand Up @@ -71,7 +71,16 @@ def index
end

def show
page = Page.find(params[:id])
begin
page = Page.find(params[:id])
rescue ActiveRecord::RecordNotFound
merge = MergedPage.find(params[:id])
redirect_to(
api_v0_page_url(merge.target_uuid),
status: :permanent_redirect
) and return
end

data = page.as_json(include: [:maintainers, :tags])
if should_allow_versions
data['versions'] = page.versions.where(different: true).as_json
Expand Down Expand Up @@ -155,11 +164,12 @@ def page_collection

if params[:url]
query = params[:url]
collection = collection.joins(:urls)
if query.include? '*'
query = query.gsub('%', '\%').gsub('_', '\_').tr('*', '%')
collection = collection.where('url LIKE ?', query)
collection = collection.where('page_urls.url LIKE ?', query)
else
collection = collection.where(url: query)
collection = collection.where('page_urls.url = ?', query)
end
end

Expand Down
76 changes: 76 additions & 0 deletions app/controllers/api/v0/urls_controller.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
class Api::V0::UrlsController < Api::V0::ApiController
def index
urls = page.urls.order('page_urls.to_time DESC')

render json: {
links: { page: api_v0_page_url(page) },
data: urls
}
end

def show
@page_url ||= page.urls.find(params[:id])
render json: {
links: {
page: api_v0_page_url(page),
page_urls: api_v0_page_urls_url(page)
},
data: @page_url
}
end

def create
@page_url = page.urls.create!(url_params)
show
rescue ActiveRecord::RecordNotUnique
raise Api::ResourceExistsError, 'This page already has the given URL and timeframe'
end

def update
updates = url_params
if updates.key?(:url)
raise Api::UnprocessableError, 'You cannot change a URL\'s `url`'
end

@page_url ||= page.urls.find(params[:id])
@page_url.update(url_params)
show
end

def destroy
@page_url ||= page.urls.find(params[:id])
# You cannot delete the canonical URL.
if @page_url.url == page.url
raise Api::UnprocessableError, 'You cannot remove the page\'s canonical URL'
else
@page_url.destroy
redirect_to(api_v0_page_urls_url(page))
end
end

protected

def page
@page ||= Page.find(params[:page_id])
end

def url_params
result = params
.require(:page_url)
.permit(:url, :from_time, :to_time, :notes)

result.slice('from_time', 'to_time').each do |key, value|
result[key] = parse_time(key, value)
end

result
end

def parse_time(field, time_input)
return if time_input.nil?

Time.parse(time_input)
rescue ArgumentError
raise Api::UnprocessableError, "`#{field}` was not a valid time or `null`"
end
end
19 changes: 18 additions & 1 deletion app/jobs/import_versions_job.rb
Original file line number Diff line number Diff line change
Expand Up @@ -173,10 +173,13 @@ def page_for_record(record, create: true, row:)
validate_kind!([String], record, 'page_url')
validate_kind!([Array, NilClass], record, 'page_maintainers')
validate_kind!([Array, NilClass], record, 'page_tags')
validate_present!(record, 'capture_time')
validate_kind!([String], record, 'capture_time')

url = record['page_url']

existing_page = Page.find_by_url(url)
capture_time = Time.parse(record['capture_time'])
existing_page = Page.find_by_url(url, at_time: capture_time)
page = if existing_page
log(object: existing_page, operation: :found, row: row)
existing_page
Expand All @@ -191,6 +194,20 @@ def page_for_record(record, create: true, row:)
(record['page_maintainers'] || []).each {|name| page.add_maintainer(name)}
(record['page_tags'] || []).each {|name| page.add_tag(name)}

# If the page was not an *exact* URL match, add the URL to the page.
# (`page.find_by_url` used above will match by `url_key`, too.)
page.urls.find_or_create_by(url: url)
# TODO: Add URLs from redirects automatically. The main blocker for
# this at the moment is the following situation:
#
# Two pages, A=https://example.com/about
# B=https://example.com/about/locations
# Page B is removed, but instead of returning a 404 or 403 status
# code, it starts redirecting to Page A.
#
# This is unfortunately not uncommon, so we need some heuristics to
# account for it, e.g. URL does not already belong to another page.

page
end

Expand Down
4 changes: 2 additions & 2 deletions app/models/concerns/taggable.rb
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,8 @@ module Taggable
extend ActiveSupport::Concern

included do
has_many :taggings, as: :taggable, foreign_key: 'taggable_uuid'
has_many :tags, through: :taggings
has_many :taggings, as: :taggable, foreign_key: 'taggable_uuid', dependent: :delete_all
has_many :tags, through: :taggings, dependent: nil
end

def add_tag(tag)
Expand Down
14 changes: 14 additions & 0 deletions app/models/merged_page.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# MergedPage keeps track of pages that were merged into others so we can
# support old links by redirecting to the page they were merged into.
# - The primary key is the ID of the page that was merged and removed
# - `target_uuid` is the ID of the page it was merged into
# - `audit_data` is any useful JSON data about the page (usually a frozen
# copy of its attributes).
class MergedPage < ApplicationRecord
include UuidPrimaryKey

belongs_to :target,
class_name: 'Page',
foreign_key: :target_uuid,
required: true
end
159 changes: 148 additions & 11 deletions app/models/page.rb
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,11 @@ class Page < ApplicationRecord
has_many :versions,
-> { order(capture_time: :desc) },
foreign_key: 'page_uuid',
inverse_of: :page
inverse_of: :page,
# You must explcitly dissociate or move versions before destroying.
# It's OK for a version to be orphaned from all pages, but we want
# to make sure that's an intentional action and not accidental.
dependent: :restrict_with_error
has_one :earliest,
(lambda do
# DISTINCT ON requires the first ORDER to be the distinct column(s)
Expand All @@ -42,9 +46,18 @@ class Page < ApplicationRecord
class_name: 'Version'
# This needs a funky name because `changes` is a an activerecord method
has_many :tracked_changes, through: :versions
has_many :urls,
class_name: 'PageUrl',
foreign_key: 'page_uuid',
inverse_of: :page,
dependent: :delete_all
has_many :current_urls,
-> { current },
class_name: 'PageUrl',
foreign_key: 'page_uuid'

has_many :maintainerships, foreign_key: :page_uuid
has_many :maintainers, through: :maintainerships
has_many :maintainerships, foreign_key: :page_uuid, dependent: :delete_all
has_many :maintainers, through: :maintainerships, dependent: nil

scope(:needing_status_update, lambda do
# NOTE: pages.status can be NULL, so use DISTINCT FROM instead of <>/!= to compare.
Expand All @@ -56,14 +69,30 @@ class Page < ApplicationRecord
before_create :ensure_url_key
after_create :ensure_domain_and_news_tags
before_save :normalize_url
after_save :ensure_page_urls
validate :url_must_have_domain
validates :status,
allow_nil: true,
inclusion: { in: 100...600, message: 'is not between 100 and 599' }

def self.find_by_url(raw_url)
def self.find_by_url(raw_url, at_time: nil)
url = normalize_url(raw_url)
Page.find_by(url: url) || Page.find_by(url_key: create_url_key(url))

current = PageUrl.eager_load(:page).current(at_time)
found = current.find_by(url: url)
return found.page if found

key = PageUrl.create_url_key(url)
found = current.find_by(url_key: key)
return found.page if found

with_pages = PageUrl.eager_load(:page).order(to_time: :desc)
found = with_pages.find_by(url: url) ||
with_pages.find_by(url_key: key)
return found.page if found

# TODO: remove this fallback when data is migrated over to Page.urls.
Page.find_by(url: url) || Page.find_by(url_key: key)
end

def self.normalize_url(url)
Expand All @@ -76,10 +105,6 @@ def self.normalize_url(url)
end
end

def self.create_url_key(url)
Surt.surt(url)
end

def add_maintainer(maintainer)
unless maintainer.is_a?(Maintainer)
maintainer = Maintainer.find_or_create_by(name: maintainer)
Expand Down Expand Up @@ -132,7 +157,7 @@ def as_json(options = {})
end

def update_url_key
update(url_key: Page.create_url_key(url))
update(url_key: PageUrl.create_url_key(url))
end

def ensure_domain_and_news_tags
Expand All @@ -141,20 +166,89 @@ def ensure_domain_and_news_tags
self.add_tag('news') if news?
end

# Keep page creation relatively simple by automatically creating a PageUrl
# for the page's current URL when creating a page. (Page#url is the current
# canonical Url of the page, the true list of URLs associated with the page
# should always be the list of PageUrls in Page#urls).
def ensure_page_urls
urls.find_or_create_by!(url: url) if saved_change_to_attribute?('url')
end

def update_status
new_status = calculate_status
self.update(status: new_status) unless new_status.zero?
self.status
end

def merge(*other_pages)
Page.transaction do
first_version_time = nil
other_pages.each do |other|
audit_data = other.create_audit_record

# Move versions from other page.
other.versions.to_a.each do |version|
self.versions << version
if first_version_time.nil? || first_version_time > version.capture_time
first_version_time = version.capture_time
end
end
# The above doesn't update the source page's `versions`, so reload.
other.versions.reload

# Copy other attributes from other page.
other.tags.each {|tag| add_tag(tag)}
other.maintainers.each {|maintainer| add_maintainer(maintainer)}
other.urls.each do |page_url|
# TODO: it would be slightly nicer to collapse/merge PageUrls with
# overlapping or intersecting time ranges here.
# NOTE: we have to be careful not to trip the DB's uniqueness
# constraints here or we hose the transaction.
just_created = false
merged_url = urls.find_or_create_by(
url: page_url.url,
from_time: page_url.from_time,
to_time: page_url.to_time
) do |new_url|
new_url.notes = page_url.notes
just_created = true
end

unless just_created
new_notes = [merged_url.notes, page_url.notes].compact.join(' ')
merged_url.update(notes: new_notes)
end
end

# Keep a record so we can redirect requests for the merged page.
# Delete the actual page record rather than keep it around so we don't
# have to worry about messy partial indexes and querying around URLs.
MergedPage.create!(uuid: other.uuid, target: self, audit_data: audit_data)
# If the page we're removing was previously a merge target, update
# its references.
MergedPage.where(target_uuid: other.uuid).update_all(target_uuid: self.uuid)
# And finally drop the merged page.
other.destroy!

audit_json = Oj.dump(audit_data, mode: :rails)
Rails.logger.info("Merged page #{other.uuid} into #{uuid}. Old data: #{audit_json}")
end

# Recalculate denormalized attributes
update_page_title(first_version_time)
update_versions_different(first_version_time)
# TODO: it might be neat to clean up overlapping URL timeframes
end
end

protected

def news?
url.include?('/news') || url.include?('/blog') || url.include?('/press')
end

def ensure_url_key
self.url_key ||= Page.create_url_key(url)
self.url_key ||= PageUrl.create_url_key(url)
end

def normalize_url
Expand Down Expand Up @@ -211,4 +305,47 @@ def calculate_status(relative_to: nil)
success_rate = 1 - (error_time.to_f / total_time)
success_rate < STATUS_SUCCESS_THRESHOLD ? latest_error : 200
end

def update_page_title(from_time = nil)
candidates = versions.reorder(capture_time: :desc)
candidates = candidates.where('capture_time >= ?', from_time) if from_time
candidates.each do |version|
break if version.sync_page_title
end
end

# TODO: figure out whether there's a reasonable way to merge this logic with
# `Version#update_different_attribute`.
def update_versions_different(from_time)
previous_hash = nil
candidates = versions
.where('capture_time >= ?', from_time)
.reorder(capture_time: :asc)

candidates.each do |version|
if previous_hash.nil?
previous_hash = version.previous(different: false).try(:version_hash)
end

version.update(different: version.version_hash != previous_hash)
previous_hash = version.version_hash
end
end

# Creates a hash representing the current state of a page. Used for logging
# and other audit related purposes when deleting/merging pages.
def create_audit_record
# URLs are entirely unique to the page, so have to be recorded
# completely rather than referenced by ID.
urls_data = urls.collect do |page_url|
page_url.attributes.slice('url', 'from_time', 'to_time', 'notes')
end

attributes.merge({
'tags' => tags.collect(&:name),
'maintainers' => maintainers.collect(&:name),
'versions' => versions.collect(&:uuid),
'urls' => urls_data
})
end
end
Loading