Add `PageUrl` model to describe pages with multiple URLs #793

Mr0grog · 2020-12-05T07:33:31Z

This adds support for multiple URLs per page. The url field directly on a Page should now be treated as the current canonical URL, but pages may also have any number of additional URLs associated with them. Page + URL combinations may have a timeframe (>= PageUrl.start_time and < PageUrl.end_time) describing when they are valid and a notes field that are meant to mostly carry anecdotal information. The timeframe does have one technical purpose: if two pages occupied the same URL, we prefer the most current when adding new versions.

Things that still need to happen here:

Basic Migration
New class and associations
Tests for PageUrl
Data migration to populate new PageUrl records for existing pages
Tooling to merge pages
Update import logic to account for this (I think the only change we need is to add a PageUrl record if we found a non-exact page match [i.e. we matched against url_key and not url])
Add to the API (We may need to update the ?url= query param on PagesController; not sure if we should return the URLs with page records or add a new controller for /api/v0/pages/{id}/urls).
Redirect merged pages to page they got merged into

Fixes #492.

This is required for date ranges involving +/-infinity, which we are about to add for tracking multiple URLs per page.

The tests for `Page` here won't pass, and there aren't any tests for `PageUrl`, but this is a start.

I forgot we still need to check the old fields until we've migrated all our data!

This was causing tests to fail by modifying in-memory record objects and incorrectly setting their values when doing operations that otherwise seem innocuous (e.g. `collection.to_a`). This is really just a flavor of rails/rails#40595, and I should probably add some commentary there. The bug can also be worked around by setting: config.active_record.time_zone_aware_attributes = false But I'm a little leery of messing with timezone handling in the middle of everything else right now. Tests pass, but that doesn't give me a huge amount of confidence, TBH.

This is pretty conservative -- we don't touch redirect URLs since it's not uncommon for a page to redirect somewhere else (usually a parent path) instead of sending a 403/404 when it gets removed. I left a TODO about the situation to prompt further ideation on strategies to do better.

Also be more explicit about what should happen to dependent models when pages are destroyed.

We won't ever create an orphaned version situation without manual intervention (for now, at least). The main goal is to support a scenarios where we had two Pages: Page A=https://example.gov/a Page B=https://example.gov/b 1. Page B didn't originally exist, but was created when Page A was moved to that URL. 2. For a while, Page A redirected to Page B, versions attached to each of them have the same content. 3. Then Page A became a 404 page, so versions attached to A have different content than those attached to B. A and B are the same conceptual page here, and now that we can support multiple URLs, we'd like to merge them. However, putting all the versions together would work fine right up until the point in time where A started responding with a 404 status code. Versions after that time will show up in our new, unified page's timeline as interleaved 200 and 404 responses, which is very confusing. To handle this, it should be possible to remove those 404 versions from the page. However, they have nowhere to go at this point, and no conceptual "page" that they really belong to. (For now, at least, we won't be removing those versions from the page automatically. We might do so in the future.)

Instead of leaving merged pages in place, delete them entirely and create records in a new `merged_pages` table that lists the old ID and the ID of the page they were merged into. Just in case we need it, the table has a JSON copy of most of the data from the old page. I'm ever so slightly worried about whether this is the right move instead of just leaving the pages in places with a new column for merge targets. It's complex, but leaves us with fewer places to make mistakes if our indexes and queries aren't correctly excluding merged pages.

Mr0grog · 2020-12-11T09:53:22Z

This should be set to go!

We were previously preferring URLs that are valid *now*, which is not so great if you are importing historical data.

When loading page data on the page details view, we might get a redirect from the API because a page was merged. This reflects that redirect in the UI's URL, ensuring that old change links redirect to the same view for the new, merged page. Merged pages are new in edgi-govdata-archiving/web-monitoring-db#793.

Mr0grog added 3 commits December 4, 2020 21:21

Add hack to fix ActiveRecord schema.rb bug

f455f5b

This is required for date ranges involving +/-infinity, which we are about to add for tracking multiple URLs per page.

Add migration and new PageUrl model

a7697fa

The tests for `Page` here won't pass, and there aren't any tests for `PageUrl`, but this is a start.

Add tests for/clean up logic in Page.find_by_url

baeb01c

Mr0grog added the enhancement label Dec 5, 2020

Mr0grog added 11 commits December 7, 2020 12:09

Update Page.urls *any* time Page.url changes

4201ba6

Fall back to old behavior for Page.find_by_url

6e4d4f8

I forgot we still need to check the old fields until we've migrated all our data!

Add data migration for populating page.urls

90d63a3

Add PageUrl tests, make PageUrl#url immutable

1284a39

OBEY RUBOCOP

a54f5ba

Add page merging

8712e8f

OBEY RUBOCOP

210f249

Fail harder when ensuring we have PageUrl records

710bd0b

Fix tests on CI

da25c6d

Try fixing CI tests again

4e4276c

Mr0grog force-pushed the 492-multi-url-pages branch from ea7b1e6 to 4e4276c Compare December 10, 2020 17:24

Mr0grog added 2 commits December 10, 2020 11:51

Add rake task to merge pages

0fa85ed

Mr0grog force-pushed the 492-multi-url-pages branch from ac2f8d1 to 5df107a Compare December 10, 2020 21:19

Mr0grog added 6 commits December 10, 2020 18:54

Wrap Page#merge in transaction

7f23115

Also be more explicit about what should happen to dependent models when pages are destroyed.

Redirect API requests for merged page to new page

248649d

OBEY RUBOCOP

9dd8f43

Expose PageUrls in API at /api/v0/pages/:id/urls

0b1bcf4

Mr0grog force-pushed the 492-multi-url-pages branch from 1d0f232 to 0b1bcf4 Compare December 11, 2020 09:36

Query URLs in API from page_urls instead of page

d1582cd

Mr0grog marked this pull request as ready for review December 11, 2020 09:53

When importing prefer URLs valid at *capture* time

60a23f5

We were previously preferring URLs that are valid *now*, which is not so great if you are importing historical data.

Fix foreign key bug when merging, log data as JSON

b98c19a

Mr0grog merged commit afa4988 into main Dec 13, 2020

Mr0grog deleted the 492-multi-url-pages branch December 13, 2020 00:55

Mr0grog added a commit that referenced this pull request Dec 13, 2020

Release #793

73dfdc5

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this pull request Dec 13, 2020

Deploy edgi-govdata-archiving/web-monitoring-db#793

57b80b7

Mr0grog mentioned this pull request Dec 15, 2020

Redirect to the correct URL for merged pages edgi-govdata-archiving/web-monitoring-ui#660

Merged

Mr0grog mentioned this pull request Jun 29, 2021

Evaluate new URLs for pages that redirect edgi-govdata-archiving/web-monitoring#163

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Add `PageUrl` model to describe pages with multiple URLs #793

Add `PageUrl` model to describe pages with multiple URLs #793

Uh oh!

Mr0grog commented Dec 5, 2020 •

edited

Loading

Uh oh!

Mr0grog commented Dec 11, 2020

Uh oh!

Uh oh!

Uh oh!

Add PageUrl model to describe pages with multiple URLs #793

Add PageUrl model to describe pages with multiple URLs #793

Uh oh!

Conversation

Mr0grog commented Dec 5, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Mr0grog commented Dec 11, 2020

Uh oh!

Uh oh!

Add `PageUrl` model to describe pages with multiple URLs #793

Add `PageUrl` model to describe pages with multiple URLs #793

Mr0grog commented Dec 5, 2020 •

edited

Loading