Skip to content

Add PageUrl model to describe pages with multiple URLs #793

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Dec 13, 2020

Conversation

Mr0grog
Copy link
Member

@Mr0grog Mr0grog commented Dec 5, 2020

This adds support for multiple URLs per page. The url field directly on a Page should now be treated as the current canonical URL, but pages may also have any number of additional URLs associated with them. Page + URL combinations may have a timeframe (>= PageUrl.start_time and < PageUrl.end_time) describing when they are valid and a notes field that are meant to mostly carry anecdotal information. The timeframe does have one technical purpose: if two pages occupied the same URL, we prefer the most current when adding new versions.

Things that still need to happen here:

  • Basic Migration
  • New class and associations
  • Tests for PageUrl
  • Data migration to populate new PageUrl records for existing pages
  • Tooling to merge pages
  • Update import logic to account for this (I think the only change we need is to add a PageUrl record if we found a non-exact page match [i.e. we matched against url_key and not url])
  • Add to the API (We may need to update the ?url= query param on PagesController; not sure if we should return the URLs with page records or add a new controller for /api/v0/pages/{id}/urls).
  • Redirect merged pages to page they got merged into

Fixes #492.

This is required for date ranges involving +/-infinity, which we are about to add for tracking multiple URLs per page.
The tests for `Page` here won't pass, and there aren't any tests for `PageUrl`, but this is a start.
I forgot we still need to check the old fields until we've migrated all our data!
This was causing tests to fail by modifying in-memory record objects and incorrectly setting their values when doing operations that otherwise seem innocuous (e.g. `collection.to_a`). This is really just a flavor of rails/rails#40595, and I should probably add some commentary there.

The bug can also be worked around by setting:

    config.active_record.time_zone_aware_attributes = false

But I'm a little leery of messing with timezone handling in the middle of everything else right now. Tests pass, but that doesn't give me a huge amount of confidence, TBH.
@Mr0grog Mr0grog force-pushed the 492-multi-url-pages branch from ea7b1e6 to 4e4276c Compare December 10, 2020 17:24
This is pretty conservative -- we don't touch redirect URLs since it's not uncommon for a page to redirect somewhere else (usually a parent path) instead of sending a 403/404 when it gets removed. I left a TODO about the situation to prompt further ideation on strategies to do better.
@Mr0grog Mr0grog force-pushed the 492-multi-url-pages branch from ac2f8d1 to 5df107a Compare December 10, 2020 21:19
Also be more explicit about what should happen to dependent models when pages are destroyed.
We won't ever create an orphaned version situation without manual intervention (for now, at least). The main goal is to support a scenarios where we had two Pages:

    Page A=https://example.gov/a
    Page B=https://example.gov/b

1. Page B didn't originally exist, but was created when Page A was moved to that URL.
2. For a while, Page A redirected to Page B, versions attached to each of them have the same content.
3. Then Page A became a 404 page, so versions attached to A have different content than those attached to B.

A and B are the same conceptual page here, and now that we can support multiple URLs, we'd like to merge them. However, putting all the versions together would work fine right up until the point in time where A started responding with a 404 status code. Versions after that time will show up in our new, unified page's timeline as interleaved 200 and 404 responses, which is very confusing. To handle this, it should be possible to remove those 404 versions from the page. However, they have nowhere to go at this point, and no conceptual "page" that they really belong to.

(For now, at least, we won't be removing those versions from the page automatically. We might do so in the future.)
Instead of leaving merged pages in place, delete them entirely and create records in a new `merged_pages` table that lists the old ID and the ID of the page they were merged into. Just in case we need it, the table has a JSON copy of most of the data from the old page.

I'm ever so slightly worried about whether this is the right move instead of just leaving the pages in places with a new column for merge targets. It's complex, but leaves us with fewer places to make mistakes if our indexes and queries aren't correctly excluding merged pages.
@Mr0grog Mr0grog force-pushed the 492-multi-url-pages branch from 1d0f232 to 0b1bcf4 Compare December 11, 2020 09:36
@Mr0grog Mr0grog marked this pull request as ready for review December 11, 2020 09:53
@Mr0grog
Copy link
Member Author

Mr0grog commented Dec 11, 2020

This should be set to go!

We were previously preferring URLs that are valid *now*, which is not so great if you are importing historical data.
@Mr0grog Mr0grog merged commit afa4988 into main Dec 13, 2020
@Mr0grog Mr0grog deleted the 492-multi-url-pages branch December 13, 2020 00:55
Mr0grog added a commit that referenced this pull request Dec 13, 2020
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this pull request Dec 13, 2020
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ui that referenced this pull request Dec 15, 2020
When loading page data on the page details view, we might get a redirect from the API because a page was merged. This reflects that redirect in the UI's URL, ensuring that old change links redirect to the same view for the new, merged page.

Merged pages are new in edgi-govdata-archiving/web-monitoring-db#793.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow pages to have multiple URLs
1 participant