-
-
Notifications
You must be signed in to change notification settings - Fork 25
Add PageUrl
model to describe pages with multiple URLs
#793
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This is required for date ranges involving +/-infinity, which we are about to add for tracking multiple URLs per page.
The tests for `Page` here won't pass, and there aren't any tests for `PageUrl`, but this is a start.
I forgot we still need to check the old fields until we've migrated all our data!
This was causing tests to fail by modifying in-memory record objects and incorrectly setting their values when doing operations that otherwise seem innocuous (e.g. `collection.to_a`). This is really just a flavor of rails/rails#40595, and I should probably add some commentary there. The bug can also be worked around by setting: config.active_record.time_zone_aware_attributes = false But I'm a little leery of messing with timezone handling in the middle of everything else right now. Tests pass, but that doesn't give me a huge amount of confidence, TBH.
ea7b1e6
to
4e4276c
Compare
This is pretty conservative -- we don't touch redirect URLs since it's not uncommon for a page to redirect somewhere else (usually a parent path) instead of sending a 403/404 when it gets removed. I left a TODO about the situation to prompt further ideation on strategies to do better.
ac2f8d1
to
5df107a
Compare
Also be more explicit about what should happen to dependent models when pages are destroyed.
We won't ever create an orphaned version situation without manual intervention (for now, at least). The main goal is to support a scenarios where we had two Pages: Page A=https://example.gov/a Page B=https://example.gov/b 1. Page B didn't originally exist, but was created when Page A was moved to that URL. 2. For a while, Page A redirected to Page B, versions attached to each of them have the same content. 3. Then Page A became a 404 page, so versions attached to A have different content than those attached to B. A and B are the same conceptual page here, and now that we can support multiple URLs, we'd like to merge them. However, putting all the versions together would work fine right up until the point in time where A started responding with a 404 status code. Versions after that time will show up in our new, unified page's timeline as interleaved 200 and 404 responses, which is very confusing. To handle this, it should be possible to remove those 404 versions from the page. However, they have nowhere to go at this point, and no conceptual "page" that they really belong to. (For now, at least, we won't be removing those versions from the page automatically. We might do so in the future.)
Instead of leaving merged pages in place, delete them entirely and create records in a new `merged_pages` table that lists the old ID and the ID of the page they were merged into. Just in case we need it, the table has a JSON copy of most of the data from the old page. I'm ever so slightly worried about whether this is the right move instead of just leaving the pages in places with a new column for merge targets. It's complex, but leaves us with fewer places to make mistakes if our indexes and queries aren't correctly excluding merged pages.
1d0f232
to
0b1bcf4
Compare
This should be set to go! |
We were previously preferring URLs that are valid *now*, which is not so great if you are importing historical data.
Mr0grog
added a commit
to edgi-govdata-archiving/web-monitoring-ops
that referenced
this pull request
Dec 13, 2020
Mr0grog
added a commit
to edgi-govdata-archiving/web-monitoring-ui
that referenced
this pull request
Dec 15, 2020
When loading page data on the page details view, we might get a redirect from the API because a page was merged. This reflects that redirect in the UI's URL, ensuring that old change links redirect to the same view for the new, merged page. Merged pages are new in edgi-govdata-archiving/web-monitoring-db#793.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This adds support for multiple URLs per page. The
url
field directly on aPage
should now be treated as the current canonical URL, but pages may also have any number of additional URLs associated with them. Page + URL combinations may have a timeframe (>= PageUrl.start_time
and< PageUrl.end_time
) describing when they are valid and a notes field that are meant to mostly carry anecdotal information. The timeframe does have one technical purpose: if two pages occupied the same URL, we prefer the most current when adding new versions.Things that still need to happen here:
PageUrl
PageUrl
records for existing pagesPageUrl
record if we found a non-exact page match [i.e. we matched againsturl_key
and noturl
])?url=
query param onPagesController
; not sure if we should return the URLs with page records or add a new controller for/api/v0/pages/{id}/urls
).Fixes #492.