Skip to content

Commit ac2f8d1

Browse files
committed
Add new page URLs during import
This is pretty conservative -- we don't touch redirect URLs since it's not uncommon for a page to redirect somewhere else (usually a parent path) instead of sending a 403/404 when it gets removed. I left a TODO about the situation to prompt further ideation on strategies to do better.
1 parent 0fa85ed commit ac2f8d1

File tree

2 files changed

+37
-0
lines changed

2 files changed

+37
-0
lines changed

app/jobs/import_versions_job.rb

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -191,6 +191,20 @@ def page_for_record(record, create: true, row:)
191191
(record['page_maintainers'] || []).each {|name| page.add_maintainer(name)}
192192
(record['page_tags'] || []).each {|name| page.add_tag(name)}
193193

194+
# If the page was not an *exact* URL match, add the URL to the page.
195+
# (`page.find_by_url` used above will match by `url_key`, too.)
196+
page.urls.find_or_create_by(url: url)
197+
# TODO: Add URLs from redirects automatically. The main blocker for
198+
# this at the moment is the following situation:
199+
#
200+
# Two pages, A=https://example.com/about
201+
# B=https://example.com/about/locations
202+
# Page B is removed, but instead of returning a 404 or 403 status
203+
# code, it starts redirecting to Page A.
204+
#
205+
# This is unfortunately not uncommon, so we need some heuristics to
206+
# account for it, e.g. URL does not already belong to another page.
207+
194208
page
195209
end
196210

test/jobs/import_versions_job_test.rb

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -318,4 +318,27 @@ def teardown
318318
version = pages(:home_page).latest
319319
assert_equal('text/html', version.media_type)
320320
end
321+
322+
test 'adds URL to an existing page if the version was matched to a page with a different URL' do
323+
page = Page.create(url: 'https://example.gov/office')
324+
import = Import.create_with_data(
325+
{ user: users(:alice) },
326+
[
327+
{
328+
page_url: 'http://example.gov/office/',
329+
capture_time: Time.now - 1.second,
330+
uri: 'https://test-bucket.s3.amazonaws.com/whatever',
331+
version_hash: 'abc',
332+
}
333+
].map(&:to_json).join("\n")
334+
)
335+
ImportVersionsJob.perform_now(import)
336+
337+
assert_equal(1, page.versions.count, 'Version was added to the right page')
338+
assert_equal(
339+
['https://example.gov/office', 'http://example.gov/office/'],
340+
page.urls.pluck(:url),
341+
'New URL was added to the page'
342+
)
343+
end
321344
end

0 commit comments

Comments
 (0)