Add API route for sampled list of versions #946
Merged: Mr0grog merged 4 commits into `main` from `161-sometimes-i-dont-actually-want-every-version` on Jun 28, 2022
This adds a new API route at `/api/v0/pages/:page_id/versions/sampled` that gets a "sampled" list of versions -- that is, one version per sampling period. Right now, the sampling period is hard-coded to one day, but we might change that in the future. The main idea here is to give people a list of versions to look at, but reduce the amount of data in a useful way for pages that we capture extremely often (e.g. `https://epa.gov` gets many captures every day). For most use cases, we don't want to look more granularly than each day. This is part of edgi-govdata-archiving/web-monitoring#161
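To make the "one version per sampling period" idea concrete, here is a minimal sketch of daily sampling in plain Ruby. It is not the code from this PR: the hash keys (`:uuid`, `:capture_time`) and the `sample_by_day` helper are hypothetical, and it assumes the newest capture in each UTC day is the one that represents that day, which the description above does not specify.

```ruby
require 'date'
require 'time'

# Hypothetical helper: reduce a list of versions to one per UTC day,
# with the most recent day first.
# Assumption: the newest capture in each day stands in for that day.
def sample_by_day(versions)
  versions
    .group_by { |version| version[:capture_time].getutc.to_date }
    .map { |_, group| group.max_by { |version| version[:capture_time] } }
    .sort_by { |version| version[:capture_time] }
    .reverse
end

versions = [
  { uuid: 'a', capture_time: Time.utc(2022, 3, 14, 1) },
  { uuid: 'b', capture_time: Time.utc(2022, 3, 14, 13) },
  { uuid: 'c', capture_time: Time.utc(2022, 3, 14, 23) },
  { uuid: 'd', capture_time: Time.utc(2022, 3, 15, 8) }
]

# Three captures on Mar 14 collapse into one sample.
puts sample_by_day(versions).map { |version| version[:uuid] }.inspect
# prints ["d", "c"]
```

A page captured many times a day shrinks to one entry per day, while a page captured once a day is unchanged.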
Mr0grog commented on Mar 14, 2022 (comment on lines +25 to +27):

    # FIXME: this should probably have special pagination by date, since we
    # don't want a sample group split across two responses.
    paging = pagination(query)
May not want to wait for this, since it will be messy, and for the endpoint to be useful we also need to call it from the -ui project. Probably not worth blocking that work on this.
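The FIXME can be illustrated with a small sketch. It assumes versions arrive latest first and are split into fixed-size chunks. The `chunk_size` values and the `days_split_across_pages` helper are hypothetical and only show how a single day's group can straddle two responses.

```ruby
require 'date'
require 'time'

# Hypothetical helper: list the UTC days whose versions end up on more than
# one page when the version list is cut into chunks of chunk_size.
def days_split_across_pages(versions, chunk_size)
  pages_by_day = Hash.new { |hash, key| hash[key] = [] }
  versions.each_slice(chunk_size).with_index do |page, index|
    page.each do |version|
      pages_by_day[version[:capture_time].getutc.to_date] << index
    end
  end
  pages_by_day.select { |_, pages| pages.uniq.length > 1 }.keys
end

# Three captures per day on two days, latest first.
versions = [
  { uuid: 'f', capture_time: Time.utc(2022, 3, 15, 20) },
  { uuid: 'e', capture_time: Time.utc(2022, 3, 15, 12) },
  { uuid: 'd', capture_time: Time.utc(2022, 3, 15, 4) },
  { uuid: 'c', capture_time: Time.utc(2022, 3, 14, 20) },
  { uuid: 'b', capture_time: Time.utc(2022, 3, 14, 12) },
  { uuid: 'a', capture_time: Time.utc(2022, 3, 14, 4) }
]

puts days_split_across_pages(versions, 2).map(&:to_s).inspect
# prints ["2022-03-15", "2022-03-14"]
puts days_split_across_pages(versions, 3).map(&:to_s).inspect
# prints []
```

With a chunk size of 2, both days are split between responses, so neither response alone can pick the right sample for those days. A chunk size that happens to align with day boundaries avoids this, but nothing guarantees that alignment in real data.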
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this pull request on Jun 28, 2022
Mr0grog added a commit that referenced this pull request on Aug 26, 2022:
In #946, I added a `/pages/:id/versions/sampled` endpoint that returns no more than one version per date. The goal was to reduce the number of queries and the amount of data transferred to load some pages that have a *lot* of snapshots. It turns out this didn't work well because it still paginated by the number of versions it was querying, which means it still required the exact same number of requests (d'oh!). This changes the approach to paginate by date range instead, which should make a big difference. The code is pretty rough, but as long as it works, this needs to deploy so we can try it out with the bigger production data set.
Mr0grog added a commit that referenced this pull request on Aug 26, 2022:
In #946, I added a `/pages/:id/versions/sampled` endpoint that returns no more than one version per date. The goal was to reduce the number of queries and the amount of data transferred to load some pages that have a *lot* of snapshots. It turns out this didn't work well because it still paginated by the number of versions it was querying, which means it still required the exact same number of requests (d'oh!). (It did reduce the amount of data sent, but not by much after gzipping responses. It was still many megabytes for some pages.) This changes the approach to paginate by date range instead, which should make a big difference. The code is pretty rough, but as long as it works, I plan to deploy so we can try it out with the bigger production data set.
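As a rough illustration of paginating by date range rather than by version count, the sketch below cuts a span of days into fixed-size windows and samples each window independently. It is not the code from the commit: `date_windows`, `sampled_page`, `day_of`, and the `days_per_page` parameter are hypothetical names, and picking the newest capture per day is an assumption.

```ruby
require 'date'
require 'time'

# Hypothetical helper: the UTC day a version was captured on.
def day_of(version)
  version[:capture_time].getutc.to_date
end

# Split first_day..last_day into windows of days_per_page days, newest first.
def date_windows(first_day, last_day, days_per_page)
  raise ArgumentError, 'days_per_page must be positive' unless days_per_page.positive?

  windows = []
  window_end = last_day
  while window_end >= first_day
    window_start = [window_end - (days_per_page - 1), first_day].max
    windows << (window_start..window_end)
    window_end = window_start - 1
  end
  windows
end

# One sampled version per day inside the window, latest day first.
# Every version of a given day belongs to exactly one window, so a day's
# group is never split between two responses.
def sampled_page(versions, window)
  versions
    .select { |version| window.cover?(day_of(version)) }
    .group_by { |version| day_of(version) }
    .map { |_, group| group.max_by { |version| version[:capture_time] } }
    .sort_by { |version| version[:capture_time] }
    .reverse
end

versions = [
  { uuid: 'a', capture_time: Time.utc(2022, 3, 9, 1) },
  { uuid: 'b', capture_time: Time.utc(2022, 3, 9, 20) },
  { uuid: 'c', capture_time: Time.utc(2022, 3, 8, 12) },
  { uuid: 'd', capture_time: Time.utc(2022, 3, 2, 12) }
]

windows = date_windows(Date.new(2022, 3, 1), Date.new(2022, 3, 10), 4)
puts windows.length
# prints 3
puts sampled_page(versions, windows.first).map { |version| version[:uuid] }.inspect
# prints ["b", "c"]
```

The number of windows depends only on how many days are covered, so a page with thousands of captures over ten days needs the same three requests as a page with ten captures over those days.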
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ui that referenced this pull request on Aug 26, 2022:
In edgi-govdata-archiving/web-monitoring-db#946, we added a new API endpoint at `/api/v0/pages/:page_id/versions/sampled` that gets a "sampled" list of versions — that is, one version per sampling period (for now, the sampling period is always 1 day, so this is one version per day). This makes use of it. The main idea here is to load data reasonably quickly — some pages (e.g. `https://epa.gov/`) have a *lot* of versions, and require hundreds of requests and many megabytes of data in order to load the details page. Loading a one-per-day sample makes pages much quicker to load, with fewer, smaller requests, but still giving people a complete-enough list of versions to select from.
Mr0grog added a commit that referenced this pull request on Feb 7, 2023
Basically, instead of getting a list of versions, the `/api/v0/pages/:page_id/versions/sampled` route gets a list of objects, one per sample period.
The order of sample periods is always latest first.