Add API route for sampled list of versions #946

Merged: Mr0grog merged 4 commits into main from 161-sometimes-i-dont-actually-want-every-version on Jun 28, 2022

Conversation

Mr0grog
Member

@Mr0grog Mr0grog commented Mar 14, 2022

This adds a new API route at /api/v0/pages/:page_id/versions/sampled that gets a "sampled" list of versions — that is, one version per sampling period. Right now, the sampling period is hard-coded to one day, but we might change that in the future. The main idea here is to give people a list of versions to look at, but reduce the amount of data in a useful way for pages that we capture extremely often (e.g. https://epa.gov/ gets many captures every day). For most use cases, we don't want to look more granularly than each day.

Basically, instead of getting a list of versions, this gets a list of objects like:

{
  time: "2022-01-01",
  version_count: 3,
  // A version object like we would get from /versions.
  // This is the latest `different` version in the sample period
  version: { ... }
}

The order of sample periods is always latest first.
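
As a rough illustration of the response shape described above, here is a minimal Python sketch of one-per-day sampling. It is not the code from this PR: the function name, the `uuid` and `capture_time` field names, and the ISO 8601 timestamp format are assumptions made for the example. It also simply picks the latest capture in each day and ignores whether that version is marked as different, which simplifies the behavior described above.

```python
from collections import defaultdict
from datetime import datetime


def sample_versions_by_day(versions):
    """Return one sample per calendar day, latest day first.

    `versions` is a list of dicts, each with an ISO 8601 `capture_time`
    string (hypothetical field name). All timestamps are assumed to use
    the same UTC offset, so the calendar date is unambiguous.
    """
    by_day = defaultdict(list)
    for version in versions:
        captured = datetime.fromisoformat(version["capture_time"])
        by_day[captured.date().isoformat()].append((captured, version))

    samples = []
    # ISO dates sort lexicographically, so reverse order is latest first.
    for day in sorted(by_day, reverse=True):
        group = by_day[day]
        # Simplification: keep the latest capture in the day.
        latest = max(group, key=lambda pair: pair[0])[1]
        samples.append({
            "time": day,
            "version_count": len(group),
            "version": latest,
        })
    return samples
```

Each returned item has the same three keys as the object above: the sample period, the number of versions captured in it, and the single version chosen to represent it.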

This is part of edgi-govdata-archiving/web-monitoring#161

Comment on lines +25 to +27
# FIXME: this should probably have special pagination by date, since we
# don't want a sample group split across two responses.
paging = pagination(query)
Member Author

May not want to wait for this, since it will be messy, and for the endpoint to be useful we also need to call it from the -ui project. Probably not worth blocking that work on this.

@Mr0grog Mr0grog merged commit 711edc6 into main Jun 28, 2022
@Mr0grog Mr0grog deleted the 161-sometimes-i-dont-actually-want-every-version branch June 28, 2022 02:04
Mr0grog added a commit that referenced this pull request Jun 28, 2022
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this pull request Jun 28, 2022
Mr0grog added a commit that referenced this pull request Aug 26, 2022
In #946, I added a `/pages/:id/versions/sampled` endpoint that returns no more than one version per date. The goal was to reduce the number of queries and the amount of data transferred to load some pages that have a *lot* of snapshots. It turns out this didn't work well because it still paginated by the number of versions it was querying, which means it still required the exact same number of requests (d'oh!). This changes the approach to paginate by date range instead, which should make a big difference. The code is pretty rough, but as long as it works, this needs to deploy so we can try it out with the bigger production data set.
Mr0grog added a commit that referenced this pull request Aug 26, 2022
In #946, I added a /pages/:id/versions/sampled endpoint that returns no more than one version per date. The goal was to reduce the number of queries and the amount of data transferred to load some pages that have a lot of snapshots. It turns out this didn't work well because it still paginated by the number of versions it was querying, which means it still required the exact same number of requests (d'oh!). (It did reduce the amount of data sent, but not by much after gzipping responses. It was still many megabytes for some pages.)

This changes the approach to paginate by date range instead, which should make a big difference. The code is pretty rough, but as long as it works, I plan to deploy so we can try it out with the bigger production data set.
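
To make the difference concrete, here is a small Python sketch of paginating by date range. It is an illustration only, not the code from this commit: the function name, the `days_per_page` parameter, and the half-open `(start, end)` window convention are all assumptions. Because each window covers whole days, a day's sample group can never be split across two responses, and the number of requests depends on the span of dates rather than on how many versions were captured.

```python
from datetime import date, timedelta


def date_range_pages(first_day, last_day, days_per_page):
    """Yield (start, end) date windows, newest first.

    `start` is inclusive and `end` is exclusive, so consecutive windows
    share a boundary without overlapping. Every day between `first_day`
    and `last_day` (both inclusive) falls in exactly one window.
    """
    if days_per_page < 1:
        raise ValueError("days_per_page must be at least 1")

    end = last_day + timedelta(days=1)
    while end > first_day:
        start = max(first_day, end - timedelta(days=days_per_page))
        yield (start, end)
        end = start
```

With this scheme, a page captured many times a day over ten days and a window of four days needs three requests, no matter how many versions exist in each day.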
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ui that referenced this pull request Aug 26, 2022
In edgi-govdata-archiving/web-monitoring-db#946, we added a new API endpoint at `/api/v0/pages/:page_id/versions/sampled` that gets a "sampled" list of versions — that is, one version per sampling period (for now, the sampling period is always 1 day, so this is one version per day). This makes use of it.

The main idea here is to load data reasonably quickly — some pages (e.g. `https://epa.gov/`) have a *lot* of versions, and require hundreds of requests and many megabytes of data in order to load the details page. Loading a one-per-day sample makes pages much quicker to load, with fewer, smaller requests, but still giving people a complete-enough list of versions to select from.
Mr0grog added a commit that referenced this pull request Feb 3, 2023
In #946 and #992, I added a route for sampling versions of a page (a better solution for most use cases than listing every version), but failed to include documentation. This adds a helpful description of the route.
Mr0grog added a commit that referenced this pull request Feb 7, 2023
In #946 and #992, I added a route for sampling versions of a page (a better solution for most use cases than listing every version), but failed to include documentation. This adds a helpful description of the route.