Support T-1 roll-ups on package download statistics to reduce storage requirements #3423

Closed
dtivel opened this issue Dec 15, 2016 · 4 comments

Comments

@dtivel
Contributor

dtivel commented Dec 15, 2016

Problem
Currently, we store 6 weeks' worth of raw data in the statistics database. The smallest unit of time that we report is a week. Every time we generate a report, we reprocess most (~98%) of that raw data. Storing this much raw data is costly, and reprocessing it repeatedly is unnecessary.

Solution
Roll up the previous day's worth of data (T-1) by dimension. Instead of having 42 days' worth of raw data, we should have 41 days of daily roll-ups and 1 day's worth of raw data. Capacity and DTU usage should both drop significantly.
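
The roll-up idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the statistics pipeline itself: the row fields (`timestamp`, `package_id`, `client`), the sample package names, and the `roll_up` helper are all hypothetical and do not come from the actual database schema.

```python
from collections import Counter
from datetime import date, datetime


def roll_up(raw_rows, today):
    """Aggregate rows older than `today` into daily counts per dimension.

    Rows from `today` are returned untouched so they stay available as raw data.
    The dimension key (day, package_id, client) is a hypothetical example.
    """
    rollups = Counter()
    kept_raw = []
    for row in raw_rows:
        day = row["timestamp"].date()
        if day < today:
            # One counter entry replaces any number of raw rows for that day.
            rollups[(day, row["package_id"], row["client"])] += 1
        else:
            kept_raw.append(row)
    return rollups, kept_raw


# Hypothetical sample data: two downloads yesterday, one today.
raw = [
    {"timestamp": datetime(2016, 12, 14, 9, 0), "package_id": "PackageA", "client": "ClientX"},
    {"timestamp": datetime(2016, 12, 14, 17, 30), "package_id": "PackageA", "client": "ClientX"},
    {"timestamp": datetime(2016, 12, 15, 8, 15), "package_id": "PackageA", "client": "ClientX"},
]
rollups, remaining = roll_up(raw, today=date(2016, 12, 15))
```

In this sketch the two rows from the previous day collapse into a single aggregated entry, while the current day's row is kept raw. The storage saving in a real table would depend on how many raw rows share the same dimension values per day.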

Notes

  1. Consider performing this daily roll-up only on the top n most active packages, as there may be little or no benefit for less active packages.
  2. The Fact_Download_NCI_LogFileName index on the Fact_Download table can likely be dropped, thereby reducing storage requirements further.
  3. PowerBI dashboard queries may need updating.
@shishirx34
Contributor

@xavierdecoster - This bug can be used to track your work.

skofman1 added this to the 2.13.2017 milestone on Feb 9, 2017
skofman1 changed the title from "Statistics DB nearing capacity" to "Support T-1 roll-ups on package download statistics to reduce storage requirements" on Feb 9, 2017
skofman1 removed this from the S114 - 2017.2.13 milestone on Mar 4, 2017
@xavierdecoster
Member

For this T-1 roll-up to be efficient and cost-effective, we need to reduce the number of SQL indexes on the fact tables, as they slow down the roll-ups considerably.

I will create separate issues to track these suggestions, as we still retain a lot of data that no current functionality actually needs.

@xavierdecoster
Member

xavierdecoster commented Mar 4, 2017

Status update

Database size at start of investigation: 621.7 GB

Size reductions by:

  • (Measured) index rebuilds: 199.09 GB (down to 422.61 GB) - completed on 2017-03-06
  • (Estimated) performing all recommended improvements as listed below: 108.13 GB (down to 314.48 GB)

Total estimated database size after performing all recommendations = 314.48 GB (a 25.59% reduction relative to the database size after the index rebuilds).
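
The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this status update. The variable names are illustrative, and the overall reduction versus the original size is a derived value that is not stated in the update itself.

```python
# Sizes in GB, taken from the status update above.
start_gb = 621.7
index_rebuild_saving_gb = 199.09      # measured
estimated_further_saving_gb = 108.13  # estimated

after_rebuild_gb = start_gb - index_rebuild_saving_gb
after_all_gb = after_rebuild_gb - estimated_further_saving_gb

# Reduction relative to the re-indexed database (the percentage quoted above).
reduction_vs_rebuilt_pct = estimated_further_saving_gb / after_rebuild_gb * 100

# Derived here, not stated in the update: reduction relative to the starting size.
reduction_vs_start_pct = (start_gb - after_all_gb) / start_gb * 100

print(f"After index rebuilds:      {after_rebuild_gb:.2f} GB")
print(f"After all recommendations: {after_all_gb:.2f} GB")
print(f"Reduction vs. re-indexed:  {reduction_vs_rebuilt_pct:.2f} %")
print(f"Reduction vs. start:       {reduction_vs_start_pct:.2f} %")
```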

Separate improvement suggestions:

@skofman1
Contributor

We decided to try a different solution. Will reopen as needed.
