Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Statistics] Reduce DB size by removing unused data - Operation dimension #3625

Closed
xavierdecoster opened this issue Mar 4, 2017 · 3 comments

Comments

@xavierdecoster
Copy link
Member

Is this really still needed? The v3 client does not emit this data using the NuGet-Operation HTTP header. See NuGet/Home#3708

Check this example: https://www.nuget.org/stats/packages/Newtonsoft.Json?groupby=Operation

Out of 4.88M package downloads in the last 6 weeks, the operation is unknown for 4.53M of them.

We could reconsider storing this data when revisiting the design of the NuGet statistics feature, and if the v3 protocol somehow allows collecting this data.

Note

Even when we no longer import these datapoints into the stats database, we still have all the data in the w3c-logs we upload to blob storage.

Impact on database

A database size reduction could be achieved totalling 2,585.90 MB (or 2.53 GB)!

Removing these indices will also improve performance of insert/delete operations against the [dbo].[Fact_Download] and [dbo].[Fact_Dist_Download] tables.

In addition, we generate less churn on the data by avoiding clean-up operations.

The following database objects would become obsolete:

  • Table [dbo].[Dimension_Operation] (size = 0.14 MB)
  • Table column [dbo].[Fact_Download].[Dimension_Operation_Id] (size = 2,585.75 MB)
  • Stored procedure [dbo].[EnsureOperationDimensionsExist]
  • Non-clustered index [dbo].[Dimension_Operation].[Dimension_Operation_NCI_Operation] (size = 16 KB)

The following stored procedure would be modified:

  • [dbo].[DownloadReportRecentPopularityDetailByPackage] would no longer need to join with the [dbo].[Dimension_Operation] table.

The following database objects would decrease in size:

  • Non-clustered index [dbo].[Fact_Download].[Fact_Download_NCI_Package_Id] would no longer need to include the Dimension_Operation_Id column.

Impact on jobs

The Stats.CreateAzureCdnWarehouseReports job would be modified to no longer process the Operation dimension data, and likely gain faster throughput (less data, less table-joins, smaller blobs to upload to storage).

The Stats.ImportAzureCdnStatistics job would no longer need to store the Operation dimension data, which would improve import speed, throughput, and overall performance of the job.

We could:

  • avoid the need to check whether an Operation already exists, and if not, create it during import;
  • avoid the need to retain already imported Operation dimensions into in-memory cache;
  • reduce complexity of import by no more linking Operation dimension identifiers to the download facts.

Impact on gallery

  • Remove the Operation pivot from the UI
  • Stats-report blobs would no longer include operation dimension data (and thus be smaller + faster to consume)
@skofman1
Copy link
Contributor

We decided to go in a different direction to reduce DB size. Will reopen if needed.

@xavierdecoster
Copy link
Member Author

@skofman1 if we are to remove operation from UI, we should maybe reconsider this issue as well? #3921

@skofman1
Copy link
Contributor

@xavierdecoster , there are to issues with this:

  1. Removing operation data will completely break rankings: Search rankings file is calculated by partial information #3778
  2. There is no reason to do this work, since after we reduce the raw data in the DB to 43 days from 90 days, the DB will be small enough to get us to the stats redesign work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants