
Conversation

@JivusAyrus (Member) commented Dec 23, 2025

Summary by CodeRabbit

  • Chores
    • Updated GraphQL metrics database schema with an improved table structure for better performance and data organization.
    • Completed a phased migration of historical metrics data to the updated schema.
    • Restructured metrics storage with optimized indexing and partitioning for enhanced query efficiency.


Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.
  • Documentation has been updated on https://github.com/wundergraph/cosmo-docs.
  • I have read the Contributors Guide.

coderabbitai bot commented Dec 23, 2025

Walkthrough

A comprehensive database migration for the GraphQL metrics schema: it introduces a new optimized table (gql_metrics_schema_usage_5m_90d_v2) with enhanced indices, migrates historical data across temporal windows, drops the original table, and renames the new version to the original name for a seamless rollout.

Changes

Cohort / File(s) and Summary

Schema Creation
graphqlmetrics/migrations/20251223110743_new_graphql_schema_usage_table_5m_90d.sql
Defines the new gql_metrics_schema_usage_5m_90d_v2 table with 20 columns (DateTime, LowCardinality strings, Arrays, Bool, UInt64), bloom_filter and minmax indices on Path/SubgraphIDs/TypeNames/usage metrics, the SummingMergeTree engine, date partitioning, a composite ORDER BY key, and a 90-day TTL. A hedged sketch of this DDL appears after the list.

Data Migration in Temporal Batches
graphqlmetrics/migrations/202512231316{20..23}_migrate_data_*.sql, graphqlmetrics/migrations/20251223131624_migrate_data_last_1_day.sql
Five sequential migrations copying data from the old table to v2 in time windows: 90–68 days ago, 68–46 days ago, 46–24 days ago, 24 days to 1 day ago, and the last day. Each uses INSERT ... SELECT with a WHERE clause filtering on the Timestamp range.

Old Table Removal & Rename
graphqlmetrics/migrations/20251223131629_drop_old_gql_metrics_schema_usage_5m_90d.sql, graphqlmetrics/migrations/20251223131720_rename_new_gql_metrics_schema_usage_5m_90d.sql
The first migration drops the original gql_metrics_schema_usage_5m_90d table on up (with a full schema restore on down); the second renames gql_metrics_schema_usage_5m_90d_v2 back to gql_metrics_schema_usage_5m_90d to finalize the transition.
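
For orientation, here is a condensed sketch of what the new DDL plausibly looks like, reconstructed from the walkthrough above and the diff excerpt quoted later in this review. Column names and types beyond those mentioned in this PR discussion, the index names, and the TTL expression are assumptions, not the literal contents of the migration file:

-- Hedged sketch reconstructed from the PR description; not the actual migration file.
CREATE TABLE IF NOT EXISTS gql_metrics_schema_usage_5m_90d_v2 (
    Timestamp           DateTime,
    OrganizationID      LowCardinality(String),
    FederatedGraphID    LowCardinality(String),
    RouterConfigVersion LowCardinality(String),
    OperationHash       String,
    ClientName          LowCardinality(String),
    ClientVersion       LowCardinality(String),
    FieldName           String,
    NamedType           String,
    Path                Array(String),
    TypeNames           Array(String),
    SubgraphIDs         Array(String),
    IsArgument          Bool,
    IsInput             Bool,
    TotalUsages         UInt64,
    TotalErrors         UInt64,
    TotalClientErrors   UInt64,
    -- ...remaining columns elided (the real table has 20)...
    INDEX idx_path         Path        TYPE bloom_filter GRANULARITY 1,
    INDEX idx_subgraph_ids SubgraphIDs TYPE bloom_filter GRANULARITY 1,
    INDEX idx_type_names   TypeNames   TYPE bloom_filter GRANULARITY 1,
    INDEX idx_total_usages TotalUsages TYPE minmax       GRANULARITY 1
)
-- Without an explicit column list, SummingMergeTree sums every numeric
-- non-key column on merge (relevant to the aggregation comment below).
ENGINE = SummingMergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (FederatedGraphID, Timestamp, OrganizationID, FieldName, NamedType, OperationHash, ClientName, ClientVersion)
TTL Timestamp + toIntervalDay(90);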

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the PR title 'chore: improve gql_schema_usage_5m_90d table' accurately describes the main change: creating a new version of the schema usage table, migrating data, and optimizing the table structure.
  • Docstring Coverage ✅ Passed: no functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
graphqlmetrics/migrations/20251223110743_new_graphql_schema_usage_table_5m_90d.sql (1)

25-30: Index idx_operation_hash removed.

The old table included a bloom filter index on OperationHash, which is not present in the new schema. Since OperationHash is part of the ORDER BY clause, ClickHouse can still efficiently filter on it using primary key indexing. This is likely acceptable, but verify that query performance on OperationHash filters remains satisfactory after migration.
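
If OperationHash filter performance does regress after the migration, a skipping index can be re-added without rebuilding the table. A minimal sketch, assuming the standard ClickHouse DDL for data-skipping indices (index name and granularity are illustrative):

ALTER TABLE gql_metrics_schema_usage_5m_90d_v2
    ADD INDEX idx_operation_hash OperationHash TYPE bloom_filter GRANULARITY 1;

-- New parts are indexed automatically; backfill existing parts explicitly:
ALTER TABLE gql_metrics_schema_usage_5m_90d_v2
    MATERIALIZE INDEX idx_operation_hash;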

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8237c8d and f96cebe.

📒 Files selected for processing (8)
  • graphqlmetrics/migrations/20251223110743_new_graphql_schema_usage_table_5m_90d.sql
  • graphqlmetrics/migrations/20251223131620_migrate_data_90_to_68_days.sql
  • graphqlmetrics/migrations/20251223131621_migrate_data_68_to_46_days.sql
  • graphqlmetrics/migrations/20251223131622_migrate_data_46_to_24_days.sql
  • graphqlmetrics/migrations/20251223131623_migrate_data_24_to_1_day.sql
  • graphqlmetrics/migrations/20251223131624_migrate_data_last_1_day.sql
  • graphqlmetrics/migrations/20251223131629_drop_old_gql_metrics_schema_usage_5m_90d.sql
  • graphqlmetrics/migrations/20251223131720_rename_new_gql_metrics_schema_usage_5m_90d.sql
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.

Applied to files:

  • graphqlmetrics/migrations/20251223131629_drop_old_gql_metrics_schema_usage_5m_90d.sql
  • graphqlmetrics/migrations/20251223110743_new_graphql_schema_usage_table_5m_90d.sql
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build_push_image
  • GitHub Check: build_test
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (7)
graphqlmetrics/migrations/20251223110743_new_graphql_schema_usage_table_5m_90d.sql (1)

32-36: Verify ENGINE type change is intentional.

The new table uses SummingMergeTree while the down migration in 20251223131629_drop_old_gql_metrics_schema_usage_5m_90d.sql recreates the old table with SharedSummingMergeTree. This is a significant change:

  • SharedSummingMergeTree is for ClickHouse Cloud/replicated setups
  • SummingMergeTree is for single-node deployments

Please confirm this ENGINE change aligns with your deployment topology.
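
For illustration, the two declarations differ only in the engine family name (engine parameters, if any, are omitted here); the Shared* family is specific to ClickHouse Cloud's shared-storage architecture:

-- New up migration (self-hosted / single-node):
ENGINE = SummingMergeTree

-- Old table as restored by the down migration (ClickHouse Cloud):
ENGINE = SharedSummingMergeTree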

graphqlmetrics/migrations/20251223131620_migrate_data_90_to_68_days.sql (1)

1-10: Data migration logic is sound.

The time-windowed approach for migrating data in chunks (90–68 days here) is appropriate for large tables, since it keeps each INSERT ... SELECT bounded in size rather than copying 90 days of data in one long-running statement. The empty down migration is acceptable since the original table retains the data until the drop migration runs.
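
A minimal sketch of the windowed copy pattern, assuming the column layouts of both tables line up; the literal column list and boundary expressions in the repository may differ:

-- Copy one 22-day window from the old table into the new one.
-- SELECT * assumes identical column order in both tables; an explicit
-- column list is safer if the layouts differ.
INSERT INTO gql_metrics_schema_usage_5m_90d_v2
SELECT *
FROM gql_metrics_schema_usage_5m_90d
WHERE Timestamp >= now() - INTERVAL 90 DAY
  AND Timestamp <  now() - INTERVAL 68 DAY;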

graphqlmetrics/migrations/20251223131621_migrate_data_68_to_46_days.sql (1)

1-10: LGTM!

Time window correctly specified with no overlap with adjacent migrations.

graphqlmetrics/migrations/20251223131622_migrate_data_46_to_24_days.sql (1)

1-10: LGTM!

Consistent with the migration sequence pattern.

graphqlmetrics/migrations/20251223131623_migrate_data_24_to_1_day.sql (1)

1-10: LGTM!

Time window boundaries are correct.

graphqlmetrics/migrations/20251223131629_drop_old_gql_metrics_schema_usage_5m_90d.sql (1)

1-42: Down migration recreates schema but not data.

If a rollback is needed, the down migration will recreate an empty gql_metrics_schema_usage_5m_90d table. Combined with the empty down migrations in the data-migration files and the down step of the rename migration, a full rollback leaves you with:

  1. gql_metrics_schema_usage_5m_90d_v2 containing all data (renamed back from gql_metrics_schema_usage_5m_90d)
  2. A recreated but empty gql_metrics_schema_usage_5m_90d

This may be acceptable if the rollback strategy is to fix forward, but document this limitation for operational awareness.
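
The shape of that migration pair, sketched assuming goose-style annotated SQL files (the annotation convention and the stand-in column are assumptions; the point is what each direction does and does not restore):

-- +goose Up
DROP TABLE IF EXISTS gql_metrics_schema_usage_5m_90d;

-- +goose Down
-- Restores the table structure only; the dropped data is NOT recovered.
CREATE TABLE IF NOT EXISTS gql_metrics_schema_usage_5m_90d (
    Timestamp DateTime
    -- ...the remaining original columns and indices, elided here...
)
ENGINE = SharedSummingMergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (Timestamp); -- stand-in; the real file restores the original composite key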

graphqlmetrics/migrations/20251223131720_rename_new_gql_metrics_schema_usage_5m_90d.sql (1)

1-7: LGTM!

The rename operation correctly follows the drop migration, avoiding name conflicts. The down migration properly reverses the rename.
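
A sketch of the rename pair under the same assumed annotation convention; RENAME TABLE is a cheap metadata-only operation in ClickHouse, which is what makes this cutover step near-instant:

-- +goose Up
RENAME TABLE gql_metrics_schema_usage_5m_90d_v2 TO gql_metrics_schema_usage_5m_90d;

-- +goose Down
RENAME TABLE gql_metrics_schema_usage_5m_90d TO gql_metrics_schema_usage_5m_90d_v2;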

Snippet under review (graphqlmetrics/migrations/20251223110743_new_graphql_schema_usage_table_5m_90d.sql):

)
ENGINE = SummingMergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (FederatedGraphID, Timestamp, OrganizationID, FieldName, NamedType, OperationHash, ClientName, ClientVersion)

⚠️ Potential issue | 🟠 Major

ORDER BY change may alter aggregation semantics.

The new ORDER BY excludes columns present in the old table's ordering:

  • Path, TypeNames, SubgraphIDs (Array columns)
  • RouterConfigVersion, IsArgument, IsInput

With SummingMergeTree, rows matching on all ORDER BY columns are merged, summing TotalUsages, TotalErrors, TotalClientErrors. Rows that previously remained distinct (differing by Path, IsArgument, etc.) will now be merged together, potentially losing granularity.

Please confirm this aggregation behavior change is intentional and won't impact downstream queries that rely on per-path or per-argument-type breakdowns.
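
To make the implication concrete: with SummingMergeTree, numeric non-key columns are summed at merge time and all other non-key columns keep an arbitrary value from the merged rows, so per-Path or per-argument breakdowns stop being reliable. A hypothetical read under the new key (query shape is illustrative, not taken from this PR):

-- Merges are asynchronous, so reads must aggregate over the sorting key:
SELECT
    FederatedGraphID,
    FieldName,
    NamedType,
    sum(TotalUsages)       AS usages,
    sum(TotalErrors)       AS errors,
    sum(TotalClientErrors) AS client_errors
FROM gql_metrics_schema_usage_5m_90d
WHERE Timestamp >= now() - INTERVAL 1 DAY
GROUP BY FederatedGraphID, FieldName, NamedType;

-- By contrast, GROUP BY Path would return whichever Path value survived
-- each merge, since Path is no longer part of the ORDER BY key.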

