Skip to content

[ENH]: Optimize GetCollections and remove usage of raw gorm #5274

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

tanujnay112
Copy link
Contributor

@tanujnay112 tanujnay112 commented Aug 14, 2025

Description of changes

An earlier change attempted to use a CTE to optimize the GetCollections query in the event where the databases pkey is fully specified. This led to using raw gorm which led to a few unexpected binding issues when converting from gorm sql to raw sql, leading us to revert that change here. In this change, we achieve the same goal of optimizing GetCollections but avoid the use of CTE's entirely. We get the required database_id from databases within a predicate subquery. If the predicate subquery has a LIMIT 1 in it, Postgres is known to treat this as a CTE. Since gorm plays much more nicely with subqueries than CTE's this is a much better fix.

  • Improvements & Bug fixes
    • ...
  • New functionality
    • ...

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

Copy link
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@tanujnay112 tanujnay112 marked this pull request as ready for review August 14, 2025 04:33
Copy link
Contributor

propel-code-bot bot commented Aug 14, 2025

Optimize GetCollections: Remove Raw GORM, Use Subqueries for Efficient Querying

This PR refactors the GetCollections logic within the collection DAO to optimize query performance and improve maintainability. The main enhancement involves replacing the previous Common Table Expression (CTE) and raw GORM SQL usage-which were reverted due to binding issues-with a subquery-based approach that stays within GORM's standard query-building capabilities. This optimizes for the case where the database's primary key is fully specified (i.e., database name and tenant ID provided), leveraging a predicate subquery to fetch the correct database_id, which integrates more cleanly with GORM and avoids the pitfalls of manual SQL string handling. Additional improvements are made to the organization and clarity of the query construction code, and a session-level Postgres random_page_cost parameter optimization is set during the optimized query path.

Key Changes

• Replaced raw GORM SQL/CTE approach with a predicate subquery for acquiring database_id when optimizing GetCollections queries.
• Refactored query selection logic to use the optimized subquery path only when both databaseName and tenantID are specified and the feature flag is set.
• Set random_page_cost to 1.1 at the transaction level for the optimized query path to improve performance on SSD-based Postgres installations.
• Normalized result mapping and post-processing code for collection and metadata rows.
• Removed prior raw SQL and CTE construction logic, fully reverting to GORM-native query generation.

Affected Areas

• go/pkg/sysdb/metastore/db/dao/collection.go (GetCollections and related methods)

This summary was automatically generated by @propel-code-bot

Comment on lines 243 to 266
if isQueryOptimized {
// Setting random_page_cost to 1.1 because that's usually the recommended value
// for SSD based databases. This encourages index usage. The default used
// to be 4.0 which was more for HDD based databases where random seeking
// was way more expensive than sequential access.
var dummy []Result
stmt := query.Session(&gorm.Session{DryRun: true}).Find(&dummy).Statement
sqlString := stmt.SQL.String()

// Use a transaction to execute both commands in a single round trip
err := s.db.Transaction(func(tx *gorm.DB) error {
if err := tx.Exec("SET LOCAL random_page_cost = 1.1").Error; err != nil {
return err
}
return tx.Raw(sqlString, stmt.Vars...).Scan(&results).Error
})
if err != nil {
return nil, err
}
} else {
if err := query.Scan(&results).Error; err != nil {
return nil, err
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

The error handling logic is duplicated within the if isQueryOptimized and else branches. To improve maintainability and reduce code duplication, you could declare an err variable before the conditional block, assign the error from each branch to it, and then perform a single error check after the block.

For example:

var err error
if isQueryOptimized {
    // ...
    err = s.db.Transaction(...)
} else {
    err = query.Scan(&results).Error
}

if err != nil {
    return nil, err
}

@tanujnay112 tanujnay112 force-pushed the 08-13-_enh_optimize_getcollections_and_remove_usage_of_raw_gorm branch from fde0390 to df94728 Compare August 14, 2025 15:58
Select("collections.id as collection_id, collections.name as collection_name, collections.configuration_json_str, collections.dimension, collections.database_id AS database_id, collections.ts as collection_ts, collections.is_deleted, collections.created_at as collection_created_at, collections.updated_at as collection_updated_at, collections.log_position, collections.version, collections.version_file_name, collections.root_collection_id, NULLIF(collections.lineage_file_name, '') AS lineage_file_name, collections.total_records_post_compaction, collections.size_bytes_post_compaction, collections.last_compaction_time_secs, databases.name as database_name, databases.tenant_id as db_tenant_id, collections.tenant as tenant").
Joins("INNER JOIN databases ON collections.database_id = databases.id").
Order("collections.created_at ASC")
isQueryOptimized := true && databaseName != "" && tenantID != ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

The isQueryOptimized condition hardcodes true which means the optimization will always be attempted when both databaseName and tenantID are non-empty. Consider making this configurable or adding validation to ensure this optimization is beneficial in all cases.

Suggested change
isQueryOptimized := true && databaseName != "" && tenantID != ""
isQueryOptimized := databaseName != "" && tenantID != ""

This removes the hardcoded true and makes the condition more explicit about when optimization should be applied.

Committable suggestion

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Comment on lines +198 to 211
} else {
query = query.Select(collection_targets + db_targets + collection_tenant).
Joins("INNER JOIN databases ON collections.database_id = databases.id")
}

if databaseName != "" {
query = query.Order("collections.created_at ASC")

if databaseName != "" && !isQueryOptimized {
query = query.Where("databases.name = ?", databaseName)
}
if tenantID != "" {
if tenantID != "" && !isQueryOptimized {
query = query.Where("databases.tenant_id = ?", tenantID)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

To improve clarity and group related logic, consider moving the Where clauses for databaseName and tenantID into the else block. This makes it clearer that these filters only apply to the non-optimized query path and removes the need to check !isQueryOptimized twice.

@tanujnay112 tanujnay112 force-pushed the 08-13-_enh_optimize_getcollections_and_remove_usage_of_raw_gorm branch from df94728 to 2d0b197 Compare August 14, 2025 16:46
Comment on lines +254 to +255
err = s.db.Transaction(func(tx *gorm.DB) error {
if err := tx.Exec("SET LOCAL random_page_cost = 1.1").Error; err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

The parameter random_page_cost is being set as a session-level configuration change inside a transaction. However, using SET LOCAL within a transaction means this setting will be reset after the transaction completes, which could lead to inconsistent query performance if the same query is executed outside this transaction context.

Consider either:

  1. Using SET SESSION random_page_cost = 1.1 if you want this setting to persist for the entire database session
  2. Or documenting that this optimization only applies to queries within this specific transaction

The current approach may cause performance inconsistencies between optimized and non-optimized query paths.

Co-authored-by: propel-code-bot[bot] <203372662+propel-code-bot[bot]@users.noreply.github.com>
Select("id").
Where("tenant_id = ?", tenantID).
Where("name = ?", databaseName).
Limit(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Documentation]

The subquery approach is well-designed and should be more reliable than the previous CTE approach. However, consider adding a comment explaining why the Limit(1) is important here - it ensures PostgreSQL treats this as an efficient lookup rather than a full scan, which was mentioned in the PR description as a key optimization technique.

@tanujnay112 tanujnay112 merged commit 982b1f1 into main Aug 14, 2025
61 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant