[ENH]: Optimize GetCollections and remove usage of raw gorm #5274
Conversation
Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving.

- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
Optimize GetCollections: Remove Raw GORM, Use Subqueries for Efficient Querying

This PR refactors the GetCollections logic within the collection DAO to optimize query performance and improve maintainability. The main enhancement replaces the previous Common Table Expression (CTE) and raw GORM SQL usage, which were reverted due to binding issues, with a subquery-based approach that stays within GORM's standard query-building capabilities. This optimizes for the case where the database's primary key is fully specified (i.e., database name and tenant ID provided), leveraging a predicate subquery to fetch the correct database_id, which integrates more cleanly with GORM and avoids the pitfalls of manual SQL string handling. Additional improvements are made to the organization and clarity of the query construction code, and a session-level Postgres setting (`random_page_cost`) is applied on the optimized path.

Key Changes
• Replaced raw GORM SQL and the CTE with a predicate subquery

Affected Areas
• go/pkg/sysdb/metastore/db/dao/collection.go

This summary was automatically generated by @propel-code-bot
```go
if isQueryOptimized {
	// Setting random_page_cost to 1.1 because that's usually the recommended value
	// for SSD based databases. This encourages index usage. The default used
	// to be 4.0 which was more for HDD based databases where random seeking
	// was way more expensive than sequential access.
	var dummy []Result
	stmt := query.Session(&gorm.Session{DryRun: true}).Find(&dummy).Statement
	sqlString := stmt.SQL.String()

	// Use a transaction to execute both commands in a single round trip
	err := s.db.Transaction(func(tx *gorm.DB) error {
		if err := tx.Exec("SET LOCAL random_page_cost = 1.1").Error; err != nil {
			return err
		}
		return tx.Raw(sqlString, stmt.Vars...).Scan(&results).Error
	})
	if err != nil {
		return nil, err
	}
} else {
	if err := query.Scan(&results).Error; err != nil {
		return nil, err
	}
}
```
[BestPractice]
The error handling logic is duplicated within the `if isQueryOptimized` and `else` branches. To improve maintainability and reduce code duplication, you could declare an `err` variable before the conditional block, assign the error from each branch to it, and then perform a single error check after the block.
For example:

```go
var err error
if isQueryOptimized {
	// ...
	err = s.db.Transaction(...)
} else {
	err = query.Scan(&results).Error
}
if err != nil {
	return nil, err
}
```
```go
Select("collections.id as collection_id, collections.name as collection_name, collections.configuration_json_str, collections.dimension, collections.database_id AS database_id, collections.ts as collection_ts, collections.is_deleted, collections.created_at as collection_created_at, collections.updated_at as collection_updated_at, collections.log_position, collections.version, collections.version_file_name, collections.root_collection_id, NULLIF(collections.lineage_file_name, '') AS lineage_file_name, collections.total_records_post_compaction, collections.size_bytes_post_compaction, collections.last_compaction_time_secs, databases.name as database_name, databases.tenant_id as db_tenant_id, collections.tenant as tenant").
	Joins("INNER JOIN databases ON collections.database_id = databases.id").
	Order("collections.created_at ASC")
isQueryOptimized := true && databaseName != "" && tenantID != ""
```
[BestPractice]
The `isQueryOptimized` condition hardcodes `true`, which means the optimization will always be attempted when both `databaseName` and `tenantID` are non-empty. Consider making this configurable or adding validation to ensure this optimization is beneficial in all cases.
```diff
-isQueryOptimized := true && databaseName != "" && tenantID != ""
+isQueryOptimized := databaseName != "" && tenantID != ""
```

This removes the hardcoded `true` and makes the condition more explicit about when optimization should be applied.
```diff
 } else {
 	query = query.Select(collection_targets + db_targets + collection_tenant).
 		Joins("INNER JOIN databases ON collections.database_id = databases.id")
 }

+query = query.Order("collections.created_at ASC")
+
-if databaseName != "" {
+if databaseName != "" && !isQueryOptimized {
 	query = query.Where("databases.name = ?", databaseName)
 }
-if tenantID != "" {
+if tenantID != "" && !isQueryOptimized {
 	query = query.Where("databases.tenant_id = ?", tenantID)
 }
```
[BestPractice]
To improve clarity and group related logic, consider moving the `Where` clauses for `databaseName` and `tenantID` into the `else` block. This makes it clearer that these filters only apply to the non-optimized query path and removes the need to check `!isQueryOptimized` twice.
```go
err = s.db.Transaction(func(tx *gorm.DB) error {
	if err := tx.Exec("SET LOCAL random_page_cost = 1.1").Error; err != nil {
```
[BestPractice]
The parameter `random_page_cost` is being set as a session-level configuration change inside a transaction. However, using `SET LOCAL` within a transaction means this setting will be reset after the transaction completes, which could lead to inconsistent query performance if the same query is executed outside this transaction context.
Consider either:
- Using `SET SESSION random_page_cost = 1.1` if you want this setting to persist for the entire database session
- Documenting that this optimization only applies to queries within this specific transaction

The current approach may cause performance inconsistencies between optimized and non-optimized query paths.
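The transaction-scoped behavior of `SET LOCAL` is in fact what the PR relies on: the tuning knob applies only to statements sent inside that transaction and reverts at COMMIT or ROLLBACK. A minimal sketch of the two-statement round trip, using a hypothetical `Execer` interface in place of `*gorm.DB` so it runs without a database:

```go
package main

import "fmt"

// Execer is a hypothetical stand-in for *gorm.DB inside a transaction.
type Execer interface {
	Exec(sql string, args ...any) error
}

// runWithLocalTuning sends the SET LOCAL knob first, then the query, in the
// same transaction; Postgres reverts the setting when the transaction ends.
func runWithLocalTuning(tx Execer, sqlString string, vars ...any) error {
	if err := tx.Exec("SET LOCAL random_page_cost = 1.1"); err != nil {
		return err
	}
	return tx.Exec(sqlString, vars...)
}

// recorder captures the statements so the ordering is visible.
type recorder struct{ stmts []string }

func (r *recorder) Exec(sql string, args ...any) error {
	r.stmts = append(r.stmts, sql)
	return nil
}

func main() {
	r := &recorder{}
	_ = runWithLocalTuning(r, "SELECT * FROM collections WHERE database_id = ?", "db-1")
	fmt.Println(r.stmts)
}
```

Because both statements travel in one transaction callback, they can be batched into a single round trip, and no cleanup statement is needed to restore `random_page_cost` for the rest of the connection's lifetime.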
Co-authored-by: propel-code-bot[bot] <203372662+propel-code-bot[bot]@users.noreply.github.com>
```go
Select("id").
	Where("tenant_id = ?", tenantID).
	Where("name = ?", databaseName).
	Limit(1)
```
[Documentation]
The subquery approach is well-designed and should be more reliable than the previous CTE approach. However, consider adding a comment explaining why the `Limit(1)` is important here: it ensures PostgreSQL treats this as an efficient lookup rather than a full scan, which was mentioned in the PR description as a key optimization technique.
Description of changes
An earlier change attempted to use a CTE to optimize the GetCollections query in the case where the databases table's primary key is fully specified. That required raw GORM SQL, which caused a few unexpected binding issues when converting from GORM SQL to raw SQL, leading us to revert that change here. In this change, we achieve the same goal of optimizing GetCollections while avoiding CTEs entirely: we fetch the required database_id from databases within a predicate subquery. When the predicate subquery contains a LIMIT 1, Postgres is known to treat it much like a CTE. Since GORM plays much more nicely with subqueries than CTEs, this is a much better fix.
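Conceptually, the optimized path moves the database lookup into a predicate on the join column. A sketch of the intended SQL shape (assumed for illustration; the exact text GORM generates will differ in column lists and placeholders):

```sql
SELECT collections.*, databases.name AS database_name
FROM collections
INNER JOIN databases ON collections.database_id = databases.id
WHERE collections.database_id = (
    SELECT id FROM databases
    WHERE tenant_id = $1 AND name = $2
    LIMIT 1  -- lets the planner resolve the lookup once, CTE-style
)
ORDER BY collections.created_at ASC;
```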
Test plan
How are these changes tested?
- `pytest` for python
- `yarn test` for js
- `cargo test` for rust

Migration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?