[M1253] Using sql for update summary cache #554
base: dev
## Conversation
## Walkthrough
The pull request introduces systematic modifications across multiple SQL model files in the project. The primary changes involve replacing `NULL` with dynamically generated UUIDs for the `id` field in various SQL models, ensuring unique record identification. Additionally, some models have been updated with new fields or adjusted field configurations, such as increasing decimal precision for biomass measurements and adding JSON fields to capture more complex data structures. The changes aim to enhance data integrity and expand the models' capabilities for storing and representing scientific observation data.
## Changes
| File | Changes |
|------|---------|
| `src/api/models/sql_models/beltfish.py` | - Changed `id` generation from `NULL` to `uuid_generate_v4()` <br> - Increased `biomass_kgha` max digits from 11 to 13 |
| `src/api/models/sql_models/benthiclit.py` | - Changed `id` generation from `NULL` to `uuid_generate_v4()` |
| `src/api/models/sql_models/benthicpit.py` | - Changed `id` generation from `NULL` to `uuid_generate_v4()` <br> - Added `percent_cover_benthic_category` JSON field <br> - Added `percent_cover_life_histories` JSON field |
| `src/api/models/sql_models/benthicpqt.py` | - Changed `id` generation from `NULL` to `uuid_generate_v4()` <br> - Added `num_points_nonother` field |
| `src/api/models/sql_models/bleachingqc.py` | - Changed `id` generation from `NULL` to `uuid_generate_v4()` |
| `src/api/models/sql_models/habitatcomplexity.py` | - Changed `id` generation from `NULL` to `uuid_generate_v4()` <br> - Removed duplicate `sample_time` field <br> - Added `score_avg` and `score_sd` decimal fields |
| `src/api/models/summary_sample_events.py` | - Changed `id` generation from `NULL` to `uuid_generate_v4()` |
| `src/api/utils/summary_cache.py` | - Added multiple new SQL-related utility functions <br> - Replaced existing cache update functions with SQL-focused implementations |
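To make the recurring `id` change concrete, here is a minimal sketch (the column names and source view are placeholders, not the project's actual SQL):

```python
# Placeholder fragments illustrating the id change listed above; the real
# view definitions live in the sql_models files.
OLD_SELECT = "SELECT NULL AS id, sample_event_id, biomass_kgha FROM beltfish_su"
NEW_SELECT = "SELECT uuid_generate_v4() AS id, sample_event_id, biomass_kgha FROM beltfish_su"
```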
## Possibly related PRs
- data-mermaid/mermaid-api#551: Modifies summary cache update logic by removing a transaction lock wrapper around an update call; related as both touch update mechanisms but differ in abstraction and scope.
- data-mermaid/mermaid-api#528: Also modifies the `biomass_kgha` field's `max_digits` attribute in `BeltFishSUSQLModel`, increasing precision, directly related to the same model and field changes.
- data-mermaid/mermaid-api#526: Introduces UUID generation for the `id` field in `BeltFishSUSQLModel`, similar to changes in this PR, indicating a strong connection in how unique identifiers are handled.
## Suggested reviewers
- gridcell
Actionable comments posted: 0
🧹 Nitpick comments (9)
src/api/utils/summary_cache.py (8)
67-75: Consider adding a docstring to `_convert_to_sql`.
A short docstring can boost readability and clarify the function's purpose.
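A sketch of what such a docstring could look like; the body below is inferred from this review's description of `mogrify`-based conversion, not the actual implementation:

```python
from django.db import connection

def _convert_to_sql(queryset):
    """Render a queryset to a raw SQL string with its parameters safely
    bound via psycopg2's mogrify, so it can be embedded in a larger
    INSERT ... SELECT statement."""
    sql, params = queryset.query.sql_with_params()
    with connection.cursor() as cur:
        return cur.mogrify(sql, params).decode("utf-8")
```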
111-140: Refactor repeated pattern for improved maintainability.
This function uses a delete-then-insert approach also seen elsewhere. Extracting shared logic into a helper could simplify maintenance.
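One possible shape for that helper (a sketch; the `(target_model, sql_model)` pairing and the `sql_table` API are inferred from the surrounding comments):

```python
from django.db import connection

def _sql_refresh(project_id, suggested_citation, model_pairs, skip_deletes=False):
    # Delete-then-insert each (target_model, sql_model) pair in one pass.
    statements = []
    for target_model, sql_model in model_pairs:
        if not skip_deletes:
            statements.append(_delete(target_model, project_id))
        src_sql = _convert_to_sql(
            sql_model.objects.all().sql_table(project_id=project_id)
        )
        statements.append(_insert(target_model, src_sql, suggested_citation))
    with connection.cursor() as cur:
        cur.execute("".join(statements))
```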
142-175: Maintain consistent error handling and logs.
Similar to `_sql_update_cache()`. Consider a uniform logging or exception strategy for easier debugging.
177-190: Unify redundant delete-insert flow.
This function also performs a delete-then-insert operation. Creating a unifying helper can reduce repeated code.
Line range hint 407-417: Repeated sample unit block.
This code is similar across protocols. Factoring out shared logic would improve clarity.
Line range hint 431-441: Consider merging BenthicPIT logic.
Check if this code can be part of a generalized solution for sample units.
Line range hint 443-453: Keep parameter patterns consistent.
The parameter ordering in `_sql_update_cache` calls can be standardized for readability.
455-458: Bleaching QC summary mirrors other sample units.
Investigate combining these repeated steps into a common function.

src/api/models/summary_sample_events.py (1)
Line range hint 125-450: Consider adding indexes for frequently joined columns.
The query performs multiple joins on columns like `sample_event_id`, `project_id`, and `management_id`. Consider adding indexes if not already present to improve query performance. Suggested indexes:

```sql
CREATE INDEX IF NOT EXISTS idx_sample_event_site_id ON sample_event(site_id);
CREATE INDEX IF NOT EXISTS idx_sample_event_management_id ON sample_event(management_id);
CREATE INDEX IF NOT EXISTS idx_site_project_id ON site(project_id);
CREATE INDEX IF NOT EXISTS idx_management_compliance_id ON management(compliance_id);
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- src/api/models/sql_models/beltfish.py (1 hunks)
- src/api/models/sql_models/benthiclit.py (1 hunks)
- src/api/models/sql_models/benthicpit.py (1 hunks)
- src/api/models/sql_models/benthicpqt.py (1 hunks)
- src/api/models/sql_models/bleachingqc.py (1 hunks)
- src/api/models/sql_models/habitatcomplexity.py (1 hunks)
- src/api/models/summary_sample_events.py (1 hunks)
- src/api/utils/summary_cache.py (8 hunks)
🔇 Additional comments (16)
src/api/utils/summary_cache.py (7)
77-79: Looks good to me!
Function `_columns` is straightforward and returns model fields for usage.
81-105: Ensure column matching and potential docstring.
If the subselect in `sql` does not include all columns from `model_cls`, it may fail. Consider verifying column alignment or adding a fallback.
107-109: Sanitize critical parameters to prevent injection.
While `check_uuid` helps mitigate the risk, ensure every code path correctly validates `project_id`.
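For reference, a stand-in for the kind of validation `check_uuid` performs (the project's actual helper may differ):

```python
import uuid

def check_uuid(value):
    # Round-tripping through uuid.UUID raises ValueError on anything
    # malformed, so only canonical UUID strings reach the SQL layer.
    return str(uuid.UUID(str(value)))
```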
307-307: Double check test project skip logic.
Early return might omit needed actions. Verify that this cannot produce partial data states.
Line range hint 419-429: LIT protocol update is consistent.
No significant issues spotted here.
Line range hint 461-471: Confirm model references for HabitatComplexity.
Ensure these models align with the new schema changes and are tested after the modifications.
472-472: Ensures project summary updates at the end.
Implementation looks correct.

src/api/models/sql_models/habitatcomplexity.py (1)
134-134: Use of `uuid_generate_v4()` is a solid choice.
It ensures unique identifiers. Confirm that the corresponding database extension is enabled.
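If the extension is not already enabled, a migration along these lines would add it (a sketch with a hypothetical dependency, not the project's actual migration):

```python
from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [("api", "0001_initial")]  # hypothetical dependency

    operations = [
        migrations.RunSQL(
            'CREATE EXTENSION IF NOT EXISTS "uuid-ossp";',
            reverse_sql='DROP EXTENSION IF EXISTS "uuid-ossp";',
        ),
    ]
```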
It ensures unique identifiers. Confirm that the corresponding database extension is enabled.src/api/models/sql_models/benthicpit.py (1)
283-283: Switching from `NULL` to `uuid_generate_v4()` for the id field is beneficial.
This improves data integrity by guaranteeing unique IDs.src/api/models/sql_models/benthicpqt.py (1)
292-292: LGTM! Using UUID for unique record identification.
The change from `NULL` to `uuid_generate_v4()` ensures each record has a unique identifier, improving data integrity.

src/api/models/sql_models/benthiclit.py (1)
294-294: LGTM! Using UUID for unique record identification.
The change from `NULL` to `uuid_generate_v4()` ensures each record has a unique identifier, improving data integrity.

src/api/models/sql_models/beltfish.py (2)
287-287: LGTM! Using UUID for unique record identification.
The change from `NULL` to `uuid_generate_v4()` ensures each record has a unique identifier, improving data integrity.
287-287: Verify the impact of increased `biomass_kgha` precision.
The `max_digits` increase from 11 to 13 for `biomass_kgha` allows for larger values. Please ensure this change doesn't affect any downstream calculations or reports.
Run the following script to check for any potential overflow in existing biomass values:
✅ Verification successful
The `biomass_kgha` precision change is safe to implement
The increase from 11 to 13 digits in `max_digits` while maintaining 2 decimal places is safe because:
- All biomass calculations and aggregations remain unchanged
- No validation logic or downstream calculations depend on the field size
- The change only increases the maximum possible value without affecting precision
- Existing test cases use biomass values well within both old and new limits
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check for potential overflow in biomass values
# Test: Search for biomass values that might exceed the old limit
rg -A 5 "biomass_kgha.*>.*99999999999" # Old limit was 11 digits
```

Length of output: 40
Script:
```bash
#!/bin/bash
# Find table definition and biomass_kgha usage
echo "=== Table Definition ==="
rg -A 10 -B 10 "CREATE TABLE.*beltfish"

echo -e "\n=== Biomass Usage in Calculations ==="
rg -A 5 -B 5 "biomass_kgha" --type sql

echo -e "\n=== Python Code Using Biomass ==="
ast-grep --pattern 'biomass_kgha'
```

Length of output: 837
Script:
```bash
#!/bin/bash
# Get detailed field definitions and validation
echo "=== DecimalField Definitions ==="
rg -A 3 -B 1 "biomass_kgha.*=.*DecimalField"

echo -e "\n=== Range Filter Usage ==="
rg -A 5 -B 5 "class.*RangeFilter.*biomass_kgha"

echo -e "\n=== Validation Logic ==="
rg -A 5 "validate.*biomass"
```

Length of output: 68720
Script:
```bash
#!/bin/bash
# Search for biomass calculations and validations
echo "=== Biomass Calculations ==="
rg -A 5 "biomass.*=|biomass_kgha.*=" --type python

echo -e "\n=== SQL Calculations ==="
rg -A 5 "biomass_kgha" --type sql

echo -e "\n=== Validation Logic ==="
rg -A 5 "def clean.*biomass|validate.*biomass" --type python
```

Length of output: 408
Script:
```bash
#!/bin/bash
# Search for biomass calculations and validations without file type restrictions
echo "=== Biomass Calculations ==="
rg -A 5 "biomass.*=|biomass_kgha.*="

echo -e "\n=== SQL Calculations ==="
rg -A 5 "SELECT.*biomass_kgha|UPDATE.*biomass_kgha|INSERT.*biomass_kgha"

echo -e "\n=== Validation Logic ==="
rg -A 5 "def clean.*biomass|validate.*biomass|def validate"
```

Length of output: 65680
Script:
```bash
#!/bin/bash
# Search for biomass calculations excluding JSON files
echo "=== Biomass Field Definitions ==="
fd -e py | xargs rg "biomass_kgha.*=.*DecimalField"

echo -e "\n=== Biomass Calculations ==="
fd -e py | xargs rg "biomass.*=|calculate.*biomass"

echo -e "\n=== SQL Queries ==="
fd -e sql | xargs rg "biomass_kgha"
```

Length of output: 16472
src/api/models/sql_models/bleachingqc.py (1)
347-347: LGTM! Using UUID for unique record identification.
The change from `NULL` to `uuid_generate_v4()` ensures each record has a unique identifier, improving data integrity.

src/api/models/summary_sample_events.py (2)
Line range hint 8-591: LGTM! Well-structured model hierarchy.
The model structure follows Django best practices with proper use of abstract base classes, JSON fields, and model managers.
125-125: LGTM! Verify PostgreSQL extension requirement.
The change from `NULL` to `uuid_generate_v4()` for the `id` field improves data integrity by ensuring unique identifiers. Run this script to verify the uuid-ossp extension is enabled:
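A minimal check along these lines (a sketch run from `manage.py shell`, not the original script) would do:

```python
from django.db import connection

with connection.cursor() as cur:
    cur.execute("SELECT 1 FROM pg_extension WHERE extname = 'uuid-ossp';")
    print("uuid-ossp enabled" if cur.fetchone() else "uuid-ossp NOT enabled")
```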
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/api/utils/summary_cache.py (3)
107-108: Use parameter binding for safety and consistency.
Currently, the `project_id` is string-interpolated into the SQL, which can be safe if strictly validated by `check_uuid`. However, adopting parameter binding (e.g., placeholders) is a more robust pattern. Below is a conceptual diff (requires redesign of how `_delete` is invoked):

```diff
-def _delete(model_cls, project_id):
-    return f"DELETE FROM {model_cls._meta.db_table} WHERE project_id = '{project_id}';"
+def _delete(model_cls):
+    return f"DELETE FROM {model_cls._meta.db_table} WHERE project_id = %s"
```
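Assuming `_delete` is reworked as in the conceptual diff above, the call site would pass the id separately (a hypothetical sketch):

```python
from django.db import connection

# The statement carries a %s placeholder; project_id travels as a bound
# parameter instead of being interpolated into the string.
with connection.cursor() as cur:
    cur.execute(_delete(obs_model), [project_id])
```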
111-141: Multiple concatenated statements can obscure partial failures.
Joining several DELETE and INSERT statements for different tables into a single `cur.execute("".join(sql))` can hide which part failed in case of errors. Consider separating them or wrapping each in a `try/except` for easier debugging.
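A sketch of the separated form (the logger name and statement list are assumptions):

```python
import logging

from django.db import DatabaseError, connection, transaction

logger = logging.getLogger(__name__)

def _execute_statements(statements):
    # Run each statement individually so a failure is attributable,
    # while transaction.atomic() keeps the all-or-nothing semantics.
    with transaction.atomic(), connection.cursor() as cur:
        for statement in statements:
            try:
                cur.execute(statement)
            except DatabaseError:
                logger.exception("Summary cache statement failed: %.200s", statement)
                raise
```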
408-473: Consider reducing repetitive calls for each protocol.
Each `if sample_unit is None or sample_unit == X_PROTOCOL:` block essentially calls `_sql_update_cache(...)` or `_sql_update_bleaching_qc_summary(...)` in a similar pattern. Extracting a map from protocol to function could streamline maintenance and reduce repetition.
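The protocol-to-function map might look like this (a sketch; the partial arguments are illustrative and the protocol constants are assumed to exist in the codebase):

```python
from functools import partial

# Illustrative dispatch table; each partial would bind that protocol's
# obs/su/se models to _sql_update_cache.
updaters = {
    FISHBELT_PROTOCOL: partial(_sql_update_cache, *beltfish_models),
    BLEACHINGQC_PROTOCOL: _sql_update_bleaching_qc_summary,
}

for protocol, updater in updaters.items():
    if sample_unit is None or sample_unit == protocol:
        updater(project_id, skip_updates=skip_updates)
```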
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/api/models/summary_sample_events.py (1 hunks)
- src/api/utils/summary_cache.py (8 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/api/models/summary_sample_events.py
🔇 Additional comments (5)
src/api/utils/summary_cache.py (5)
67-75: Consider potential size limitations with `mogrify` output.
While `mogrify` safely parameterizes the query, for extremely large or complex queries, it may generate a very large SQL string that risks exceeding memory or length limits. It's worth verifying that all upstream queries remain reasonably bounded by project-level filters. Would you like a script to check related usages of `_convert_to_sql` throughout the codebase?
77-80: Good use of Django model introspection.
Retrieving field names via `model_cls._meta.fields` is a straightforward approach and helps keep `_insert` logic flexible. This looks good.
81-105: Validate handling of default or auto-increment fields in `_insert` logic.
By mapping extra columns like `created_on` to `now()` selectively, the insert logic is carefully overriding certain fields. Ensure that auto-increment primary keys or fields with default constraints do not result in unexpected collisions or duplication.
142-175: Consistency issue: parallel functions for BleachingQC.
You have `_sql_update_bleaching_qc_summary` plus the older `_update_bleaching_qc_summary`. Similar to a past comment on line 173, consider removing or consolidating one of them to maintain consistency.
177-190: Validate partial updates with skip logic.
When `skip_updates` is `True`, `_delete(...)` calls are skipped. Ensure that subsequent inserts won't lead to duplicate records if data already exists for the same project.
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/api/utils/summary_cache.py (3)
67-75: Consider adding a docstring to `_convert_to_sql` for future maintainability.
It might also be helpful to integrate logging or robust error handling in case the generated SQL string needs debugging in production environments.
77-79: Add a brief explanation for `_columns` usage.
While this function is straightforward, a short docstring clarifying its purpose would aid maintainers.
142-175: Keep the pattern consistent and add docstring.
`_sql_update_bleaching_qc_summary` follows the same pattern of delete-then-insert. Consider documenting the rationale for each step or clarifying the reliance on a single transaction context.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/api/utils/summary_cache.py (9 hunks)
🔇 Additional comments (12)
src/api/utils/summary_cache.py (12)
81-105: Validate insert logic against unexpected or invalid data.
Since this function dynamically builds an INSERT statement from an existing query, ensure data columns match. Consider if an upsert mechanism or additional constraints are needed to avoid duplicates.
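If duplicates ever become possible, one option is a Postgres `ON CONFLICT` clause appended to the generated statement (a sketch; assumes a unique constraint on `id`):

```python
# Append an upsert guard to the INSERT built by _insert (hypothetical variant).
insert_sql = _insert(model_cls, src_sql, suggested_citation)
insert_sql = insert_sql.rstrip().rstrip(";") + " ON CONFLICT (id) DO NOTHING;"
```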
107-108: Confirm the unconditional deletion approach.
`_delete` removes all rows for a given `project_id`. If partial deletions or archiving is required later, a more selective method might be necessary.
111-141: Evaluate transaction handling for large updates.
Performing bulk deletes and inserts in a single transaction is fine, but in the event of partial failures, all changes are rolled back. Confirm if this all-or-nothing approach aligns with business requirements.
177-189: Implementation looks consistent.
No immediate concerns with `_sql_update_project_summary_sample_event`. The logic is straightforward.
192-197: No functional issues spotted in `_get_suggested_citation`.
Keeping it as a separate function is fine. You might consider logging when a project is missing to aid debugging.
307-316: Double-check `skip_updates` usage for this path.
Ensure that skipping updates for FISHBELT is indeed intended when the project is in test status or meets other conditions.
319-328: No issues with BenthicLIT data flow.
This call to `_sql_update_cache` maintains a consistent pattern.
331-340: Maintains consistent logic for BenthicPIT.
All arguments align with `_sql_update_cache`. No further changes needed.
343-352: Matches existing pattern for BenthicPhotoQuadratTransect.
No concerns. The code is uniform and clear.
355-358: Ensure bleaching QC summary checks align with skip logic.
Verify that calls to `_sql_update_bleaching_qc_summary` properly handle `skip_test_project` or partial-skip scenarios if needed.
361-370: No issues for HabitatComplexity updates.
The call to `_sql_update_cache` is consistent.
372-372: Final summary sample event update is correct.
This call to `_sql_update_project_summary_sample_event` completes the multi-step update process logically.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
src/api/utils/summary_cache.py (1)
111-140: ⚠️ Potential issue: SQL operations need to handle parameters securely
The implementation of `_sql_update_cache` collects SQL strings but doesn't handle the parameters that should accompany the SQL to prevent injection vulnerabilities. This needs to be updated to work with the parameter-based versions of `_insert` and `_delete`. You should refactor this to collect SQL and parameters separately and then execute them properly:

```diff
 def _sql_update_cache(...):
     suggested_citation = _get_suggested_citation(project_id)
-    sql = []
+    sql_parts = []
+    params = []
     if skip_updates is not True:
-        sql.append(_delete(obs_model, project_id))
-        sql.append(_delete(su_model, project_id))
-        sql.append(_delete(se_model, project_id))
+        delete_sql, delete_params = _delete(obs_model, project_id)
+        sql_parts.append(delete_sql)
+        params.extend(delete_params)
+        # Similar changes for other _delete calls
     # Similar changes for _insert calls
     with connection.cursor() as cur:
-        cur.execute("".join(sql))
+        # Execute with all collected parameters
+        cur.execute("".join(sql_parts), params)
```
🧹 Nitpick comments (1)
src/api/utils/summary_cache.py (1)
192-197: Function for suggested citation seems to be reintroduced
This function `_get_suggested_citation` appears to be a reintroduction after it was possibly removed in a previous version. The function itself looks correct, but as noted in a past review comment, there seems to be inconsistency in how functions are handled: some are deleted and replaced with SQL equivalents while others are kept. For consistency, consider either:
- Keeping both SQL and non-SQL versions of all functions, or
- Replacing all non-SQL functions with SQL equivalents and removing the original functions.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/api/utils/summary_cache.py (10 hunks)
🔇 Additional comments (12)
src/api/utils/summary_cache.py (12)
67-75: Well-structured SQL conversion function
The `_convert_to_sql` function effectively converts a Django queryset to a raw SQL string. The pattern of using `mogrify` to safely handle parameters prevents SQL injection vulnerabilities.
77-78: Useful helper for extracting column names
This utility function provides a clean way to extract column names from a model class, which improves code readability throughout the module.
142-175: Same SQL injection concerns in `_sql_update_bleaching_qc_summary`
This function has the same SQL injection vulnerabilities as the previous one. It needs to be refactored to properly handle parameters. Additionally, there's a variable naming issue in lines 160-165. The variable `colonies_obs_sql` is reused for quadrat benthic percent observations, which could lead to confusion.

```diff
-    colonies_obs_sql = _convert_to_sql(
+    quadrat_obs_sql = _convert_to_sql(
         BleachingQCQuadratBenthicPercentObsSQLModel.objects.all().sql_table(project_id=project_id)
     )
     sql.append(
-        _insert(BleachingQCQuadratBenthicPercentObsModel, colonies_obs_sql, suggested_citation)
+        _insert(BleachingQCQuadratBenthicPercentObsModel, quadrat_obs_sql, suggested_citation)
     )
```
177-190: SQL injection concerns in `_sql_update_project_summary_sample_event`
This function has the same SQL injection vulnerabilities as mentioned in previous comments. The function should be refactored to handle parameters properly. Also, there are a few extra blank lines after line 190, as noted in a previous review comment.
313-322: Appropriate update to use the new SQL function
The change from the previous `_update_cache` to `_sql_update_cache` is correctly implemented here.
325-334: Appropriate update to use the new SQL function
The change from the previous `_update_cache` to `_sql_update_cache` is correctly implemented here as well.
337-346: Appropriate update to use the new SQL function
The change from the previous `_update_cache` to `_sql_update_cache` is correctly implemented here as well.
349-358: Appropriate update to use the new SQL function
The change from the previous `_update_cache` to `_sql_update_cache` is correctly implemented here as well.
361-364: Appropriate update to use the new SQL function
The change from the previous `_update_bleaching_qc_summary` to `_sql_update_bleaching_qc_summary` is correctly implemented.
367-376: Appropriate update to use the new SQL function
The change from the previous `_update_cache` to `_sql_update_cache` is correctly implemented here as well.
378-378: Appropriate update to use the new SQL function
The change from the previous `_update_project_summary_sample_event` to `_sql_update_project_summary_sample_event` is correctly implemented.
284-289: Behavior change in project queue checks
❓ Verification inconclusive
The code now filters projects with `attempts=3` when checking for existing queue entries, which is a change from the previous behavior. Make sure this is intentional and consistent with the expected behavior of the queue system.
🏁 Script executed:
```bash
#!/bin/bash
# Check for occurrences of the attempts filter elsewhere in the codebase
rg -A 2 -B 2 "attempts\s*=\s*3" --glob "*.py" | grep -v "src/api/utils/summary_cache.py"
```

Length of output: 88
Attention: Confirm the New Queue Filtering Criteria
The changes in `src/api/utils/summary_cache.py` (lines 284–289) now filter queue entries with `attempts=3` before deletion. The grep search did not reveal any other instances of filtering by `attempts=3`, so this appears to be an isolated change. However, please verify manually that this new condition is intentional and that it aligns with the overall expected behavior of the queue system.
- Verify that filtering on `attempts=3` is the correct logic for this specific queue behavior.
- Ensure that this change doesn't inadvertently affect other parts of the queue system.
```python
def _delete(model_cls, project_id):
    return f"DELETE FROM {model_cls._meta.db_table} WHERE project_id = '{project_id}';"
```
Potential SQL injection risk in `_delete` function
The `project_id` parameter is directly interpolated into the SQL string, which could lead to SQL injection if `project_id` comes from an untrusted source. Consider using a parameterized query approach.
```diff
-def _delete(model_cls, project_id):
-    return f"DELETE FROM {model_cls._meta.db_table} WHERE project_id = '{project_id}';"
+def _delete(model_cls, project_id):
+    return f"DELETE FROM {model_cls._meta.db_table} WHERE project_id = %s;", [project_id]
```

📝 Committable suggestion
```python
def _delete(model_cls, project_id):
    return f"DELETE FROM {model_cls._meta.db_table} WHERE project_id = %s;", [project_id]
```
```diff
+def _insert(model_cls, sql, suggested_citation):
+    cols = _columns(model_cls)
+    extras = {
+        "created_on": "now()",
+        "suggested_citation": f"'{suggested_citation}'",
+    }
+
-def _delete_existing_records(project_id, target_model_cls):
-    target_model_cls.objects.filter(project_id=project_id).delete()
+    insert_cols = ", ".join(cols)
+
+    # update any columns from cols that are in extras
+    updated_cols = []
+    for col in cols:
+        if col not in extras:
+            updated_cols.append(col)
+        else:
+            updated_cols.append(f"{extras[col]} AS {col}")
+
-def _update_records(records, target_model_cls, created_on, suggested_citation, skip_updates=False):
-    if skip_updates or not records:
-        return
-    idx = 0
-    while True:
-        batch = records[idx : idx + BATCH_SIZE]
-        _set_created_on(created_on, batch)
-        _set_suggested_citation(suggested_citation, batch)
-        if not batch:
-            break
-        target_model_cls.objects.bulk_create(batch, batch_size=BATCH_SIZE)
-        idx += BATCH_SIZE
+    select_cols = ", ".join(updated_cols)
+
+    return f"""
+    INSERT INTO {model_cls._meta.db_table}
+        ({insert_cols})
+    SELECT {select_cols} FROM ({sql}) AS foo;
+    """
```
Consider parameterizing `suggested_citation` to prevent SQL injection
While most of the function looks well-designed, there's a potential SQL injection vulnerability in line 85 where `suggested_citation` is directly interpolated into the SQL string.
Apply this fix to prevent SQL injection:

```diff
-    "suggested_citation": f"'{suggested_citation}'",
+    "suggested_citation": "%(suggested_citation)s",
```

And update the return statement to include the parameter:

```diff
-    return f"""
-    INSERT INTO {model_cls._meta.db_table}
-        ({insert_cols})
-    SELECT {select_cols} FROM ({sql}) AS foo;
-    """
+    return f"""
+    INSERT INTO {model_cls._meta.db_table}
+        ({insert_cols})
+    SELECT {select_cols} FROM ({sql}) AS foo;
+    """, {"suggested_citation": suggested_citation}
```

Then update the calling code to handle the returned tuple of SQL and parameters.
📝 Committable suggestion

```python
def _insert(model_cls, sql, suggested_citation):
    cols = _columns(model_cls)
    extras = {
        "created_on": "now()",
        "suggested_citation": "%(suggested_citation)s",
    }

    insert_cols = ", ".join(cols)

    # update any columns from cols that are in extras
    updated_cols = []
    for col in cols:
        if col not in extras:
            updated_cols.append(col)
        else:
            updated_cols.append(f"{extras[col]} AS {col}")

    select_cols = ", ".join(updated_cols)

    return f"""
    INSERT INTO {model_cls._meta.db_table}
        ({insert_cols})
    SELECT {select_cols} FROM ({sql}) AS foo;
    """, {"suggested_citation": suggested_citation}
```
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/api/utils/summary_cache.py (1)
109-111: ⚠️ Potential issue: Unresolved SQL-injection risk in `_delete()`
The project-id is still interpolated directly into the SQL string, leaving the door open for injection if an untrusted id ever reaches this path. A very similar issue was pointed out in previous reviews but has not been addressed. Switch to a parametrised statement so the driver handles quoting for you.

```diff
-def _delete(model_cls, project_id):
-    return f"DELETE FROM {model_cls._meta.db_table} WHERE project_id = '{project_id}';"
+def _delete(model_cls, project_id):
+    # Return a tuple so the caller can pass sql & params straight to `execute`
+    return (
+        f"DELETE FROM {model_cls._meta.db_table} WHERE project_id = %s;",
+        [project_id],
+    )
```
🧹 Nitpick comments (2)
src/api/utils/summary_cache.py (2)
140-142: Execute statements individually or via `sql.split(';')` for clarity
Concatenating many statements into one string obscures error reporting (Postgres returns the first error only) and makes it harder to log/inspect what went wrong. Iterating and `execute()`-ing each statement (or using `executemany` where possible) improves observability and keeps the transaction semantics identical.
286-291: Only deleting rows with `attempts = 3` may leave stuck queue items
If an entry failed twice it will never be purged, so a subsequent call will raise a unique-constraint conflict and silently skip queuing. Consider `attempts__gte=3` (or resetting `attempts` to 0) to guarantee a clean slate.
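The relaxed filter would be a one-line change (a sketch; the queue model name here is hypothetical):

```python
# Purge any entry that has exhausted its retries, not just attempts == 3.
SummaryCacheQueue.objects.filter(project_id=project_id, attempts__gte=3).delete()
```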
| "created_on": "now()", | ||
| "suggested_citation": f"'{escaped_suggested_citation}'", | ||
| } | ||
|
|
||
| def _delete_existing_records(project_id, target_model_cls): | ||
| target_model_cls.objects.filter(project_id=project_id).delete() | ||
| insert_cols = ", ".join(cols) | ||
|
|
||
| # update any columns from cols that are in extras | ||
| updated_cols = [] | ||
| for col in cols: | ||
| if col not in extras: | ||
| updated_cols.append(col) | ||
| else: | ||
| updated_cols.append(f"{extras[col]} AS {col}") | ||
|
|
||
| def _update_records(records, target_model_cls, created_on, suggested_citation, skip_updates=False): | ||
| if skip_updates or not records: | ||
| return | ||
| idx = 0 | ||
| while True: | ||
| batch = records[idx : idx + BATCH_SIZE] | ||
| _set_created_on(created_on, batch) | ||
| _set_suggested_citation(suggested_citation, batch) | ||
| if not batch: | ||
| break | ||
| target_model_cls.objects.bulk_create(batch, batch_size=BATCH_SIZE) | ||
| idx += BATCH_SIZE | ||
| select_cols = ", ".join(updated_cols) |
🛠️ Refactor suggestion
Column-order coupling is brittle – derive SELECT list from `_columns()`
`_insert()` assumes that the column order produced by `model_sql.objects.all().sql_table()` exactly matches `model_cls._meta.fields`. If a field is reordered, added, or the SQL view changes, the INSERT … SELECT will silently mismatch columns, producing corrupt data.
Safer pattern:
- Build the SELECT list directly from `cols` (using `foo.column_name`).
- Append/overwrite extras in that same loop.
- Avoid `SELECT *`.
That guarantees positional alignment.
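A sketch of that safer pattern, reusing the `cols` and `extras` names from the surrounding function:

```python
# Qualify every column against the subquery alias so the INSERT and SELECT
# lists stay positionally aligned even if the source view changes.
updated_cols = [
    f"{extras[col]} AS {col}" if col in extras else f"foo.{col}"
    for col in cols
]
select_cols = ", ".join(updated_cols)
```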
```python
escaped_suggested_citation = adapt(suggested_citation).getquoted().decode("utf-8").strip("'")
cols = _columns(model_cls)
extras = {
    "created_on": "now()",
    "suggested_citation": f"'{escaped_suggested_citation}'",
}
```
`suggested_citation=None` becomes the string `'NULL'`
`adapt(None).getquoted()` returns the literal `NULL` (without quotes). After `strip("'")` you end up with the four-letter word `NULL`, which is then wrapped in quotes, storing the text `'NULL'` instead of an actual SQL NULL.
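The behaviour can be reproduced directly with psycopg2 (expected values shown as comments):

```python
from psycopg2.extensions import adapt

adapt("O'Neil 2024").getquoted()                    # b"'O''Neil 2024'" (safely quoted)
adapt(None).getquoted()                             # b'NULL' (no surrounding quotes)
adapt(None).getquoted().decode("utf-8").strip("'")  # 'NULL' (now a plain 4-char string)
```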
Add an explicit check for None (and possibly blank strings) so the column remains nullable.
```diff
-escaped_suggested_citation = adapt(suggested_citation).getquoted().decode("utf-8").strip("'")
-extras = {
-    "created_on": "now()",
-    "suggested_citation": f"'{escaped_suggested_citation}'",
-}
+if suggested_citation is None:
+    sc_value = "NULL"
+else:
+    sc_value = adapt(suggested_citation).getquoted().decode("utf-8")
+
+extras = {
+    "created_on": "now()",
+    "suggested_citation": sc_value,
+}
```

📝 Committable suggestion
```python
# Handle NULL specially so we don't end up storing the literal string 'NULL'
if suggested_citation is None:
    sc_value = "NULL"
else:
    # adapt(...).getquoted() already includes surrounding quotes for strings
    sc_value = adapt(suggested_citation).getquoted().decode("utf-8")

cols = _columns(model_cls)
extras = {
    "created_on": "now()",
    "suggested_citation": sc_value,
}
```
Summary by CodeRabbit
- New Features
- Improvements
- Technical Updates