feat: add build_table_metrics_query method and improve fetch_sample_value #343
Conversation
…alues_from_database for MSSQL and Postgres
PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Compliance status legend: 🟢 Fully Compliant · 🟡 Partially Compliant · 🔴 Not Compliant · ⚪ Requires Further Human Verification · 🏷️ Compliance label
Walkthrough

Adds MSSQL data-source utilities.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant MSSQL as MSSQL DataSource
    participant DB as Database
    participant Norm as Normalizer
    rect rgb(230,245,255)
        Note over Caller,MSSQL: build_table_metrics_query
        Caller->>MSSQL: build_table_metrics_query(table, column_info, additional_queries)
        MSSQL->>MSSQL: Build aggregated SELECT (per-column metrics + extras)
        MSSQL->>DB: Execute aggregated SELECT
        DB-->>MSSQL: Single aggregated row
        MSSQL->>Norm: Normalize Decimal/UUID/datetime -> JSON types
        Norm-->>MSSQL: Normalized values
        MSSQL-->>Caller: List[dict] per-column metrics
    end
    rect rgb(245,230,245)
        Note over Caller,MSSQL: fetch_sample_values_from_database
        Caller->>MSSQL: fetch_sample_values_from_database(table, columns, limit)
        MSSQL->>DB: Execute TOP N SELECT (or SELECT * for "*")
        DB-->>MSSQL: Rows + column metadata
        MSSQL-->>Caller: (rows, column_names)
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ❌ Failed checks (2 warnings), ✅ Passed checks (1 passed)
PR Code Suggestions ✨

Explore these optional code suggestions:
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
dcs_core/integrations/databases/postgres.py (2)
291-319: Update docstring to reflect the changed return signature.

The return type changed from `List[Tuple]` to `Tuple[List[Tuple], List[str]]`, but the docstring still states `:return: List of row tuples.` This is a breaking API change for existing callers.

Apply this diff to fix the docstring:

```diff
 def fetch_sample_values_from_database(
     self,
     table_name: str,
     column_names: list[str],
     limit: int = 5,
-) -> List[Tuple]:
+) -> Tuple[List[Tuple], List[str]]:
     """
     Fetch sample rows for specific columns from the given table.

     :param table_name: The name of the table.
     :param column_names: List of column names to fetch.
     :param limit: Number of rows to fetch.
-    :return: List of row tuples.
+    :return: Tuple of (list of row tuples, list of column names).
     """
```
338-340: Fix incorrect duplicate count formula.

The current formula `COUNT(*) - COUNT(DISTINCT column)` incorrectly counts NULL values as duplicates, because `COUNT(*)` includes NULLs while `COUNT(DISTINCT column)` excludes them.

For a column with 10 total rows, 2 NULLs, and 5 distinct non-null values:

- Current formula: 10 - 5 = 5 (incorrect, counts NULLs)
- Correct formula: 8 - 5 = 3 (actual duplicates)

Apply this diff:

```diff
 query_parts.append(
-    f'COUNT(*) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
+    f'COUNT({self.quote_column(name)}) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
 )
```
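The NULL effect is easy to reproduce outside the project. A minimal sqlite3 sketch (hypothetical table `t`, not project code) showing both formulas side by side:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col INTEGER)")
# 10 rows: 8 non-null (5 distinct values), 2 NULLs
conn.executemany(
    "INSERT INTO t (col) VALUES (?)",
    [(v,) for v in [1, 1, 1, 2, 2, 3, 4, 5, None, None]],
)

row = conn.execute(
    "SELECT COUNT(*) - COUNT(DISTINCT col), "
    "COUNT(col) - COUNT(DISTINCT col) FROM t"
).fetchone()
print(row)  # (5, 3): COUNT(*) counts the two NULL rows as "duplicates"
```

The same NULL-handling rules apply in Postgres and SQL Server, so the discrepancy carries over directly.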
🧹 Nitpick comments (3)
dcs_core/integrations/databases/mssql.py (3)
693-786: Consider extracting shared logic to reduce duplication.

The `build_table_metrics_query` implementation is nearly identical to the PostgreSQL version (lines 321-418 in postgres.py), differing only in:

- Quoting style (`[column]` vs `"column"`)
- SQL dialect details (`LEN` vs `CHAR_LENGTH`, `CAST(... AS FLOAT)` vs direct `AVG`)

Consider extracting the common structure (query building, normalization logic) into a base class method with dialect-specific overrides for SQL generation.
755-755: Add `strict=True` to `zip()` for safety.

Without `strict=True`, mismatched lengths between `columns` and `result_row` will silently truncate data, potentially masking bugs.

Apply this diff:

```diff
-row = dict(zip(columns, result_row))
+row = dict(zip(columns, result_row, strict=True))
```
788-818: Consider a shared interface for fetch_sample_values_from_database.

This method duplicates the structure and logic from postgres.py (lines 291-319), differing only in execution mechanism (pyodbc cursor vs SQLAlchemy `text()`). As with `build_table_metrics_query`, extracting the common validation and structure to a base class would improve maintainability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- dcs_core/integrations/databases/mssql.py (2 hunks)
- dcs_core/integrations/databases/postgres.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
dcs_core/integrations/databases/mssql.py (1)
dcs_core/integrations/databases/postgres.py (5)
- build_table_metrics_query (321-418)
- quote_column (69-75)
- qualified_table_name (59-67)
- _normalize_metrics (376-405)
- fetch_sample_values_from_database (291-319)
dcs_core/integrations/databases/postgres.py (1)
dcs_core/core/datasource/sql_datasource.py (2)
- quote_column (148-154)
- fetchall (117-120)
🪛 Ruff (0.14.1)
dcs_core/integrations/databases/mssql.py
755-755: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
805-805: Avoid specifying long messages outside the exception class
(TRY003)
808-808: Possible SQL injection vector through string-based query construction
(S608)
811-811: Possible SQL injection vector through string-based query construction
(S608)
dcs_core/integrations/databases/postgres.py
311-311: Possible SQL injection vector through string-based query construction
(S608)
314-314: Possible SQL injection vector through string-based query construction
(S608)
🔇 Additional comments (2)
dcs_core/integrations/databases/postgres.py (1)
321-418: Well-structured metrics computation with comprehensive type handling.

The method provides a clean interface for computing per-column metrics with proper type normalization for JSON serialization. The `_normalize_metrics` helper correctly handles Decimal→float, datetime→ISO string, and UUID→string conversions.

dcs_core/integrations/databases/mssql.py (1)

788-818: Implementation correctly mirrors the PostgreSQL version.

The method properly uses the pyodbc cursor API and maintains consistent behavior with the PostgreSQL implementation. The qualified table name and column quoting provide adequate protection against SQL injection despite the static analysis warnings.
Actionable comments posted: 5
🧹 Nitpick comments (2)
dcs_core/integrations/databases/mssql.py (2)
689-692: Consider using timezone-aware datetime.

The datetime usage now correctly matches the module import. However, `datetime.datetime.utcnow()` is deprecated in Python 3.12+ in favor of `datetime.datetime.now(datetime.timezone.utc)`.

Consider this update for future-proofing:

```diff
 if isinstance(updated_time, str):
     updated_time = datetime.datetime.strptime(
         updated_time, "%Y-%m-%d %H:%M:%S.%f"
     )
-return int((datetime.datetime.utcnow() - updated_time).total_seconds())
+return int((datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None) - updated_time).total_seconds())
```
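For reference, a small sketch of the naive-vs-aware replacement (variable names here are illustrative, not from the PR):

```python
import datetime

# Deprecated in 3.12+: returns a naive datetime in UTC
naive = datetime.datetime.utcnow()

# Preferred: an aware timestamp, stripped back to naive only because the
# surrounding code subtracts naive datetimes parsed from MSSQL strings.
aware = datetime.datetime.now(datetime.timezone.utc)
naive_equiv = aware.replace(tzinfo=None)

print(naive_equiv.tzinfo is None)  # True
```

Keeping the subtraction between two naive UTC values preserves the existing behavior while moving off the deprecated call.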
757-757: Add `strict=True` to `zip()` for safety.

Without `strict=True`, `zip()` will silently truncate if `columns` and `result_row` have mismatched lengths, potentially hiding data corruption.

```diff
-row = dict(zip(columns, result_row))
+row = dict(zip(columns, result_row, strict=True))
```

Note: the `strict` parameter requires Python 3.10+. If supporting older versions, verify lengths manually.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- dcs_core/integrations/databases/mssql.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/mssql.py (2)
dcs_core/integrations/databases/postgres.py (5)
- build_table_metrics_query (321-418)
- quote_column (69-75)
- qualified_table_name (59-67)
- _normalize_metrics (376-405)
- fetch_sample_values_from_database (291-319)

dcs_core/core/datasource/sql_datasource.py (4)

- quote_column (148-154)
- qualified_table_name (130-138)
- fetchone (122-125)
- fetchall (117-120)
🪛 Ruff (0.14.1)
dcs_core/integrations/databases/mssql.py
757-757: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
807-807: Avoid specifying long messages outside the exception class
(TRY003)
810-810: Possible SQL injection vector through string-based query construction
(S608)
813-813: Possible SQL injection vector through string-based query construction
(S608)
🔇 Additional comments (5)
dcs_core/integrations/databases/mssql.py (5)
15-22: LGTM: Import fixes address previous issues.

The change to `import datetime` (line 15) correctly resolves the past review concern about datetime usage at lines 689-692 and 767. The additional imports (Decimal, UUID, text) are necessary for the new metric normalization and query execution functionality.
695-704: LGTM: Method signature and early validation.

The method signature matches the Postgres implementation, and the early return for empty `column_info` prevents unnecessary query execution.
759-775: LGTM: Value normalization helper.

The `_normalize_metrics` helper correctly handles Decimal, datetime, UUID, and recursive structures for JSON serialization. The datetime module usage (line 767) is correct.
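As a rough illustration of the conversion rules described above (a standalone sketch, not the actual `_normalize_metrics` implementation; the function name and sample data are hypothetical):

```python
import datetime
import json
import uuid
from decimal import Decimal

def normalize(value):
    """Convert DB-native values into JSON-serializable equivalents."""
    if isinstance(value, Decimal):
        return float(value)
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    if isinstance(value, uuid.UUID):
        return str(value)
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize(v) for v in value]
    return value

metrics = {"average": Decimal("3.5"), "seen_at": datetime.date(2024, 1, 2)}
print(json.dumps(normalize(metrics)))  # {"average": 3.5, "seen_at": "2024-01-02"}
```

Without a step like this, `json.dumps` raises `TypeError` on Decimal, datetime, and UUID values straight from the driver.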
777-788: LGTM: Column-wise metric transformation.

The logic correctly groups metrics by column name prefix and constructs the expected output structure matching the Postgres implementation.
790-808: LGTM: Method signature and validation.

The method signature matches the Postgres implementation, returning both rows and column names. Input validation correctly prevents empty column lists.
```python
for col in column_info:
    name = col["column_name"]
    dtype = col["data_type"].lower()

    quoted_name = self.quote_column(name)

    query_parts.append(f"COUNT(DISTINCT {quoted_name}) AS [{name}_distinct]")
    query_parts.append(
        f"COUNT({quoted_name}) - COUNT(DISTINCT {quoted_name}) AS [{name}_duplicate]"
    )
    query_parts.append(
        f"SUM(CASE WHEN {quoted_name} IS NULL THEN 1 ELSE 0 END) AS [{name}_is_null]"
    )

    if dtype in (
        "int",
        "integer",
        "bigint",
        "smallint",
        "tinyint",
        "decimal",
        "numeric",
        "float",
        "real",
        "money",
        "smallmoney",
    ):
        query_parts.append(f"MIN({quoted_name}) AS [{name}_min]")
        query_parts.append(f"MAX({quoted_name}) AS [{name}_max]")
        query_parts.append(
            f"AVG(CAST({quoted_name} AS FLOAT)) AS [{name}_average]"
        )

    elif dtype in ("varchar", "nvarchar", "char", "nchar", "text", "ntext"):
        query_parts.append(
            f"MAX(LEN({quoted_name})) AS [{name}_max_character_length]"
        )
```
🧩 Analysis chain
Verify duplicate count logic: excludes NULLs.
Line 713 uses COUNT(quoted_name) - COUNT(DISTINCT quoted_name) which excludes NULL values from the duplicate count. The Postgres implementation (line 341 in relevant snippets) uses COUNT(*) - COUNT(DISTINCT quoted_name), which includes the row count regardless of NULLs.
This difference means MSSQL will report fewer duplicates when NULL values exist. Ensure this behavior is intentional.
Run this script to verify the duplicate count behavior difference:
🏁 Script executed:

```bash
#!/bin/bash
# Description: Check if other datasources use COUNT(*) or COUNT(column) for duplicate calculation
# Search for duplicate count patterns in other database integrations
rg -nP -A2 -B2 'duplicate.*COUNT' --type=py -g 'dcs_core/integrations/databases/*.py'
```

Length of output: 0
🏁 Script executed:

```bash
#!/bin/bash
# Search for duplicate count logic in database integration files
# Check both the MSSQL and Postgres implementations
echo "=== MSSQL Implementation (lines 705-741) ==="
sed -n '705,741p' dcs_core/integrations/databases/mssql.py
echo -e "\n=== Postgres Implementation (checking around line 341) ==="
sed -n '330,350p' dcs_core/integrations/databases/postgres.py
echo -e "\n=== Search for COUNT patterns in all database files ==="
rg -n 'COUNT\(' dcs_core/integrations/databases/ --type=py -B1 -A1 | grep -i duplicate
```

Length of output: 2500
🏁 Script executed:

```bash
#!/bin/bash
# Find all database integration files and check their duplicate count implementations
echo "=== Database integration files available ==="
fd --type f -e py 'dcs_core/integrations/databases/' | sort
echo -e "\n=== Duplicate count implementations across all databases ==="
for file in dcs_core/integrations/databases/*.py; do
  if [ -f "$file" ]; then
    echo "File: $(basename $file)"
    grep -n "duplicate" "$file" | head -5
  fi
done
```

Length of output: 983
🏁 Script executed:

```bash
#!/bin/bash
# Find all database integration files and check their duplicate count implementations
echo "=== Database integration files available ==="
fd --type f -e py 'dcs_core/integrations/databases/*.py' 2>/dev/null | sort
echo -e "\n=== Duplicate count logic across all database integrations ==="
for file in dcs_core/integrations/databases/*.py; do
  if [ -f "$file" ]; then
    basename=$(basename "$file")
    echo "=== $basename ==="
    grep -n "duplicate" "$file" -i -A1 | grep -E "(COUNT|duplicate)" | head -3
  fi
done
```

Length of output: 579
Inconsistent duplicate count logic between MSSQL and Postgres: MSSQL excludes NULLs.
Line 713 in MSSQL uses COUNT({quoted_name}) - COUNT(DISTINCT {quoted_name}), which excludes NULL values from the duplicate count. Postgres (line 339) uses COUNT(*) - COUNT(DISTINCT {quoted_name}), which includes all rows regardless of NULLs.
When NULL values exist in a column, MSSQL will report fewer duplicates than Postgres for the same data. This inconsistency should either be aligned across both implementations or explicitly documented as intentional behavior.
🤖 Prompt for AI Agents
dcs_core/integrations/databases/mssql.py around lines 705 to 741: the
duplicate-count expression uses COUNT(column) - COUNT(DISTINCT column) which
excludes NULLs and therefore differs from Postgres' COUNT(*) - COUNT(DISTINCT
column) behavior; to fix, change the MSSQL expression to COUNT(*) -
COUNT(DISTINCT {quoted_name}) so NULLs are treated the same as Postgres (or, if
excluding NULLs is desired, make the Postgres implementation mirror this by
using COUNT({quoted_name}) instead), and update any relevant comments/tests to
reflect the chosen consistent behavior.
```python
if additional_queries:
    query_parts.extend(additional_queries)
```
Critical: SQL injection risk in additional_queries parameter.

The `additional_queries` list is directly extended into the query without validation or parameterization. This allows arbitrary SQL injection if the caller passes malicious queries.

Either:

- Document that `additional_queries` must be pre-validated/sanitized by the caller, or
- Add validation to ensure queries are safe, or
- Remove this parameter if not strictly necessary.

Example mitigation (documentation approach):
```diff
 def build_table_metrics_query(
     self,
     table_name: str,
     column_info: list[dict],
     additional_queries: Optional[List[str]] = None,
 ) -> list[dict]:
+    """
+    Build and execute a query to compute column-level metrics.
+
+    :param table_name: Name of the table
+    :param column_info: List of column metadata dicts with 'column_name' and 'data_type'
+    :param additional_queries: Pre-validated SQL fragments to extend SELECT list.
+        WARNING: Must be sanitized by caller to prevent SQL injection.
+    :return: List of dicts with column_name and metrics
+    """
     query_parts = []
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In dcs_core/integrations/databases/mssql.py around lines 743-744, the code
extends query_parts with additional_queries causing an SQL injection risk; to
fix, either remove the additional_queries parameter or enforce/validate it:
update the function to reject raw SQL strings by default (raise an error if
additional_queries is provided), or require callers to supply only parameterized
clause templates plus a corresponding parameters list and merge them safely, or
implement strict validation (whitelist allowed statement types/patterns or a
regex that only allows safe clause fragments like "AND <col> = %s") and raise
ValueError on failure; also update the function docstring to document the new
contract (what is accepted and that callers must provide parameters, not raw
SQL).
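One possible whitelist-style validation, as the prompt suggests. This is a hypothetical sketch: the function name, the regex, and the allowed aggregate set are assumptions, not project code, and the pattern would need tuning for real-world fragments:

```python
import re

# Allow only simple aggregate expressions aliased into the SELECT list,
# e.g. "COUNT(*) AS total_rows"; reject everything else.
_SAFE_FRAGMENT = re.compile(
    r"^(COUNT|SUM|MIN|MAX|AVG)\([\w\[\]\". *]+\)\s+AS\s+[\w\[\]\"]+$",
    re.IGNORECASE,
)

def validate_additional_queries(fragments):
    """Raise ValueError on any fragment that is not a safe aggregate alias."""
    for frag in fragments:
        if not _SAFE_FRAGMENT.match(frag.strip()):
            raise ValueError(f"Unsafe query fragment: {frag!r}")
    return fragments

validate_additional_queries(["COUNT(*) AS total_rows"])  # accepted
```

A fragment like `"1; DROP TABLE users --"` fails the match and raises, which turns a silent injection vector into a loud error at call time.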
```python
if len(column_names) == 1 and column_names[0] == "*":
    query = f"SELECT TOP {limit} * FROM {qualified_table_name}"
else:
    columns = ", ".join([self.quote_column(col) for col in column_names])
    query = f"SELECT TOP {limit} {columns} FROM {qualified_table_name}"
```
Validate limit parameter to prevent SQL injection.

While `limit` is typed as `int`, it's directly interpolated into the SQL string (lines 810, 813). If called with a malicious value through dynamic typing or type confusion, this could allow SQL injection.

Add explicit validation:

```diff
 if not column_names:
     raise ValueError("At least one column name must be provided")
+if not isinstance(limit, int) or limit < 1:
+    raise ValueError("Limit must be a positive integer")
+
 if len(column_names) == 1 and column_names[0] == "*":
     query = f"SELECT TOP {limit} * FROM {qualified_table_name}"
```
query = f"SELECT TOP {limit} * FROM {qualified_table_name}"🧰 Tools
🪛 Ruff (0.14.1)
810-810: Possible SQL injection vector through string-based query construction
(S608)
813-813: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
In dcs_core/integrations/databases/mssql.py around lines 809 to 814, the TOP
{limit} interpolation uses limit directly which can lead to SQL injection if a
non-integer or malicious value is passed; validate and sanitize limit before
formatting the query by coercing it to an integer inside a try/except (or
checking isinstance), enforcing a positive range (e.g. 1..MAX_LIMIT) and raising
a ValueError on invalid values, or clamping to allowed bounds; do this
validation once before the if/else block so the constructed query only ever uses
a safely validated integer.
Actionable comments posted: 0
♻️ Duplicate comments (3)
dcs_core/integrations/databases/mssql.py (3)
713-713: Duplicate count logic still inconsistent with Postgres.

This uses `COUNT({quoted_name})`, which excludes NULL values, while Postgres uses `COUNT(*)` (line 339 in postgres.py). This means MSSQL and Postgres will report different duplicate counts when NULL values are present in the column.

For consistency across database implementations, align the duplicate count calculation. Either:

- Change MSSQL to use `COUNT(*) - COUNT(DISTINCT {quoted_name})`, or
- Update Postgres to match MSSQL's behavior and document the decision.
743-744: SQL injection risk: additional_queries parameter is not validated.

The `additional_queries` list is directly extended into the query without any validation or sanitization. If a caller passes malicious SQL fragments, they will be executed directly.

Either:

- Remove the parameter if not strictly necessary, or
- Document that callers MUST pre-validate/sanitize these queries, or
- Implement validation to reject unsafe SQL patterns.
809-813: Validate limit parameter to prevent potential SQL injection.

While `limit` is typed as `int`, it's directly interpolated into the SQL string. In dynamic typing scenarios or if the type hint is bypassed, this could allow SQL injection.

Add explicit validation before constructing the query. Apply this diff:

```diff
 if not column_names:
     raise ValueError("At least one column name must be provided")
+if not isinstance(limit, int) or limit < 1:
+    raise ValueError("limit must be a positive integer")
+
 if len(column_names) == 1 and column_names[0] == "*":
     query = f"SELECT TOP {limit} * FROM {qualified_table_name}"
```
🧹 Nitpick comments (1)
dcs_core/integrations/databases/mssql.py (1)
757-757: Consider adding explicitstrict=parameter tozip().Python 3.10+ supports
strict=Trueinzip()to ensure both iterables have the same length. This prevents silent bugs ifcolumnsandresult_rowhave mismatched lengths.Apply this diff:
- row = dict(zip(columns, result_row)) + row = dict(zip(columns, result_row, strict=True))Note: Only apply this if the minimum Python version is 3.10+. If supporting Python 3.9 or earlier, keep the current code.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- dcs_core/integrations/databases/mssql.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/mssql.py (2)
dcs_core/integrations/databases/postgres.py (5)
- build_table_metrics_query (321-418)
- quote_column (69-75)
- qualified_table_name (59-67)
- _normalize_metrics (376-405)
- fetch_sample_values_from_database (291-319)

dcs_core/core/datasource/sql_datasource.py (3)

- quote_column (148-154)
- qualified_table_name (130-138)
- fetchall (117-120)
🪛 Ruff (0.14.1)
dcs_core/integrations/databases/mssql.py
757-757: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
807-807: Avoid specifying long messages outside the exception class
(TRY003)
810-810: Possible SQL injection vector through string-based query construction
(S608)
813-813: Possible SQL injection vector through string-based query construction
(S608)
🔇 Additional comments (5)
dcs_core/integrations/databases/mssql.py (5)
15-18: LGTM: Import changes are correct and consistent.

The datetime module import and additional imports (Decimal, UUID, text) are properly used throughout the new methods. The datetime module import correctly supports the qualified usage (datetime.datetime, datetime.date) in lines 689-692 and 767.

Also applies to: 22-22

689-692: LGTM: datetime usage correctly updated.

The datetime module is now properly qualified as `datetime.datetime` for both `strptime` and `utcnow()` calls, consistent with the module-level import.
759-775: LGTM: _normalize_metrics helper is well-implemented.

The normalization helper properly handles Decimal, datetime/date, UUID, and nested structures for JSON serialization. The logic is defensive and handles all expected database types correctly.
815-821: LGTM: Cursor resource management is properly implemented.

The try-finally block ensures the cursor is always closed, even if an exception occurs during query execution or result fetching.
790-822: Well-structured method with proper wildcard and return type support.

The method correctly:
- Validates input parameters
- Handles wildcard column selection ("*")
- Extracts column names from query metadata
- Returns a tuple matching the Postgres implementation signature
This ensures consistent behavior across database implementations.
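The validation-plus-wildcard flow above can be sketched as pure query construction (a hypothetical helper; the real method also executes the query through the pyodbc cursor and returns rows plus column names):

```python
def build_sample_query(qualified_table_name, column_names, limit=5,
                       quote=lambda c: f"[{c}]"):
    """Build a TOP-N sample query with wildcard support (MSSQL-style quoting)."""
    if not column_names:
        raise ValueError("At least one column name must be provided")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    if len(column_names) == 1 and column_names[0] == "*":
        return f"SELECT TOP {limit} * FROM {qualified_table_name}"
    cols = ", ".join(quote(c) for c in column_names)
    return f"SELECT TOP {limit} {cols} FROM {qualified_table_name}"

print(build_sample_query("[dbo].[orders]", ["id", "amount"], limit=3))
# SELECT TOP 3 [id], [amount] FROM [dbo].[orders]
```

Separating query construction from execution like this also makes the wildcard and validation branches trivially unit-testable without a live database.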
User description
…alues_from_database for MSSQL and Postgres
Fixes/Implements
Description
Summary Goes here.
Type of change
Delete irrelevant options.
How Has This Been Tested?
PR Type
Enhancement
Description
- Add `build_table_metrics_query` method to MSSQL for computing table metrics
- Implement `fetch_sample_values_from_database` in MSSQL with wildcard support
- Enhance Postgres `fetch_sample_values_from_database` to return column names
- Add value normalization for JSON serialization of database metrics
Diagram Walkthrough
File Walkthrough
mssql.py: Add metrics query builder and sample value fetching (dcs_core/integrations/databases/mssql.py)

- Import `Decimal`, `UUID`, and `text` from sqlalchemy
- Add `build_table_metrics_query` method to compute column-level metrics (distinct count, duplicates, nulls, min, max, average for numeric types, max length for string types)
- Add `_normalize_metrics` helper function to safely convert database values (Decimal, UUID, datetime) to JSON-serializable types
- Add `fetch_sample_values_from_database` method with support for wildcard column selection and proper cursor management
postgres.py: Enhance sample value fetching with column names (dcs_core/integrations/databases/postgres.py)

- Update `fetch_sample_values_from_database` to support selecting all columns with `*`

Summary by CodeRabbit
New Features
Behavior Change