feat: add build_table_metrics_query method and improve fetch_sample_value #343
Conversation
…alues_from_database for MSSQL and Postgres
PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Compliance status legend: 🟢 Fully Compliant · 🟡 Partially Compliant · 🔴 Not Compliant · ⚪ Requires Further Human Verification · 🏷️ Compliance label
Walkthrough

Adds MSSQL data-source utilities.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant MSSQL as MSSQL DataSource
    participant DB as Database
    participant Norm as Normalizer
    rect rgb(230,245,255)
        Note over Caller,MSSQL: build_table_metrics_query
        Caller->>MSSQL: build_table_metrics_query(table, column_info, additional_queries)
        MSSQL->>MSSQL: Build aggregated SELECT (per-column metrics + extras)
        MSSQL->>DB: Execute aggregated SELECT
        DB-->>MSSQL: Single aggregated row
        MSSQL->>Norm: Normalize Decimal/UUID/datetime -> JSON types
        Norm-->>MSSQL: Normalized values
        MSSQL-->>Caller: List[dict] per-column metrics
    end
    rect rgb(245,230,245)
        Note over Caller,MSSQL: fetch_sample_values_from_database
        Caller->>MSSQL: fetch_sample_values_from_database(table, columns, limit)
        MSSQL->>DB: Execute TOP N SELECT (or SELECT * for "*")
        DB-->>MSSQL: Rows + column metadata
        MSSQL-->>Caller: (rows, column_names)
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ❌ Failed checks (2 warnings), ✅ Passed checks (1 passed)
PR Code Suggestions ✨

Explore these optional code suggestions:
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
dcs_core/integrations/databases/postgres.py (2)
291-319: Update docstring to reflect the changed return signature.

The return type changed from `List[Tuple]` to `Tuple[List[Tuple], List[str]]`, but the docstring still states `:return: List of row tuples.` This is a breaking API change for existing callers.

Apply this diff to fix the docstring:

```diff
 def fetch_sample_values_from_database(
     self,
     table_name: str,
     column_names: list[str],
     limit: int = 5,
-) -> List[Tuple]:
+) -> Tuple[List[Tuple], List[str]]:
     """
     Fetch sample rows for specific columns from the given table.

     :param table_name: The name of the table.
     :param column_names: List of column names to fetch.
     :param limit: Number of rows to fetch.
-    :return: List of row tuples.
+    :return: Tuple of (list of row tuples, list of column names).
     """
```
338-340: Fix incorrect duplicate count formula.

The current formula `COUNT(*) - COUNT(DISTINCT column)` incorrectly counts NULL values as duplicates, because `COUNT(*)` includes NULLs while `COUNT(DISTINCT column)` excludes them.

For a column with 10 total rows, 2 NULLs, and 5 distinct non-null values:

- Current formula: 10 - 5 = 5 (incorrect, counts NULLs)
- Correct formula: 8 - 5 = 3 (actual duplicates)

Apply this diff:

```diff
 query_parts.append(
-    f'COUNT(*) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
+    f'COUNT({self.quote_column(name)}) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
 )
```
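The NULL effect is easy to reproduce outside the project. A minimal sqlite3 sketch (hypothetical table `t`, not project code) showing both formulas side by side:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col INTEGER)")
# 10 rows: 8 non-null (5 distinct values), 2 NULLs
conn.executemany(
    "INSERT INTO t (col) VALUES (?)",
    [(v,) for v in [1, 1, 1, 2, 2, 3, 4, 5, None, None]],
)

row = conn.execute(
    "SELECT COUNT(*) - COUNT(DISTINCT col), "
    "COUNT(col) - COUNT(DISTINCT col) FROM t"
).fetchone()
print(row)  # (5, 3): COUNT(*) counts the two NULL rows as "duplicates"
```

The same NULL-handling rules apply in Postgres and SQL Server, so the discrepancy carries over directly.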
🧹 Nitpick comments (3)
dcs_core/integrations/databases/mssql.py (3)
693-786: Consider extracting shared logic to reduce duplication.

The `build_table_metrics_query` implementation is nearly identical to the PostgreSQL version (lines 321-418 in postgres.py), differing only in:

- Quoting style (`[column]` vs `"column"`)
- SQL dialect details (`LEN` vs `CHAR_LENGTH`, `CAST(... AS FLOAT)` vs direct `AVG`)

Consider extracting the common structure (query building, normalization logic) into a base class method with dialect-specific overrides for SQL generation.
755-755: Add `strict=True` to `zip()` for safety.

Without `strict=True`, mismatched lengths between `columns` and `result_row` will silently truncate data, potentially masking bugs.

Apply this diff:

```diff
-row = dict(zip(columns, result_row))
+row = dict(zip(columns, result_row, strict=True))
```
788-818: Consider a shared interface for fetch_sample_values_from_database.

This method duplicates the structure and logic from postgres.py (lines 291-319), differing only in execution mechanism (pyodbc cursor vs SQLAlchemy `text()`). As with `build_table_metrics_query`, extracting the common validation and structure to a base class would improve maintainability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- dcs_core/integrations/databases/mssql.py (2 hunks)
- dcs_core/integrations/databases/postgres.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
dcs_core/integrations/databases/mssql.py (1)
dcs_core/integrations/databases/postgres.py (5)
- build_table_metrics_query (321-418)
- quote_column (69-75)
- qualified_table_name (59-67)
- _normalize_metrics (376-405)
- fetch_sample_values_from_database (291-319)
dcs_core/integrations/databases/postgres.py (1)
dcs_core/core/datasource/sql_datasource.py (2)
- quote_column (148-154)
- fetchall (117-120)
🪛 Ruff (0.14.1)
dcs_core/integrations/databases/mssql.py
755-755: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
805-805: Avoid specifying long messages outside the exception class
(TRY003)
808-808: Possible SQL injection vector through string-based query construction
(S608)
811-811: Possible SQL injection vector through string-based query construction
(S608)
dcs_core/integrations/databases/postgres.py
311-311: Possible SQL injection vector through string-based query construction
(S608)
314-314: Possible SQL injection vector through string-based query construction
(S608)
🔇 Additional comments (2)
dcs_core/integrations/databases/postgres.py (1)
321-418: Well-structured metrics computation with comprehensive type handling.

The method provides a clean interface for computing per-column metrics with proper type normalization for JSON serialization. The `_normalize_metrics` helper correctly handles Decimal→float, datetime→ISO string, and UUID→string conversions.

dcs_core/integrations/databases/mssql.py (1)

788-818: Implementation correctly mirrors the PostgreSQL version.

The method properly uses the pyodbc cursor API and maintains consistent behavior with the PostgreSQL implementation. The qualified table name and column quoting provide adequate protection against SQL injection despite the static analysis warnings.
Actionable comments posted: 5
🧹 Nitpick comments (2)
dcs_core/integrations/databases/mssql.py (2)
689-692: Consider using timezone-aware datetime.

The datetime usage now correctly matches the module import. However, `datetime.datetime.utcnow()` is deprecated in Python 3.12+ in favor of `datetime.datetime.now(datetime.timezone.utc)`.

Consider this update for future-proofing:

```diff
 if isinstance(updated_time, str):
     updated_time = datetime.datetime.strptime(
         updated_time, "%Y-%m-%d %H:%M:%S.%f"
     )
-return int((datetime.datetime.utcnow() - updated_time).total_seconds())
+return int((datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None) - updated_time).total_seconds())
```
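For reference, a small sketch of the naive-vs-aware replacement (variable names here are illustrative, not from the PR):

```python
import datetime

# Deprecated in 3.12+: returns a naive datetime in UTC
naive = datetime.datetime.utcnow()

# Preferred: an aware timestamp, stripped back to naive only because the
# surrounding code subtracts naive datetimes parsed from MSSQL strings.
aware = datetime.datetime.now(datetime.timezone.utc)
naive_equiv = aware.replace(tzinfo=None)

print(naive_equiv.tzinfo is None)  # True
```

Keeping the subtraction between two naive UTC values preserves the existing behavior while moving off the deprecated call.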
757-757: Add `strict=True` to `zip()` for safety.

Without `strict=True`, `zip()` will silently truncate if `columns` and `result_row` have mismatched lengths, potentially hiding data corruption.

```diff
-row = dict(zip(columns, result_row))
+row = dict(zip(columns, result_row, strict=True))
```

Note: the `strict` parameter requires Python 3.10+. If supporting older versions, verify lengths manually.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- dcs_core/integrations/databases/mssql.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/mssql.py (2)
dcs_core/integrations/databases/postgres.py (5)
- build_table_metrics_query (321-418)
- quote_column (69-75)
- qualified_table_name (59-67)
- _normalize_metrics (376-405)
- fetch_sample_values_from_database (291-319)

dcs_core/core/datasource/sql_datasource.py (4)

- quote_column (148-154)
- qualified_table_name (130-138)
- fetchone (122-125)
- fetchall (117-120)
🪛 Ruff (0.14.1)
dcs_core/integrations/databases/mssql.py
757-757: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
807-807: Avoid specifying long messages outside the exception class
(TRY003)
810-810: Possible SQL injection vector through string-based query construction
(S608)
813-813: Possible SQL injection vector through string-based query construction
(S608)
🔇 Additional comments (5)
dcs_core/integrations/databases/mssql.py (5)
15-22: LGTM: Import fixes address previous issues.

The change to `import datetime` (line 15) correctly resolves the past review concern about datetime usage at lines 689-692 and 767. The additional imports (Decimal, UUID, text) are necessary for the new metric normalization and query execution functionality.
695-704: LGTM: Method signature and early validation.

The method signature matches the Postgres implementation, and the early return for empty `column_info` prevents unnecessary query execution.
759-775: LGTM: Value normalization helper.

The `_normalize_metrics` helper correctly handles Decimal, datetime, UUID, and recursive structures for JSON serialization. The datetime module usage (line 767) is correct.
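As a rough illustration of the conversion rules described above (a standalone sketch, not the actual `_normalize_metrics` implementation; the function name and sample data are hypothetical):

```python
import datetime
import json
import uuid
from decimal import Decimal

def normalize(value):
    """Convert DB-native values into JSON-serializable equivalents."""
    if isinstance(value, Decimal):
        return float(value)
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    if isinstance(value, uuid.UUID):
        return str(value)
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize(v) for v in value]
    return value

metrics = {"average": Decimal("3.5"), "seen_at": datetime.date(2024, 1, 2)}
print(json.dumps(normalize(metrics)))  # {"average": 3.5, "seen_at": "2024-01-02"}
```

Without a step like this, `json.dumps` raises `TypeError` on Decimal, datetime, and UUID values straight from the driver.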
777-788: LGTM: Column-wise metric transformation.

The logic correctly groups metrics by column name prefix and constructs the expected output structure matching the Postgres implementation.
790-808: LGTM: Method signature and validation.

The method signature matches the Postgres implementation, returning both rows and column names. Input validation correctly prevents empty column lists.
```python
for col in column_info:
    name = col["column_name"]
    dtype = col["data_type"].lower()

    quoted_name = self.quote_column(name)

    query_parts.append(f"COUNT(DISTINCT {quoted_name}) AS [{name}_distinct]")
    query_parts.append(
        f"COUNT({quoted_name}) - COUNT(DISTINCT {quoted_name}) AS [{name}_duplicate]"
    )
    query_parts.append(
        f"SUM(CASE WHEN {quoted_name} IS NULL THEN 1 ELSE 0 END) AS [{name}_is_null]"
    )

    if dtype in (
        "int",
        "integer",
        "bigint",
        "smallint",
        "tinyint",
        "decimal",
        "numeric",
        "float",
        "real",
        "money",
        "smallmoney",
    ):
        query_parts.append(f"MIN({quoted_name}) AS [{name}_min]")
        query_parts.append(f"MAX({quoted_name}) AS [{name}_max]")
        query_parts.append(
            f"AVG(CAST({quoted_name} AS FLOAT)) AS [{name}_average]"
        )

    elif dtype in ("varchar", "nvarchar", "char", "nchar", "text", "ntext"):
        query_parts.append(
            f"MAX(LEN({quoted_name})) AS [{name}_max_character_length]"
        )
```
🧩 Analysis chain
Verify duplicate count logic: excludes NULLs.
Line 713 uses COUNT(quoted_name) - COUNT(DISTINCT quoted_name) which excludes NULL values from the duplicate count. The Postgres implementation (line 341 in relevant snippets) uses COUNT(*) - COUNT(DISTINCT quoted_name), which includes the row count regardless of NULLs.
This difference means MSSQL will report fewer duplicates when NULL values exist. Ensure this behavior is intentional.
Run this script to verify the duplicate count behavior difference:
🏁 Script executed:

```bash
#!/bin/bash
# Description: Check if other datasources use COUNT(*) or COUNT(column) for duplicate calculation
# Search for duplicate count patterns in other database integrations
rg -nP -A2 -B2 'duplicate.*COUNT' --type=py -g 'dcs_core/integrations/databases/*.py'
```

Length of output: 0
🏁 Script executed:

```bash
#!/bin/bash
# Search for duplicate count logic in database integration files
# Check both the MSSQL and Postgres implementations
echo "=== MSSQL Implementation (lines 705-741) ==="
sed -n '705,741p' dcs_core/integrations/databases/mssql.py
echo -e "\n=== Postgres Implementation (checking around line 341) ==="
sed -n '330,350p' dcs_core/integrations/databases/postgres.py
echo -e "\n=== Search for COUNT patterns in all database files ==="
rg -n 'COUNT\(' dcs_core/integrations/databases/ --type=py -B1 -A1 | grep -i duplicate
```

Length of output: 2500
🏁 Script executed:

```bash
#!/bin/bash
# Find all database integration files and check their duplicate count implementations
echo "=== Database integration files available ==="
fd --type f -e py 'dcs_core/integrations/databases/' | sort
echo -e "\n=== Duplicate count implementations across all databases ==="
for file in dcs_core/integrations/databases/*.py; do
  if [ -f "$file" ]; then
    echo "File: $(basename $file)"
    grep -n "duplicate" "$file" | head -5
  fi
done
```

Length of output: 983
🏁 Script executed:

```bash
#!/bin/bash
# Find all database integration files and check their duplicate count implementations
echo "=== Database integration files available ==="
fd --type f -e py 'dcs_core/integrations/databases/*.py' 2>/dev/null | sort
echo -e "\n=== Duplicate count logic across all database integrations ==="
for file in dcs_core/integrations/databases/*.py; do
  if [ -f "$file" ]; then
    basename=$(basename "$file")
    echo "=== $basename ==="
    grep -n "duplicate" "$file" -i -A1 | grep -E "(COUNT|duplicate)" | head -3
  fi
done
```

Length of output: 579
Inconsistent duplicate count logic between MSSQL and Postgres: MSSQL excludes NULLs.
Line 713 in MSSQL uses COUNT({quoted_name}) - COUNT(DISTINCT {quoted_name}), which excludes NULL values from the duplicate count. Postgres (line 339) uses COUNT(*) - COUNT(DISTINCT {quoted_name}), which includes all rows regardless of NULLs.
When NULL values exist in a column, MSSQL will report fewer duplicates than Postgres for the same data. This inconsistency should either be aligned across both implementations or explicitly documented as intentional behavior.
🤖 Prompt for AI Agents
dcs_core/integrations/databases/mssql.py around lines 705 to 741: the
duplicate-count expression uses COUNT(column) - COUNT(DISTINCT column) which
excludes NULLs and therefore differs from Postgres' COUNT(*) - COUNT(DISTINCT
column) behavior; to fix, change the MSSQL expression to COUNT(*) -
COUNT(DISTINCT {quoted_name}) so NULLs are treated the same as Postgres (or, if
excluding NULLs is desired, make the Postgres implementation mirror this by
using COUNT({quoted_name}) instead), and update any relevant comments/tests to
reflect the chosen consistent behavior.
```python
if additional_queries:
    query_parts.extend(additional_queries)
```
Critical: SQL injection risk in additional_queries parameter.

The `additional_queries` list is directly extended into the query without validation or parameterization. This allows arbitrary SQL injection if the caller passes malicious queries.

Either:

- Document that `additional_queries` must be pre-validated/sanitized by the caller, or
- Add validation to ensure queries are safe, or
- Remove this parameter if not strictly necessary.

Example mitigation (documentation approach):
```diff
 def build_table_metrics_query(
     self,
     table_name: str,
     column_info: list[dict],
     additional_queries: Optional[List[str]] = None,
 ) -> list[dict]:
+    """
+    Build and execute a query to compute column-level metrics.
+
+    :param table_name: Name of the table
+    :param column_info: List of column metadata dicts with 'column_name' and 'data_type'
+    :param additional_queries: Pre-validated SQL fragments to extend SELECT list.
+        WARNING: Must be sanitized by caller to prevent SQL injection.
+    :return: List of dicts with column_name and metrics
+    """
     query_parts = []
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In dcs_core/integrations/databases/mssql.py around lines 743-744, the code
extends query_parts with additional_queries causing an SQL injection risk; to
fix, either remove the additional_queries parameter or enforce/validate it:
update the function to reject raw SQL strings by default (raise an error if
additional_queries is provided), or require callers to supply only parameterized
clause templates plus a corresponding parameters list and merge them safely, or
implement strict validation (whitelist allowed statement types/patterns or a
regex that only allows safe clause fragments like "AND <col> = %s") and raise
ValueError on failure; also update the function docstring to document the new
contract (what is accepted and that callers must provide parameters, not raw
SQL).
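One possible whitelist-style validation, as the prompt suggests. This is a hypothetical sketch: the function name, the regex, and the allowed aggregate set are assumptions, not project code, and the pattern would need tuning for real-world fragments:

```python
import re

# Allow only simple aggregate expressions aliased into the SELECT list,
# e.g. "COUNT(*) AS total_rows"; reject everything else.
_SAFE_FRAGMENT = re.compile(
    r"^(COUNT|SUM|MIN|MAX|AVG)\([\w\[\]\". *]+\)\s+AS\s+[\w\[\]\"]+$",
    re.IGNORECASE,
)

def validate_additional_queries(fragments):
    """Raise ValueError on any fragment that is not a safe aggregate alias."""
    for frag in fragments:
        if not _SAFE_FRAGMENT.match(frag.strip()):
            raise ValueError(f"Unsafe query fragment: {frag!r}")
    return fragments

validate_additional_queries(["COUNT(*) AS total_rows"])  # accepted
```

A fragment like `"1; DROP TABLE users --"` fails the match and raises, which turns a silent injection vector into a loud error at call time.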
```python
if len(column_names) == 1 and column_names[0] == "*":
    query = f"SELECT TOP {limit} * FROM {qualified_table_name}"
else:
    columns = ", ".join([self.quote_column(col) for col in column_names])
    query = f"SELECT TOP {limit} {columns} FROM {qualified_table_name}"
```
Validate limit parameter to prevent SQL injection.

While `limit` is typed as `int`, it's directly interpolated into the SQL string (lines 810, 813). If called with a malicious value through dynamic typing or type confusion, this could allow SQL injection.

Add explicit validation:

```diff
 if not column_names:
     raise ValueError("At least one column name must be provided")
+if not isinstance(limit, int) or limit < 1:
+    raise ValueError("Limit must be a positive integer")
+
 if len(column_names) == 1 and column_names[0] == "*":
     query = f"SELECT TOP {limit} * FROM {qualified_table_name}"
```
query = f"SELECT TOP {limit} * FROM {qualified_table_name}"🧰 Tools
🪛 Ruff (0.14.1)
810-810: Possible SQL injection vector through string-based query construction
(S608)
813-813: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
In dcs_core/integrations/databases/mssql.py around lines 809 to 814, the TOP
{limit} interpolation uses limit directly which can lead to SQL injection if a
non-integer or malicious value is passed; validate and sanitize limit before
formatting the query by coercing it to an integer inside a try/except (or
checking isinstance), enforcing a positive range (e.g. 1..MAX_LIMIT) and raising
a ValueError on invalid values, or clamping to allowed bounds; do this
validation once before the if/else block so the constructed query only ever uses
a safely validated integer.
Actionable comments posted: 0
♻️ Duplicate comments (3)
dcs_core/integrations/databases/mssql.py (3)
713-713: Duplicate count logic still inconsistent with Postgres.

This uses `COUNT({quoted_name})`, which excludes NULL values, while Postgres uses `COUNT(*)` (line 339 in postgres.py). This means MSSQL and Postgres will report different duplicate counts when NULL values are present in the column.

For consistency across database implementations, align the duplicate count calculation. Either:

- Change MSSQL to use `COUNT(*) - COUNT(DISTINCT {quoted_name})`, or
- Update Postgres to match MSSQL's behavior and document the decision.
743-744: SQL injection risk: additional_queries parameter is not validated.

The `additional_queries` list is directly extended into the query without any validation or sanitization. If a caller passes malicious SQL fragments, they will be executed directly.

Either:

- Remove the parameter if not strictly necessary, or
- Document that callers MUST pre-validate/sanitize these queries, or
- Implement validation to reject unsafe SQL patterns.
809-813: Validate limit parameter to prevent potential SQL injection.

While `limit` is typed as `int`, it's directly interpolated into the SQL string. In dynamic typing scenarios or if the type hint is bypassed, this could allow SQL injection.

Add explicit validation before constructing the query. Apply this diff:

```diff
 if not column_names:
     raise ValueError("At least one column name must be provided")
+if not isinstance(limit, int) or limit < 1:
+    raise ValueError("limit must be a positive integer")
+
 if len(column_names) == 1 and column_names[0] == "*":
     query = f"SELECT TOP {limit} * FROM {qualified_table_name}"
```
🧹 Nitpick comments (1)
dcs_core/integrations/databases/mssql.py (1)
757-757: Consider adding explicitstrict=parameter tozip().Python 3.10+ supports
strict=Trueinzip()to ensure both iterables have the same length. This prevents silent bugs ifcolumnsandresult_rowhave mismatched lengths.Apply this diff:
- row = dict(zip(columns, result_row)) + row = dict(zip(columns, result_row, strict=True))Note: Only apply this if the minimum Python version is 3.10+. If supporting Python 3.9 or earlier, keep the current code.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- dcs_core/integrations/databases/mssql.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/mssql.py (2)
dcs_core/integrations/databases/postgres.py (5)
- build_table_metrics_query (321-418)
- quote_column (69-75)
- qualified_table_name (59-67)
- _normalize_metrics (376-405)
- fetch_sample_values_from_database (291-319)

dcs_core/core/datasource/sql_datasource.py (3)

- quote_column (148-154)
- qualified_table_name (130-138)
- fetchall (117-120)
🪛 Ruff (0.14.1)
dcs_core/integrations/databases/mssql.py
757-757: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
807-807: Avoid specifying long messages outside the exception class
(TRY003)
810-810: Possible SQL injection vector through string-based query construction
(S608)
813-813: Possible SQL injection vector through string-based query construction
(S608)
🔇 Additional comments (5)
dcs_core/integrations/databases/mssql.py (5)
15-18: LGTM: Import changes are correct and consistent.

The datetime module import and additional imports (Decimal, UUID, text) are properly used throughout the new methods. The datetime module import correctly supports the qualified usage (datetime.datetime, datetime.date) in lines 689-692 and 767.

Also applies to: 22-22

689-692: LGTM: datetime usage correctly updated.

The datetime module is now properly qualified as `datetime.datetime` for both `strptime` and `utcnow()` calls, consistent with the module-level import.
759-775: LGTM: _normalize_metrics helper is well-implemented.

The normalization helper properly handles Decimal, datetime/date, UUID, and nested structures for JSON serialization. The logic is defensive and handles all expected database types correctly.
815-821: LGTM: Cursor resource management is properly implemented.

The try-finally block ensures the cursor is always closed, even if an exception occurs during query execution or result fetching.
790-822: Well-structured method with proper wildcard and return type support.

The method correctly:
- Validates input parameters
- Handles wildcard column selection ("*")
- Extracts column names from query metadata
- Returns a tuple matching the Postgres implementation signature
This ensures consistent behavior across database implementations.
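The validation-plus-wildcard flow above can be sketched as pure query construction (a hypothetical helper; the real method also executes the query through the pyodbc cursor and returns rows plus column names):

```python
def build_sample_query(qualified_table_name, column_names, limit=5,
                       quote=lambda c: f"[{c}]"):
    """Build a TOP-N sample query with wildcard support (MSSQL-style quoting)."""
    if not column_names:
        raise ValueError("At least one column name must be provided")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    if len(column_names) == 1 and column_names[0] == "*":
        return f"SELECT TOP {limit} * FROM {qualified_table_name}"
    cols = ", ".join(quote(c) for c in column_names)
    return f"SELECT TOP {limit} {cols} FROM {qualified_table_name}"

print(build_sample_query("[dbo].[orders]", ["id", "amount"], limit=3))
# SELECT TOP 3 [id], [amount] FROM [dbo].[orders]
```

Separating query construction from execution like this also makes the wildcard and validation branches trivially unit-testable without a live database.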
User description
…alues_from_database for MSSQL and Postgres
Fixes/Implements
Description
Summary Goes here.
Type of change
Delete irrelevant options.
How Has This Been Tested?
PR Type
Enhancement
Description
- Add `build_table_metrics_query` method to MSSQL for computing table metrics
- Implement `fetch_sample_values_from_database` in MSSQL with wildcard support
- Enhance Postgres `fetch_sample_values_from_database` to return column names
- Add value normalization for JSON serialization of database metrics
Diagram Walkthrough
File Walkthrough
mssql.py: Add metrics query builder and sample value fetching (dcs_core/integrations/databases/mssql.py)

- Import `Decimal`, `UUID`, and `text` from sqlalchemy
- Add `build_table_metrics_query` method to compute column-level metrics (distinct count, duplicates, nulls, min, max, average for numeric types, max length for string types)
- Add `_normalize_metrics` helper function to safely convert database values (Decimal, UUID, datetime) to JSON-serializable types
- Add `fetch_sample_values_from_database` method with support for wildcard column selection and proper cursor management
postgres.py: Enhance sample value fetching with column names (dcs_core/integrations/databases/postgres.py)

- Update `fetch_sample_values_from_database` to support selecting all columns with `*`

Summary by CodeRabbit
New Features
Behavior Change