
Conversation

@AnshumanX304 (Contributor) commented Oct 21, 2025

User description

…alues_from_database for MSSQL and Postgres

Fixes/Implements

Description

Summary Goes here.

Type of change

Delete irrelevant options.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Locally Tested
  • Needs Testing From Production

PR Type

Enhancement


Description

  • Add build_table_metrics_query method to MSSQL for computing table metrics

  • Implement fetch_sample_values_from_database in MSSQL with wildcard support

  • Enhance Postgres fetch_sample_values_from_database to return column names

  • Add value normalization for JSON serialization of database metrics (a usage sketch of the new methods follows below)
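
A hedged usage sketch of the new surface described above (method names and return shapes come from this PR; the data-source setup, table name, and column_info values are illustrative assumptions):

    # Assumes `ds` is an already-configured MssqlDataSource instance.
    column_info = [
        {"column_name": "id", "data_type": "int"},
        {"column_name": "email", "data_type": "varchar"},
    ]

    # One aggregated SELECT computes all per-column metrics in-database.
    metrics = ds.build_table_metrics_query("customers", column_info)

    # Wildcard sampling: returns the rows plus the resolved column names.
    rows, column_names = ds.fetch_sample_values_from_database(
        "customers", ["*"], limit=5
    )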


Diagram Walkthrough

flowchart LR
  A["MSSQL Database"] -->|"build_table_metrics_query"| B["Column Metrics<br/>distinct, null, min, max, avg"]
  A -->|"fetch_sample_values_from_database"| C["Sample Rows<br/>with Column Names"]
  D["Postgres Database"] -->|"fetch_sample_values_from_database"| E["Sample Rows<br/>with Column Names"]
  B -->|"_normalize_metrics"| F["JSON Serializable<br/>Values"]

File Walkthrough

Relevant files
Enhancement
mssql.py
Add metrics query builder and sample value fetching           

dcs_core/integrations/databases/mssql.py

  • Add imports for Decimal, UUID, and text from sqlalchemy
  • Implement build_table_metrics_query method to compute column-level
    metrics (distinct count, duplicates, nulls, min, max, average for
    numeric types, max length for string types); an example of the
    generated SQL appears after this entry
  • Add _normalize_metrics helper function to safely convert database
    values (Decimal, UUID, datetime) to JSON-serializable types
  • Implement fetch_sample_values_from_database method with support for
    wildcard column selection and proper cursor management
+130/-0 
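
For orientation, the aggregated query that build_table_metrics_query assembles for one numeric and one text column would look roughly like this. This is a sketch reconstructed from the walkthrough above and the quoted code in the review comments below, not copied from the repo; the table qualification is assumed:

    SELECT
        COUNT(DISTINCT [id]) AS [id_distinct],
        COUNT([id]) - COUNT(DISTINCT [id]) AS [id_duplicate],
        SUM(CASE WHEN [id] IS NULL THEN 1 ELSE 0 END) AS [id_is_null],
        MIN([id]) AS [id_min],
        MAX([id]) AS [id_max],
        AVG(CAST([id] AS FLOAT)) AS [id_average],
        COUNT(DISTINCT [email]) AS [email_distinct],
        COUNT([email]) - COUNT(DISTINCT [email]) AS [email_duplicate],
        SUM(CASE WHEN [email] IS NULL THEN 1 ELSE 0 END) AS [email_is_null],
        MAX(LEN([email])) AS [email_max_character_length]
    FROM [dbo].[customers];  -- schema/table names are illustrative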
postgres.py
Enhance sample value fetching with column names                   

dcs_core/integrations/databases/postgres.py

  • Add wildcard support to fetch_sample_values_from_database for
    selecting all columns with *
  • Extract and return column names from query result metadata
  • Change return type from a single list to a tuple of (rows,
    column_names); a sketch of the new behavior follows this entry
+8/-3     
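
A minimal sketch of the changed Postgres behavior (the method name, wildcard handling, and tuple return are from this PR; the SQLAlchemy-based body is an illustrative reconstruction, not the repo's exact code):

    from sqlalchemy import text

    # Sketched as a method on the Postgres data source class (class name
    # and the `connection` attribute are assumptions).
    def fetch_sample_values_from_database(self, table_name, column_names, limit=5):
        qualified_table_name = self.qualified_table_name(table_name)
        if len(column_names) == 1 and column_names[0] == "*":
            query = f"SELECT * FROM {qualified_table_name} LIMIT {limit}"
        else:
            columns = ", ".join(self.quote_column(c) for c in column_names)
            query = f"SELECT {columns} FROM {qualified_table_name} LIMIT {limit}"
        result = self.connection.execute(text(query))
        rows = result.fetchall()
        column_names_out = list(result.keys())  # column names from result metadata
        return rows, column_names_out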

Summary by CodeRabbit

  • New Features

    • Compute per-column table metrics for MSSQL (distinct, duplicates, nulls, min/max/avg, text length) and return normalized results.
    • Fetch sample rows from specified columns in MSSQL and PostgreSQL.
  • Behavior Change

    • PostgreSQL sample queries now return both rows and column names alongside results.

qodo-code-review bot commented Oct 21, 2025

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified. No security vulnerabilities were detected by AI analysis; human verification is advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
No custom compliance provided

Follow the guide to enable custom compliance check.

Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

coderabbitai bot commented Oct 21, 2025

Walkthrough

Adds MSSQL data-source utilities: build_table_metrics_query (in-database aggregated per-column metrics with normalization) and fetch_sample_values_from_database (TOP N sampling returning rows and column names). Updates Postgres sampling to return (rows, column_names) and support "*" wildcard.

Changes

Cohort / File(s) Change Summary
MSSQL metrics & sampling
dcs_core/integrations/databases/mssql.py
Added build_table_metrics_query(table_name, column_info, additional_queries=None) which builds and executes a single aggregated SELECT producing per-column metrics (distinct, duplicates, nulls, min/max/avg for numerics, max char length for text), then normalizes values (Decimal→float, datetime/date→str, UUID→str, lists/dicts) and returns list[dict]. Added fetch_sample_values_from_database(table_name, column_names, limit=5) which runs a TOP N (or SELECT * for "*") query and returns (rows, column_names). New imports: datetime, Decimal, UUID, sqlalchemy.text.
Postgres sampling update
dcs_core/integrations/databases/postgres.py
fetch_sample_values_from_database(table_name, column_names, limit=5) now supports wildcard "*" and returns a tuple (rows, column_names) by extracting column names from result metadata (behavior changed from returning rows-only).
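
To make the return shape concrete, the normalized output of build_table_metrics_query might look like the literal below. The {"column_name", "metrics"} grouping follows the review notes later in this thread; the individual metric key names and all values are invented for illustration:

    [
        {
            "column_name": "id",
            "metrics": {
                "distinct": 1000,
                "duplicate": 0,
                "is_null": 0,
                "min": 1,
                "max": 1000,
                "average": 500.5,  # Decimal normalized to float
            },
        },
        {
            "column_name": "created_at",
            "metrics": {
                "distinct": 987,
                "duplicate": 13,
                "is_null": 0,
                "min": "2024-01-01T00:00:00",  # datetime normalized to ISO string
                "max": "2025-10-21T09:30:00",
            },
        },
    ]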

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant MSSQL as MSSQL DataSource
    participant DB as Database
    participant Norm as Normalizer

    rect rgb(230,245,255)
    Note over Caller,MSSQL: build_table_metrics_query
    Caller->>MSSQL: build_table_metrics_query(table, column_info, additional_queries)
    MSSQL->>MSSQL: Build aggregated SELECT (per-column metrics + extras)
    MSSQL->>DB: Execute aggregated SELECT
    DB-->>MSSQL: Single aggregated row
    MSSQL->>Norm: Normalize Decimal/UUID/datetime -> JSON types
    Norm-->>MSSQL: Normalized values
    MSSQL-->>Caller: List[dict] per-column metrics
    end

    rect rgb(245,230,245)
    Note over Caller,MSSQL: fetch_sample_values_from_database
    Caller->>MSSQL: fetch_sample_values_from_database(table, columns, limit)
    MSSQL->>DB: Execute TOP N SELECT (or SELECT * for "*")
    DB-->>MSSQL: Rows + column metadata
    MSSQL-->>Caller: (rows, column_names)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Review effort 2/5

Suggested reviewers

  • rishav2803
  • Ryuk-me

Poem

🐰 Hopping through queries with a twitch of my nose,

I tally the nulls where the tallies impose.
I fetch tiny samples and tidy them bright,
Decimal to float, UUIDs into light.
A rabbit's small audit — metrics all right.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name: Description Check
Status: ⚠️ Warning
Explanation: The pull request description does not follow the required template structure. While the PR contains substantive custom sections (a diagram walkthrough and a detailed file walkthrough), the required template sections are incomplete: the "Fixes/Implements #" field is empty, the "Description" section still contains the placeholder text "Summary Goes here.", and the "Type of change" and "How Has This Been Tested?" checkboxes are not marked as completed per the template format.
Resolution: Complete the required template sections: fill in the "Fixes/Implements #" field with the relevant issue number (if applicable), replace the "Summary Goes here." placeholder with an actual summary, mark the appropriate "Type of change" checkbox (noting that "Enhancement" is listed separately in the PR), and mark the "How Has This Been Tested?" checkbox to indicate whether testing was done locally or requires production validation. The existing custom description sections can remain as supporting detail.

Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 75.00%, below the required threshold of 80.00%.
Resolution: Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name: Title Check
Status: ✅ Passed
Explanation: The title "feat: add build_table_metrics_query method and improve fetch_sample_value" accurately reflects the main changes: a new method for computing table metrics in MSSQL and improvements to fetch_sample_values_from_database. Although "fetch_sample_value" appears truncated relative to the actual method name, the intent is clear; the title is concise, uses a conventional-commit prefix, and is specific enough for teammates reviewing history to understand the scope of work.


qodo-code-review bot commented Oct 21, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: High-level
Move the normalization logic to a shared utility

The _normalize_metrics function, which converts database values for JSON
serialization, is currently defined within the MssqlDataSource class. It should
be relocated to a shared module or the SQLDataSource base class to prevent code
duplication and allow other database integrations to reuse it.

Examples:

dcs_core/integrations/databases/mssql.py [757-773]
        def _normalize_metrics(value):
            """Safely normalize DB metric values for JSON serialization."""
            if value is None:
                return None
            if isinstance(value, Decimal):
                return float(value)
            if isinstance(value, (int, float, bool)):
                return value
            if isinstance(value, (datetime.datetime, datetime.date)):
                return value.isoformat()

 ... (clipped 7 lines)

Solution Walkthrough:

Before:

class MssqlDataSource(SQLDataSource):
    def build_table_metrics_query(...):
        # ...
        result_row = cursor.fetchone()
        row = dict(zip(columns, result_row))

        def _normalize_metrics(value):
            """Safely normalize DB metric values for JSON serialization."""
            if isinstance(value, Decimal):
                return float(value)
            # ... other type conversions
            return str(value)

        for key, value in row.items():
            # ...
            col_metrics[metric_name] = _normalize_metrics(value)
        # ...

After:

class SQLDataSource:
    def _normalize_metrics(self, value):
        """Safely normalize DB metric values for JSON serialization."""
        if isinstance(value, Decimal):
            return float(value)
        # ... other type conversions
        return str(value)

class MssqlDataSource(SQLDataSource):
    def build_table_metrics_query(...):
        # ...
        result_row = cursor.fetchone()
        row = dict(zip(columns, result_row))

        for key, value in row.items():
            # ...
            col_metrics[metric_name] = self._normalize_metrics(value)
        # ...
Suggestion importance[1-10]: 7


Why: This is a strong architectural suggestion that correctly identifies the generic nature of the _normalize_metrics function and proposes moving it to a shared location to improve reusability and maintainability.

Impact: Medium

Category: General
Refactor nested function into static method

Move the nested function _normalize_metrics out of build_table_metrics_query and
define it as a static method on the MssqlDataSource class to improve code
structure and reusability.

dcs_core/integrations/databases/mssql.py [757-773]

+@staticmethod
 def _normalize_metrics(value):
     """Safely normalize DB metric values for JSON serialization."""
     if value is None:
         return None
     if isinstance(value, Decimal):
         return float(value)
     if isinstance(value, (int, float, bool)):
         return value
-    if isinstance(value, (datetime.datetime, datetime.date)):
+    if isinstance(value, (datetime, datetime.date)):
         return value.isoformat()
     if isinstance(value, UUID):
         return str(value)
     if isinstance(value, list):
-        return [_normalize_metrics(v) for v in value]
+        return [MssqlDataSource._normalize_metrics(v) for v in value]
     if isinstance(value, dict):
-        return {k: _normalize_metrics(v) for k, v in value.items()}
+        return {k: MssqlDataSource._normalize_metrics(v) for k, v in value.items()}
     return str(value)
Suggestion importance[1-10]: 5


Why: The suggestion correctly identifies that the nested function _normalize_metrics can be refactored into a static method to improve code structure and reusability, which is a valid code quality improvement.

Impact: Low

coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
dcs_core/integrations/databases/postgres.py (2)

291-319: Update docstring to reflect the changed return signature.

The return type changed from List[Tuple] to Tuple[List[Tuple], List[str]], but the docstring still states :return: List of row tuples. This is a breaking API change for existing callers.

Apply this diff to fix the docstring:

     def fetch_sample_values_from_database(
         self,
         table_name: str,
         column_names: list[str],
         limit: int = 5,
-    ) -> List[Tuple]:
+    ) -> Tuple[List[Tuple], List[str]]:
         """
         Fetch sample rows for specific columns from the given table.
 
         :param table_name: The name of the table.
         :param column_names: List of column names to fetch.
         :param limit: Number of rows to fetch.
-        :return: List of row tuples.
+        :return: Tuple of (list of row tuples, list of column names).
         """

338-340: Fix incorrect duplicate count formula.

The current formula COUNT(*) - COUNT(DISTINCT column) incorrectly counts NULL values as duplicates because COUNT(*) includes NULLs while COUNT(DISTINCT column) excludes them.

For a column with 10 total rows, 2 NULLs, and 5 distinct non-null values:

  • Current formula: 10 - 5 = 5 (incorrect, counts NULLs)
  • Correct formula: 8 - 5 = 3 (actual duplicates)

Apply this diff:

             query_parts.append(
-                f'COUNT(*) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
+                f'COUNT({self.quote_column(name)}) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
             )
🧹 Nitpick comments (3)
dcs_core/integrations/databases/mssql.py (3)

693-786: Consider extracting shared logic to reduce duplication.

The build_table_metrics_query implementation is nearly identical to the PostgreSQL version (lines 321-418 in postgres.py), differing only in:

  • Quoting style ([column] vs "column")
  • SQL dialect details (LEN vs CHAR_LENGTH, CAST(...AS FLOAT) vs direct AVG)

Consider extracting the common structure (query building, normalization logic) into a base class method with dialect-specific overrides for SQL generation.


755-755: Add strict=True to zip() for safety.

Without strict=True, mismatched lengths between columns and result_row will silently truncate data, potentially masking bugs.

Apply this diff:

-        row = dict(zip(columns, result_row))
+        row = dict(zip(columns, result_row, strict=True))

788-818: Consider shared interface for fetch_sample_values_from_database.

This method duplicates the structure and logic from postgres.py (lines 291-319), differing only in execution mechanism (pyodbc cursor vs SQLAlchemy text()). Similar to build_table_metrics_query, extracting common validation and structure to a base class would improve maintainability.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fec3972 and e9b1aa3.

📒 Files selected for processing (2)
  • dcs_core/integrations/databases/mssql.py (2 hunks)
  • dcs_core/integrations/databases/postgres.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
dcs_core/integrations/databases/mssql.py (1)
dcs_core/integrations/databases/postgres.py (5)
  • build_table_metrics_query (321-418)
  • quote_column (69-75)
  • qualified_table_name (59-67)
  • _normalize_metrics (376-405)
  • fetch_sample_values_from_database (291-319)
dcs_core/integrations/databases/postgres.py (1)
dcs_core/core/datasource/sql_datasource.py (2)
  • quote_column (148-154)
  • fetchall (117-120)
🪛 Ruff (0.14.1)
dcs_core/integrations/databases/mssql.py

755-755: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


805-805: Avoid specifying long messages outside the exception class

(TRY003)


808-808: Possible SQL injection vector through string-based query construction

(S608)


811-811: Possible SQL injection vector through string-based query construction

(S608)

dcs_core/integrations/databases/postgres.py

311-311: Possible SQL injection vector through string-based query construction

(S608)


314-314: Possible SQL injection vector through string-based query construction

(S608)

🔇 Additional comments (2)
dcs_core/integrations/databases/postgres.py (1)

321-418: Well-structured metrics computation with comprehensive type handling.

The method provides a clean interface for computing per-column metrics with proper type normalization for JSON serialization. The _normalize_metrics helper correctly handles Decimal→float, datetime→ISO string, and UUID→string conversions.

dcs_core/integrations/databases/mssql.py (1)

788-818: Implementation correctly mirrors PostgreSQL version.

The method properly uses pyodbc cursor API and maintains consistent behavior with the PostgreSQL implementation. The qualified table name and column quoting provide adequate protection against SQL injection despite static analysis warnings.

coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (2)
dcs_core/integrations/databases/mssql.py (2)

689-692: Consider using timezone-aware datetime.

The datetime usage now correctly matches the module import. However, datetime.datetime.utcnow() is deprecated in Python 3.12+ in favor of datetime.datetime.now(datetime.timezone.utc).

Consider this update for future-proofing:

             if isinstance(updated_time, str):
                 updated_time = datetime.datetime.strptime(
                     updated_time, "%Y-%m-%d %H:%M:%S.%f"
                 )
-            return int((datetime.datetime.utcnow() - updated_time).total_seconds())
+            return int((datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None) - updated_time).total_seconds())

757-757: Add strict=True to zip() for safety.

Without strict=True, zip() will silently truncate if columns and result_row have mismatched lengths, potentially hiding data corruption.

-        row = dict(zip(columns, result_row))
+        row = dict(zip(columns, result_row, strict=True))

Note: strict parameter requires Python 3.10+. If supporting older versions, verify lengths manually.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e9b1aa3 and 181c673.

📒 Files selected for processing (1)
  • dcs_core/integrations/databases/mssql.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/mssql.py (2)
dcs_core/integrations/databases/postgres.py (5)
  • build_table_metrics_query (321-418)
  • quote_column (69-75)
  • qualified_table_name (59-67)
  • _normalize_metrics (376-405)
  • fetch_sample_values_from_database (291-319)
dcs_core/core/datasource/sql_datasource.py (4)
  • quote_column (148-154)
  • qualified_table_name (130-138)
  • fetchone (122-125)
  • fetchall (117-120)
🪛 Ruff (0.14.1)
dcs_core/integrations/databases/mssql.py

757-757: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


807-807: Avoid specifying long messages outside the exception class

(TRY003)


810-810: Possible SQL injection vector through string-based query construction

(S608)


813-813: Possible SQL injection vector through string-based query construction

(S608)

🔇 Additional comments (5)
dcs_core/integrations/databases/mssql.py (5)

15-22: LGTM: Import fixes address previous issues.

The change to import datetime (line 15) correctly resolves the past review concern about datetime usage at lines 689-692 and 767. The additional imports (Decimal, UUID, text) are necessary for the new metric normalization and query execution functionality.


695-704: LGTM: Method signature and early validation.

The method signature matches the Postgres implementation and the early return for empty column_info prevents unnecessary query execution.


759-775: LGTM: Value normalization helper.

The _normalize_metrics helper correctly handles Decimal, datetime, UUID, and recursive structures for JSON serialization. The datetime module usage (line 767) is correct.


777-788: LGTM: Column-wise metric transformation.

The logic correctly groups metrics by column name prefix and constructs the expected output structure matching the Postgres implementation.


790-808: LGTM: Method signature and validation.

The method signature matches the Postgres implementation, returning both rows and column names. Input validation correctly prevents empty column lists.

Comment on lines +705 to +741
for col in column_info:
    name = col["column_name"]
    dtype = col["data_type"].lower()

    quoted_name = self.quote_column(name)

    query_parts.append(f"COUNT(DISTINCT {quoted_name}) AS [{name}_distinct]")
    query_parts.append(
        f"COUNT({quoted_name}) - COUNT(DISTINCT {quoted_name}) AS [{name}_duplicate]"
    )
    query_parts.append(
        f"SUM(CASE WHEN {quoted_name} IS NULL THEN 1 ELSE 0 END) AS [{name}_is_null]"
    )

    if dtype in (
        "int",
        "integer",
        "bigint",
        "smallint",
        "tinyint",
        "decimal",
        "numeric",
        "float",
        "real",
        "money",
        "smallmoney",
    ):
        query_parts.append(f"MIN({quoted_name}) AS [{name}_min]")
        query_parts.append(f"MAX({quoted_name}) AS [{name}_max]")
        query_parts.append(
            f"AVG(CAST({quoted_name} AS FLOAT)) AS [{name}_average]"
        )

    elif dtype in ("varchar", "nvarchar", "char", "nchar", "text", "ntext"):
        query_parts.append(
            f"MAX(LEN({quoted_name})) AS [{name}_max_character_length]"
        )

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify duplicate count logic: excludes NULLs.

Line 713 uses COUNT(quoted_name) - COUNT(DISTINCT quoted_name) which excludes NULL values from the duplicate count. The Postgres implementation (line 341 in relevant snippets) uses COUNT(*) - COUNT(DISTINCT quoted_name), which includes the row count regardless of NULLs.

This difference means MSSQL will report fewer duplicates when NULL values exist. Ensure this behavior is intentional.

Run this script to verify the duplicate count behavior difference:


🏁 Script executed:

#!/bin/bash
# Description: Check if other datasources use COUNT(*) or COUNT(column) for duplicate calculation

# Search for duplicate count patterns in other database integrations
rg -nP -A2 -B2 'duplicate.*COUNT' --type=py -g 'dcs_core/integrations/databases/*.py'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for duplicate count logic in database integration files
# Check both the MSSQL and Postgres implementations

echo "=== MSSQL Implementation (lines 705-741) ===" 
sed -n '705,741p' dcs_core/integrations/databases/mssql.py

echo -e "\n=== Postgres Implementation (checking around line 341) ==="
sed -n '330,350p' dcs_core/integrations/databases/postgres.py

echo -e "\n=== Search for COUNT patterns in all database files ==="
rg -n 'COUNT\(' dcs_core/integrations/databases/ --type=py -B1 -A1 | grep -i duplicate

Length of output: 2500



🏁 Script executed:

#!/bin/bash
# Find all database integration files and check their duplicate count implementations

echo "=== Database integration files available ==="
fd --type f -e py 'dcs_core/integrations/databases/' | sort

echo -e "\n=== Duplicate count implementations across all databases ==="
for file in dcs_core/integrations/databases/*.py; do
  if [ -f "$file" ]; then
    echo "File: $(basename $file)"
    grep -n "duplicate" "$file" | head -5
  fi
done

Length of output: 983


🏁 Script executed:

#!/bin/bash
# Find all database integration files and check their duplicate count implementations

echo "=== Database integration files available ==="
fd --type f -e py 'dcs_core/integrations/databases/*.py' 2>/dev/null | sort

echo -e "\n=== Duplicate count logic across all database integrations ==="
for file in dcs_core/integrations/databases/*.py; do
  if [ -f "$file" ]; then
    basename=$(basename "$file")
    echo "=== $basename ==="
    grep -n "duplicate" "$file" -i -A1 | grep -E "(COUNT|duplicate)" | head -3
  fi
done

Length of output: 579


Inconsistent duplicate count logic between MSSQL and Postgres: MSSQL excludes NULLs.

Line 713 in MSSQL uses COUNT({quoted_name}) - COUNT(DISTINCT {quoted_name}), which excludes NULL values from the duplicate count. Postgres (line 339) uses COUNT(*) - COUNT(DISTINCT {quoted_name}), which includes all rows regardless of NULLs.

When NULL values exist in a column, MSSQL will report fewer duplicates than Postgres for the same data. This inconsistency should either be aligned across both implementations or explicitly documented as intentional behavior.

🤖 Prompt for AI Agents
dcs_core/integrations/databases/mssql.py around lines 705 to 741: the
duplicate-count expression uses COUNT(column) - COUNT(DISTINCT column) which
excludes NULLs and therefore differs from Postgres' COUNT(*) - COUNT(DISTINCT
column) behavior; to fix, change the MSSQL expression to COUNT(*) -
COUNT(DISTINCT {quoted_name}) so NULLs are treated the same as Postgres (or, if
excluding NULLs is desired, make the Postgres implementation mirror this by
using COUNT({quoted_name}) instead), and update any relevant comments/tests to
reflect the chosen consistent behavior.

Comment on lines +743 to +744
if additional_queries:
    query_parts.extend(additional_queries)

⚠️ Potential issue | 🔴 Critical

Critical: SQL injection risk in additional_queries parameter.

The additional_queries list is directly extended into the query without validation or parameterization. This allows arbitrary SQL injection if the caller passes malicious queries.

Either:

  1. Document that additional_queries must be pre-validated/sanitized by the caller, or
  2. Add validation to ensure queries are safe, or
  3. Remove this parameter if not strictly necessary.

Example mitigation (documentation approach):

     def build_table_metrics_query(
         self,
         table_name: str,
         column_info: list[dict],
         additional_queries: Optional[List[str]] = None,
     ) -> list[dict]:
+        """
+        Build and execute a query to compute column-level metrics.
+        
+        :param table_name: Name of the table
+        :param column_info: List of column metadata dicts with 'column_name' and 'data_type'
+        :param additional_queries: Pre-validated SQL fragments to extend SELECT list.
+                                   WARNING: Must be sanitized by caller to prevent SQL injection.
+        :return: List of dicts with column_name and metrics
+        """
         query_parts = []

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In dcs_core/integrations/databases/mssql.py around lines 743-744, the code
extends query_parts with additional_queries causing an SQL injection risk; to
fix, either remove the additional_queries parameter or enforce/validate it:
update the function to reject raw SQL strings by default (raise an error if
additional_queries is provided), or require callers to supply only parameterized
clause templates plus a corresponding parameters list and merge them safely, or
implement strict validation (whitelist allowed statement types/patterns or a
regex that only allows safe clause fragments like "AND <col> = %s") and raise
ValueError on failure; also update the function docstring to document the new
contract (what is accepted and that callers must provide parameters, not raw
SQL).
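
One possible shape for the validation option is sketched below. This is a hedged illustration, not the repo's code; the narrow regex whitelist (simple aggregate fragments only) is an assumption about what callers legitimately pass:

    import re

    # Deliberately narrow whitelist: accepts fragments like
    # "COUNT(DISTINCT [col]) AS [alias]" and rejects everything else,
    # including nested function calls and CASE expressions.
    _SAFE_FRAGMENT = re.compile(
        r"^[A-Za-z_]+\((?:DISTINCT\s+)?\[?[\w ]+\]?\)\s+AS\s+\[?\w+\]?$",
        re.IGNORECASE,
    )

    def _validate_additional_queries(additional_queries):
        for fragment in additional_queries or []:
            if not _SAFE_FRAGMENT.match(fragment.strip()):
                raise ValueError(f"Unsafe SQL fragment rejected: {fragment!r}")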

Comment on lines +809 to +814
if len(column_names) == 1 and column_names[0] == "*":
    query = f"SELECT TOP {limit} * FROM {qualified_table_name}"
else:
    columns = ", ".join([self.quote_column(col) for col in column_names])
    query = f"SELECT TOP {limit} {columns} FROM {qualified_table_name}"


⚠️ Potential issue | 🟠 Major

Validate limit parameter to prevent SQL injection.

While limit is typed as int, it's directly interpolated into the SQL string (lines 810, 813). If called with a malicious value through dynamic typing or type confusion, this could allow SQL injection.

Add explicit validation:

         if not column_names:
             raise ValueError("At least one column name must be provided")
 
+        if not isinstance(limit, int) or limit < 1:
+            raise ValueError("Limit must be a positive integer")
+
         if len(column_names) == 1 and column_names[0] == "*":
             query = f"SELECT TOP {limit} * FROM {qualified_table_name}"
🧰 Tools
🪛 Ruff (0.14.1)

810-810: Possible SQL injection vector through string-based query construction

(S608)


813-813: Possible SQL injection vector through string-based query construction

(S608)

🤖 Prompt for AI Agents
In dcs_core/integrations/databases/mssql.py around lines 809 to 814, the TOP
{limit} interpolation uses limit directly which can lead to SQL injection if a
non-integer or malicious value is passed; validate and sanitize limit before
formatting the query by coercing it to an integer inside a try/except (or
checking isinstance), enforcing a positive range (e.g. 1..MAX_LIMIT) and raising
a ValueError on invalid values, or clamping to allowed bounds; do this
validation once before the if/else block so the constructed query only ever uses
a safely validated integer.

coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
dcs_core/integrations/databases/mssql.py (3)

713-713: Duplicate count logic still inconsistent with Postgres.

This uses COUNT({quoted_name}) which excludes NULL values, while Postgres uses COUNT(*) (line 339 in postgres.py). This means MSSQL and Postgres will report different duplicate counts when NULL values are present in the column.

For consistency across database implementations, align the duplicate count calculation. Either:

  • Change MSSQL to use COUNT(*) - COUNT(DISTINCT {quoted_name}), or
  • Update Postgres to match MSSQL's behavior and document the decision.

743-744: SQL injection risk: additional_queries parameter is not validated.

The additional_queries list is directly extended into the query without any validation or sanitization. If a caller passes malicious SQL fragments, they will be executed directly.

Either:

  1. Remove the parameter if not strictly necessary, or
  2. Document that callers MUST pre-validate/sanitize these queries, or
  3. Implement validation to reject unsafe SQL patterns.

809-813: Validate limit parameter to prevent potential SQL injection.

While limit is typed as int, it's directly interpolated into the SQL string. In dynamic typing scenarios or if the type hint is bypassed, this could allow SQL injection.

Add explicit validation before constructing the query.

Apply this diff:

         if not column_names:
             raise ValueError("At least one column name must be provided")
 
+        if not isinstance(limit, int) or limit < 1:
+            raise ValueError("limit must be a positive integer")
+
         if len(column_names) == 1 and column_names[0] == "*":
             query = f"SELECT TOP {limit} * FROM {qualified_table_name}"
🧹 Nitpick comments (1)
dcs_core/integrations/databases/mssql.py (1)

757-757: Consider adding explicit strict= parameter to zip().

Python 3.10+ supports strict=True in zip() to ensure both iterables have the same length. This prevents silent bugs if columns and result_row have mismatched lengths.

Apply this diff:

-        row = dict(zip(columns, result_row))
+        row = dict(zip(columns, result_row, strict=True))

Note: Only apply this if the minimum Python version is 3.10+. If supporting Python 3.9 or earlier, keep the current code.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 181c673 and 8305368.

📒 Files selected for processing (1)
  • dcs_core/integrations/databases/mssql.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/mssql.py (2)
dcs_core/integrations/databases/postgres.py (5)
  • build_table_metrics_query (321-418)
  • quote_column (69-75)
  • qualified_table_name (59-67)
  • _normalize_metrics (376-405)
  • fetch_sample_values_from_database (291-319)
dcs_core/core/datasource/sql_datasource.py (3)
  • quote_column (148-154)
  • qualified_table_name (130-138)
  • fetchall (117-120)
🪛 Ruff (0.14.1)
dcs_core/integrations/databases/mssql.py

757-757: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


807-807: Avoid specifying long messages outside the exception class

(TRY003)


810-810: Possible SQL injection vector through string-based query construction

(S608)


813-813: Possible SQL injection vector through string-based query construction

(S608)

🔇 Additional comments (5)
dcs_core/integrations/databases/mssql.py (5)

15-18: LGTM: Import changes are correct and consistent.

The datetime module import and additional imports (Decimal, UUID, text) are properly used throughout the new methods. The datetime module import correctly supports the qualified usage (datetime.datetime, datetime.date) in lines 689-692 and 767.

Also applies to: 22-22


689-692: LGTM: datetime usage correctly updated.

The datetime module is now properly qualified as datetime.datetime for both strptime and utcnow() calls, consistent with the module-level import.


759-775: LGTM: _normalize_metrics helper is well-implemented.

The normalization helper properly handles Decimal, datetime/date, UUID, and nested structures for JSON serialization. The logic is defensive and handles all expected database types correctly.


815-821: LGTM: Cursor resource management is properly implemented.

The try-finally block ensures the cursor is always closed, even if an exception occurs during query execution or result fetching.


790-822: Well-structured method with proper wildcard and return type support.

The method correctly:

  • Validates input parameters
  • Handles wildcard column selection ("*")
  • Extracts column names from query metadata
  • Returns a tuple matching the Postgres implementation signature

This ensures consistent behavior across database implementations.

AnshumanX304 merged commit 6a27739 into datachecks:main Oct 21, 2025
5 of 7 checks passed