
Conversation

@AnshumanX304 (Contributor) commented Oct 14, 2025

User description

…s retrieval

Fixes/Implements

Description

Summary Goes here.

Type of change

Delete irrelevant options.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Locally Tested
  • Needs Testing From Production

PR Type

Enhancement


Description

  • Add build_table_metrics_query method to compute column-level statistics

  • Support numeric metrics (min, max, average) and text metrics (max character length)

  • Include distinct count, duplicate count, and null count for all columns

  • Implement value normalization for JSON serialization (Decimal, datetime, UUID); see the sketch below
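
A minimal sketch of such a normalization helper, assuming only standard-library types (illustrative, not the exact code merged in this PR):

import datetime
from decimal import Decimal
from uuid import UUID

def normalize(value):
    # Convert database driver values into JSON-serializable Python types.
    if value is None:
        return None
    if isinstance(value, Decimal):
        return float(value)  # lossy at very high precision, but JSON-friendly
    if isinstance(value, (int, float, bool)):
        return value
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    if isinstance(value, UUID):
        return str(value)
    if isinstance(value, list):
        return [normalize(v) for v in value]
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    return str(value)  # fallback; reviewers below argue for raising instead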


Diagram Walkthrough

flowchart LR
  A["Column Info Input"] --> B["Build Metrics Query"]
  B --> C["Execute SQL Query"]
  C --> D["Normalize Values"]
  D --> E["Return Column Metrics"]

File Walkthrough

Relevant files

Enhancement: dcs_core/integrations/databases/postgres.py (+88/-1)
Add column metrics query builder with type normalization

  • Add build_table_metrics_query method to generate and execute column
    statistics queries
  • Implement metrics calculation for distinct, duplicate, and null counts
    across all columns
  • Add type-specific metrics: min/max/average for numeric types, max
    character length for text types
  • Create _normalize_metrics helper function to convert database types
    (Decimal, datetime, UUID) to JSON-serializable formats

Summary by CodeRabbit

  • New Features
    • Column-level profiling for PostgreSQL tables with per-column metrics.
    • Metrics: distinct count, duplicate count, null count; min/max/average for numeric columns; max character length for text columns.
    • Option to include additional custom metric queries.
    • Results returned as per-column structures with JSON-friendly normalization for various value types.

@coderabbitai bot commented Oct 14, 2025

Walkthrough

Adds a new method to PostgresDataSource that builds and executes a single aggregated SQL query to compute per-column metrics for a table, then normalizes the single-row result into JSON-serializable Python values and returns a list of per-column dictionaries. (50 words)

Changes

Cohort / File(s): Postgres column metrics query (dcs_core/integrations/databases/postgres.py)

Summary of changes:
Added build_table_metrics_query(self, table_name: str, column_info: list[dict], additional_queries: Optional[list[str]] = None) -> list[dict]. Constructs per-column SQL fragments (distinct, duplicate, is_null; numeric: min/max/avg; text: max_character_length), optionally appends extra queries, executes a single aggregated SELECT, maps the single-row result into per-column dicts, and normalizes Decimal, datetime/date, UUID, nested lists/dicts, and None into JSON-serializable Python types. Added imports for datetime, Decimal, and UUID. Handles empty column_info by returning [].
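
For orientation, a hedged sketch of the shape this summary describes, written as a method on PostgresDataSource; the type lists, aliases, and null-count expression are illustrative simplifications, not the exact merged code:

# Sketch only: assumes self.connection is an open SQLAlchemy connection.
from typing import Optional
from sqlalchemy import text

def build_table_metrics_query(
    self,
    table_name: str,
    column_info: list[dict],
    additional_queries: Optional[list[str]] = None,
) -> list[dict]:
    if not column_info:  # avoid emitting an invalid "SELECT FROM ..."
        return []
    parts = []
    for col in column_info:
        name = col["column_name"]
        dtype = col["data_type"].lower()
        parts.append(f'COUNT(DISTINCT "{name}") AS "{name}_distinct"')
        parts.append(f'COUNT(*) FILTER (WHERE "{name}" IS NULL) AS "{name}_is_null"')
        if dtype in ("integer", "bigint", "smallint", "numeric", "real", "double precision"):
            parts.append(f'MIN("{name}") AS "{name}_min"')
            parts.append(f'MAX("{name}") AS "{name}_max"')
            parts.append(f'AVG("{name}") AS "{name}_average"')
        elif dtype in ("text", "character varying", "character"):
            parts.append(f'MAX(CHAR_LENGTH("{name}")) AS "{name}_max_character_length"')
    if additional_queries:
        parts.extend(additional_queries)
    query = f"SELECT {', '.join(parts)} FROM {table_name};"
    row = dict(self.connection.execute(text(query)).mappings().one())
    # Split the single wide row back into one dict per column.
    return [
        {
            "column_name": col["column_name"],
            "metrics": {
                key.removeprefix(col["column_name"] + "_"): value  # Python 3.9+
                for key, value in row.items()
                if key.startswith(col["column_name"] + "_")
            },
        }
        for col in column_info
    ]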

Sequence Diagram(s)

sequenceDiagram
    actor Caller
    participant DS as PostgresDataSource
    participant PG as PostgreSQL

    Caller->>DS: build_table_metrics_query(table_name, column_info, additional_queries?)
    DS->>DS: Assemble per-column SQL fragments\n(distinct, duplicate, is_null,\n+ type-specific expressions)
    DS->>PG: Execute single aggregated SELECT
    PG-->>DS: Single-row aggregated result
    DS->>DS: Split row into per-column dicts\nNormalize Decimal/date/datetime/UUID/nested
    DS-->>Caller: return List[dict] of column metrics

    rect rgb(235,245,255)
    note right of DS: New: dynamic SQL assembly\nand JSON-safe normalization
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • rishav2803

Poem

I hop through rows where metrics bloom,
Counting nulls and lengths in every room.
Distinct and duplicate, I softly chase,
Normalize Decimals to a JSON-friendly face.
A rabbit cheers for tidy queries! 🐇📊

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Description Check (⚠️ Warning): The PR description retains the template structure but leaves key sections as placeholders and unchecked checklist items, omitting the issue reference, a meaningful summary, selected change type, and testing details. Resolution: please fill in the “Fixes/Implements” section with the relevant issue number, replace the placeholder summary with an actual description of the changes, select the appropriate type of change and testing checkboxes, and remove any unused template lines.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 66.67%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (1 passed)

  • Title Check: The title clearly identifies the addition of the build_table_metrics_query method and concisely describes its purpose in enhancing column metrics, without extraneous detail or ambiguity.


@qodo-code-review bot commented Oct 14, 2025

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
SQL injection

Description: Raw SQL query is built using unescaped table and column values from column_info and
table_name, enabling SQL injection if these inputs are not strictly validated or quoted.
postgres.py [351-354]

Referred Code
query = f"SELECT\n    {',\n    '.join(query_parts)}\nFROM {table_name};"

result = self.connection.execute(text(query))
row = dict(list(result)[0]._mapping)
Ticket Compliance
🎫 No ticket provided
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
No custom compliance provided

Follow the guide to enable custom compliance check.

Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review bot commented Oct 14, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
High-level
Re-architect metric generation for scalability

The current method of building a single, large SQL query for all column metrics
does not scale well. It should be refactored to use more targeted queries, such
as one per column or per metric type, to avoid database performance issues.

Examples:

dcs_core/integrations/databases/postgres.py [315-351]
    def build_table_metrics_query(
        self, table_name: str, column_info: list[dict]
    ) -> list[dict]:
        query_parts = []

        for col in column_info:
            name = col["column_name"]
            dtype = col["data_type"].lower()

            query_parts.append(f'COUNT(DISTINCT "{name}") AS "{name}_distinct"')

 ... (clipped 27 lines)

Solution Walkthrough:

Before:

def build_table_metrics_query(table_name, column_info):
    query_parts = []
    for col in column_info:
        name = col["column_name"]
        # Add multiple metrics for each column to a single list
        query_parts.append(f'COUNT(DISTINCT "{name}") AS "{name}_distinct"')
        query_parts.append(f'MIN("{name}") AS "{name}_min"')
        # ... and so on for all metrics
    
    # Combine all parts into one very large query
    query = f"SELECT {',\n    '.join(query_parts)} FROM {table_name};"
    
    # Execute the single, complex query
    result = self.connection.execute(text(query))
    row = result.first()
    # ... process the single wide row

After:

def build_table_metrics_query(table_name, column_info):
    all_column_metrics = []
    for col in column_info:
        name = col["column_name"]
        # Build a targeted query for each column
        metric_parts = []
        metric_parts.append(f'COUNT(DISTINCT "{name}") AS "distinct"')
        metric_parts.append(f'MIN("{name}") AS "min"')
        # ... and so on for all applicable metrics
        
        query = f"SELECT {', '.join(metric_parts)} FROM {table_name};"
        
        # Execute a simpler query per column
        result = self.connection.execute(text(query))
        metrics = result.first()._mapping
        all_column_metrics.append({"column_name": name, "metrics": metrics})
    
    return all_column_metrics
Suggestion importance [1-10]: 9

Why: This suggestion correctly identifies a critical architectural flaw where a single, wide SQL query will not scale for tables with many columns, leading to significant performance degradation or query failures.

Impact: High
Security
Prevent SQL injection using helpers
Suggestion Impact: The commit updated the query construction to use a qualified table name variable set via self.qualified_table_name(table_name) in the FROM clause, addressing the suggested SQL injection concern.

code diff:

+        qualified_table = self.qualified_table_name(table_name)
+        query = f'SELECT\n    {",\n    ".join(query_parts)}\nFROM {qualified_table};'
 

Prevent a potential SQL injection vulnerability by using the
self.qualified_table_name() method to properly quote and sanitize the table_name
in the SQL query string.

dcs_core/integrations/databases/postgres.py [351]

-query = f"SELECT\n    {',\n    '.join(query_parts)}\nFROM {table_name};"
+query = f"SELECT\n    {',\n    '.join(query_parts)}\nFROM {self.qualified_table_name(table_name)};"

[Suggestion processed]

Suggestion importance [1-10]: 9

Why: The suggestion correctly identifies a potential SQL injection vulnerability and proposes using an existing helper method (qualified_table_name) to sanitize the table name, which is a critical security fix.

Impact: High
Possible issue
Raise error for unhandled data types
Suggestion Impact: The commit updated the datetime isinstance check to use datetime.datetime instead of datetime, aligning with part of the suggestion. However, it did not implement raising TypeError for unhandled types (the diff snippet ends before showing the fallback change), so the impact is partial.

code diff:

             if isinstance(value, (int, float, bool)):
                 return value
 
-            if isinstance(value, (datetime, datetime.date)):
+            if isinstance(value, (datetime.datetime, datetime.date)):
                 return value.isoformat()
 

In the _normalize_metrics function, replace the final return str(value) with
raise TypeError to prevent silently converting unhandled data types to strings
and instead fail explicitly.

dcs_core/integrations/databases/postgres.py [374-385]

-if isinstance(value, (datetime, datetime.date)):
+if isinstance(value, (datetime.datetime, datetime.date)):
     return value.isoformat()
 
 if isinstance(value, UUID):
     return str(value)
 
 if isinstance(value, list):
     return [_normalize_metrics(v) for v in value]
 if isinstance(value, dict):
     return {k: _normalize_metrics(v) for k, v in value.items()}
 
-return str(value)
+# Fallback for any other types that are not explicitly handled.
+# It's safer to raise an error than to silently convert to string.
+raise TypeError(f"Unserializable type: {type(value)}")

[Suggestion processed]

Suggestion importance [1-10]: 7

Why: The suggestion correctly identifies that silently converting unhandled types to strings can hide bugs and proposes a fail-fast approach by raising a TypeError, which improves the code's robustness.

Impact: Medium
General
Improve performance by fetching single result

Replace list(result)[0] with result.first() to more efficiently fetch the single
result row from the aggregation query.

dcs_core/integrations/databases/postgres.py [353-354]

 result = self.connection.execute(text(query))
-row = dict(list(result)[0]._mapping)
+row = dict(result.first()._mapping)
Suggestion importance [1-10]: 5

Why: The suggestion correctly points out that using result.first() is more idiomatic and efficient than list(result)[0] for fetching a single row, improving code quality and adherence to best practices.

Impact: Low

@coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (5)
dcs_core/integrations/databases/postgres.py (5)

353-355: Avoid materializing result; fetch one mapping row.

Prefer result.mappings().one() (or .first() if an empty result is possible). This also aligns with the Ruff hint.

-        result = self.connection.execute(text(query))
-        row = dict(list(result)[0]._mapping)
+        result = self.connection.execute(text(query))
+        row = result.mappings().one()

Based on static analysis hints


332-351: Postgres type checks are incomplete/misaligned.

  • Numeric: include real, double precision. int, float, double aren’t information_schema.data_type values.
  • Text: include character, drop string.
-            if dtype in (
-                "int",
-                "integer",
-                "bigint",
-                "smallint",
-                "decimal",
-                "numeric",
-                "float",
-                "double",
-            ):
+            if dtype in (
+                "integer",
+                "bigint",
+                "smallint",
+                "decimal",
+                "numeric",
+                "real",
+                "double precision",
+            ):
                 query_parts.append(f'MIN("{name}") AS "{name}_min"')
                 query_parts.append(f'MAX("{name}") AS "{name}_max"')
                 query_parts.append(f'AVG("{name}") AS "{name}_average"')

-            elif dtype in ("varchar", "text", "char", "string", "character varying"):
+            elif dtype in ("text", "character", "character varying", "varchar", "char"):
                 query_parts.append(
                     f'MAX(CHAR_LENGTH("{name}")) AS "{name}_max_character_length"'
                 )

320-351: Use quote_column helper consistently.

You’re manually quoting identifiers; prefer self.quote_column(name) to keep quoting logic centralized.

Example:

-            query_parts.append(f'COUNT(DISTINCT "{name}") AS "{name}_distinct"')
+            qcol = self.quote_column(name)
+            query_parts.append(f'COUNT(DISTINCT {qcol}) AS "{name}_distinct"')

Apply similarly for all occurrences.


315-318: Method name vs behavior mismatch.

It both builds and executes the query. Consider renaming it to fetch_table_metrics or updating the docstring to reflect execution.


315-351: Operational note: heavy aggregates on wide/large tables.

Multiple COUNT(DISTINCT ...) + MIN/MAX/AVG across many columns can be expensive. Consider:

  • Optional sampling ratio (e.g., TABLESAMPLE) for exploratory metrics (see the sketch after this list).
  • Per-column feature flag to limit aggregates.
  • Running in off-peak or with statement timeout.
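
For example, a minimal sketch of the sampling idea, assuming a SQLAlchemy connection running inside a transaction (SET LOCAL is transaction-scoped); the 1% ratio and 30s timeout are illustrative choices, not values from this PR:

from sqlalchemy import text

def sampled_distinct_count(connection, table_name: str, column: str, pct: float = 1.0) -> int:
    # Cap runtime so a heavy aggregate cannot hold the connection indefinitely.
    connection.execute(text("SET LOCAL statement_timeout = '30s'"))
    # SYSTEM sampling reads roughly pct% of the table's pages: fast but approximate.
    query = text(
        f'SELECT COUNT(DISTINCT "{column}") FROM {table_name} '
        f"TABLESAMPLE SYSTEM (:pct)"
    )
    return connection.execute(query, {"pct": pct}).scalar_one()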
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f70e2db and e0ed19b.

📒 Files selected for processing (1)
  • dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
dcs_core/integrations/databases/postgres.py

354-354: Prefer next(iter(result)) over single element slice

Replace with next(iter(result))

(RUF015)

Comment on lines 374 to 378
if isinstance(value, (datetime, datetime.date)):
return value.isoformat()

if isinstance(value, UUID):
return str(value)

⚠️ Potential issue | 🔴 Critical

Fix isinstance checks for datetime and UUID (runtime error + wrong type).

  • datetime in isinstance(..., (datetime, datetime.date)) is a module, causing TypeError.
  • UUID from SQLAlchemy isn’t the runtime type returned by drivers; use uuid.UUID.

Apply:

-            if isinstance(value, (datetime, datetime.date)):
+            if isinstance(value, (datetime.datetime, datetime.date)):
                 return value.isoformat()

-            if isinstance(value, UUID):
+            if isinstance(value, PyUUID):
                 return str(value)

And adjust imports per separate comment.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In dcs_core/integrations/databases/postgres.py around lines 374 to 378, the
isinstance checks are wrong: `datetime` is a module (causing TypeError) and
`UUID` is the SQLAlchemy type (not the runtime uuid type). Replace the tuple
check with (datetime.datetime, datetime.date) and use uuid.UUID for the UUID
check; update imports to ensure datetime (module) and uuid (module) are
available or import datetime.datetime and uuid.UUID as needed per the other
import comment.

@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
dcs_core/integrations/databases/postgres.py (4)

19-19: Fix the UUID import: use stdlib uuid.UUID, not SQLAlchemy's UUID type.

This is the same critical issue flagged in a previous review. The SQLAlchemy UUID is a type definition for columns, not the runtime value class returned by database drivers. Runtime UUID values from psycopg are uuid.UUID from the standard library.

Apply this diff:

-from sqlalchemy import UUID, create_engine, text
+from sqlalchemy import create_engine, text
+from uuid import UUID as PyUUID

Then update line 382 to use PyUUID instead of UUID.


318-321: Guard against empty column_info to prevent invalid SQL generation.

This critical issue was flagged in a previous review and remains unresolved. If column_info is empty, the generated SQL becomes SELECT FROM ..., which is syntactically invalid and will cause a runtime error.

Apply this diff:

     ) -> list[dict]:
         query_parts = []
+        if not column_info:
+            return []
 
         for col in column_info:

325-327: Fix duplicate count formula: NULLs should not be counted as duplicates.

This critical issue was flagged in a previous review and remains unresolved. The formula COUNT(*) - COUNT(DISTINCT "{name}") incorrectly treats NULLs as duplicates because COUNT(*) includes NULL rows while COUNT(DISTINCT "{name}") excludes them.

The correct formula excludes NULLs from both counts:

Apply this diff:

-            query_parts.append(
-                f'COUNT(*) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
-            )
+            query_parts.append(
+                f'COUNT({self.quote_column(name)}) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
+            )
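
As a quick sanity check: for a column holding the values 1, 1, NULL, COUNT(*) is 3, COUNT(col) is 2, and COUNT(DISTINCT col) is 1, so the original formula reports 2 duplicates while the corrected formula reports the expected 1.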

379-383: Fix isinstance checks: datetime is a module, and UUID must be from stdlib.

This critical issue was flagged in a previous review and remains unresolved. Two problems:

  1. Line 379: datetime is the module, not the class. Using it in isinstance will raise TypeError: isinstance() arg 2 must be a type or tuple of types.
  2. Line 382: UUID from SQLAlchemy is a column type class, not the runtime value type. Database drivers return uuid.UUID from the standard library.

Apply this diff (after fixing the import per the first comment):

-            if isinstance(value, (datetime, datetime.date)):
+            if isinstance(value, (datetime.datetime, datetime.date)):
                 return value.isoformat()
 
-            if isinstance(value, UUID):
+            if isinstance(value, PyUUID):
                 return str(value)
🧹 Nitpick comments (1)
dcs_core/integrations/databases/postgres.py (1)

359-359: Prefer next(iter(result)) for efficiency.

Creating a full list from the result just to access the first element is inefficient. Use next(iter(result)) instead.

Apply this diff:

         result = self.connection.execute(text(query))
-        row = dict(list(result)[0]._mapping)
+        row = dict(next(iter(result))._mapping)

Based on static analysis hints.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e0ed19b and 989ee55.

📒 Files selected for processing (1)
  • dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (5)
dcs_core/integrations/databases/mysql.py (2)
  • quote_column (86-92)
  • qualified_table_name (76-84)
dcs_core/integrations/databases/oracle.py (2)
  • quote_column (80-86)
  • qualified_table_name (70-78)
dcs_core/integrations/databases/mssql.py (2)
  • quote_column (120-126)
  • qualified_table_name (110-118)
dcs_core/integrations/databases/sybase.py (2)
  • quote_column (210-216)
  • qualified_table_name (200-208)
dcs_core/core/datasource/sql_datasource.py (2)
  • quote_column (148-154)
  • qualified_table_name (130-138)
🪛 GitHub Actions: CI
dcs_core/integrations/databases/postgres.py

[error] 1-1: Trailing whitespace detected and removed by pre-commit hook: fix in file.


[error] 356-356: Black formatting failed: cannot parse source due to f-string backslash issue. Possible Python version compatibility problem.

🪛 Ruff (0.14.0)
dcs_core/integrations/databases/postgres.py

359-359: Prefer next(iter(result)) over single element slice

Replace with next(iter(result))

(RUF015)

@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
dcs_core/integrations/databases/postgres.py (3)

316-322: Guard against empty column_info to prevent invalid SQL.

If column_info is empty, the generated SQL becomes SELECT FROM ..., which is invalid and will cause a runtime error.

Apply this diff to add an early guard:

     ) -> list[dict]:
         query_parts = []
+        if not column_info:
+            return []

         for col in column_info:

328-336: Duplicate count formula treats NULLs as duplicates.

Line 332 uses COUNT(*) - COUNT(DISTINCT ...) which incorrectly treats NULL values as duplicates. The correct formula excludes NULLs.

Apply this diff to fix the duplicate count calculation:

             query_parts.append(
-                f'COUNT(*) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
+                f'COUNT({self.quote_column(name)}) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
             )

387-388: Fix isinstance check: datetime is a module, not a type.

Line 387 uses isinstance(value, (datetime, datetime.date)) but datetime is imported as a module (line 15), not a class. This will raise TypeError: isinstance() arg 2 must be a type or tuple of types at runtime.

Apply this diff:

-            if isinstance(value, (datetime, datetime.date)):
+            if isinstance(value, (datetime.datetime, datetime.date)):
                 return value.isoformat()
🧹 Nitpick comments (2)
dcs_core/integrations/databases/postgres.py (2)

359-361: Rename loop variable and document SQL injection risk.

The loop variable queries (plural) iterates over individual query strings, which is confusing. Additionally, additional_queries strings are concatenated directly without validation, creating a potential SQL injection risk if input is not sanitized by the caller.

Apply this diff to improve clarity:

         if additional_queries:
-            for queries in additional_queries:
-                query_parts.append(queries)
+            for query in additional_queries:
+                query_parts.append(query)

Consider adding a docstring warning that additional_queries must be pre-validated to prevent SQL injection.
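
One possible shape for that warning, with hypothetical wording rather than the merged docstring:

def build_table_metrics_query(self, table_name, column_info, additional_queries=None):
    """Build and execute a single aggregated per-column metrics query.

    Warning:
        Strings in ``additional_queries`` are interpolated into the SQL
        verbatim. Callers must validate or whitelist them; never pass
        unsanitized user input.
    """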


363-367: Use next(iter(...)) instead of list slicing for single-row result.

Line 367 materializes the entire result set into a list before taking the first element, which is inefficient. Since the aggregate query returns exactly one row, use next(iter(result)) instead.

As per static analysis hints.

Apply this diff:

         result = self.connection.execute(text(query))
-        row = dict(list(result)[0]._mapping)
+        row = dict(next(iter(result))._mapping)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 989ee55 and ff9a6d1.

📒 Files selected for processing (1)
  • dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (5)
dcs_core/integrations/databases/mysql.py (2)
  • quote_column (86-92)
  • qualified_table_name (76-84)
dcs_core/integrations/databases/oracle.py (2)
  • quote_column (80-86)
  • qualified_table_name (70-78)
dcs_core/integrations/databases/mssql.py (2)
  • quote_column (120-126)
  • qualified_table_name (110-118)
dcs_core/integrations/databases/sybase.py (2)
  • quote_column (210-216)
  • qualified_table_name (200-208)
dcs_core/core/datasource/sql_datasource.py (2)
  • quote_column (148-154)
  • qualified_table_name (130-138)
🪛 Ruff (0.14.0)
dcs_core/integrations/databases/postgres.py

367-367: Prefer next(iter(result)) over single element slice

Replace with next(iter(result))

(RUF015)

🔇 Additional comments (5)
dcs_core/integrations/databases/postgres.py (5)

15-18: LGTM!

The imports are correctly added to support the new functionality.


338-352: LGTM!

The numeric type detection and metric calculations (min, max, average) are correctly implemented.


354-357: LGTM!

The text type detection and max character length calculation are correctly implemented.


369-398: Well-structured normalization logic with good defensive programming.

The _normalize_metrics helper correctly handles various data types and includes proper None checks, recursive handling of nested structures, and safe fallback to string conversion. The only issue is the datetime isinstance check on line 387 (flagged separately).


400-411: LGTM!

The result transformation logic correctly maps the flat row dictionary to a structured per-column format, applying normalization to each metric value.

@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
dcs_core/integrations/databases/postgres.py (1)

333-335: Duplicate count is wrong when NULLs exist.

COUNT(*) - COUNT(DISTINCT ...) treats every NULL as a duplicate. Use COUNT(column) so NULLs are excluded.

Apply:

-            query_parts.append(
-                f'COUNT(*) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
-            )
+            query_parts.append(
+                f'COUNT({self.quote_column(name)}) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
+            )
🧹 Nitpick comments (1)
dcs_core/integrations/databases/postgres.py (1)

368-369: Fetch the single row without materializing the whole result.

list(result)[0] pulls the entire cursor into memory. Use result.fetchone() (or result.mappings().first()) to grab the only row more efficiently.

Consider:

-        result = self.connection.execute(text(query))
-        row = dict(list(result)[0]._mapping)
+        result = self.connection.execute(text(query))
+        fetched = result.fetchone()
+        row = dict(fetched._mapping) if fetched else {}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff9a6d1 and f6c0119.

📒 Files selected for processing (1)
  • dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (2)
dcs_core/integrations/databases/mysql.py (2)
  • quote_column (86-92)
  • qualified_table_name (76-84)
dcs_core/core/datasource/sql_datasource.py (2)
  • quote_column (148-154)
  • qualified_table_name (130-138)
🪛 Ruff (0.14.0)
dcs_core/integrations/databases/postgres.py

369-369: Prefer next(iter(result)) over single element slice

Replace with next(iter(result))

(RUF015)

AnshumanX304 merged commit 851fa77 into datachecks:main on Oct 14, 2025 (5 of 7 checks passed).