
Conversation

@Kanai2003 Kanai2003 (Contributor) commented Nov 28, 2025

…bution graph for low distinct counts

Fixes/Implements

Description

Summary Goes here.

Type of change

Delete irrelevant options.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Locally Tested
  • Needs Testing From Production

Summary by CodeRabbit

  • Bug Fixes

    • More accurate distinct-value counts for JSON/JSONB columns, improving metric correctness.
  • New Features

    • Column distribution graphs for low-cardinality columns to show value-frequency patterns.
    • Distribution generation is non-blocking — failures are logged but don’t stop other metrics.
    • Metrics output now separates general metrics from distribution data for clearer results.



coderabbitai bot commented Nov 28, 2025

Walkthrough

Adds JSON/JSONB-aware DISTINCT handling and per-column distribution_graph generation in table metrics: casts JSON/JSONB to text for distinct/grouping, computes COUNT(DISTINCT ...) using the cast expression, runs small-cardinality distributions, and returns metrics split into general and distribution sections.
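In code terms, the change amounts to something like the sketch below. The helper name build_column_exprs and the hard-coded table name are illustrative; only distinct_expr, the ::text cast, and the distinct-count-below-20 gate come from the PR itself.

def build_column_exprs(col_name: str, dtype: str, quote_column) -> tuple:
    quoted = quote_column(col_name)
    # json/jsonb cannot be fed directly to COUNT(DISTINCT ...), so cast to text
    distinct_expr = f"{quoted}::text" if dtype in ("json", "jsonb") else quoted
    count_distinct = f"COUNT(DISTINCT {distinct_expr})"
    # columns with fewer than 20 distinct values also get a distribution query
    distribution_query = (
        f"SELECT {distinct_expr}, COUNT(*) FROM some_table "
        f"GROUP BY {distinct_expr} ORDER BY COUNT(*) DESC"
    )
    return count_distinct, distribution_query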

Changes

Cohort / File(s): PostgreSQL metrics enhancement — dcs_core/integrations/databases/postgres.py

Summary: Updated build_table_metrics_query: use a per-column distinct_expr (casts json/jsonb to ::text) for COUNT(DISTINCT ...) and duplicate calculations; add a distribution query for columns with distinct count < 20 (producing distribution_graph as a list of {col_val, count}); log failures and continue; post-process metrics into general_data and distribution_data. Public API unchanged.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Focus areas:
    • SQL expression construction and proper quoting/casting for distinct_expr and group_expr
    • Correctness of distribution query, grouping, ordering, and normalization into distribution_graph
    • Performance and edge cases for running additional distribution queries per column
    • Error handling where failures are logged to stdout and processing continues
    • Verification of final metrics shape (nulls, missing distribution entries)

Possibly related PRs

Suggested labels

Review effort 3/5

Suggested reviewers

  • rishav2803
  • Ryuk-me

Poem

🐰 I hop through rows where nested values hide,
I cast JSON to text and count with pride,
For tiny groups I map each little chart,
Metrics sorted neat — a rabbit's counting art,
Crunching numbers, snacks, and joy inside.

Pre-merge checks

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning: The pull request description is incomplete and contains placeholder text. The 'Fixes/Implements' section lacks an issue number, the 'Description' section contains only the placeholder 'Summary Goes here.', and all type-of-change and testing checkboxes are unchecked without clarification. Resolution: complete the description with a clear summary, link the issue number in 'Fixes/Implements #', select the appropriate type of change, and confirm testing status.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed: The pull request title clearly and concisely summarizes the main change: adding distribution graph functionality for columns with distinct counts below 20.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e2f9cf0 and d8c7045.

📒 Files selected for processing (1)
  • dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py

442-443: Possible SQL injection vector through string-based query construction

(S608)


461-461: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (1)
dcs_core/integrations/databases/postgres.py (1)

334-343: JSON/JSONB-aware distinct and duplicate metrics look correct.

Using distinct_expr with ::text for JSON/JSONB columns and reusing that in both COUNT(DISTINCT …) and the duplicate metric keeps DISTINCT/duplicate semantics aligned with the distribution logic below. No issues from a correctness perspective here.



@Kanai2003 Kanai2003 changed the title from "feat: Add distribution graph below 20 distinct count" to "feat: Add distribution graph below 20 distinct count [#DCS-1244]" on Nov 28, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dcs_core/integrations/databases/postgres.py (1)

380-380: Fix the f-string formatting error causing pipeline failure.

The f-string expression contains a backslash inside {",\n ".join(...)}, which is not allowed and is causing Black to fail.

Apply this diff to fix the formatting error:

-        query = f'SELECT\n    {",\n    ".join(query_parts)}\nFROM {qualified_table};'
+        separator = ",\n    "
+        query = f'SELECT\n    {separator.join(query_parts)}\nFROM {qualified_table};'
🧹 Nitpick comments (1)
dcs_core/integrations/databases/postgres.py (1)

433-433: Add a default value to the next() call for safety.

If the column is not found in column_info, this will raise StopIteration. While unlikely in practice, adding a default value would make the code more defensive.

-            dtype = next(c["data_type"].lower() for c in column_info if c["column_name"] == col_name)
+            dtype = next((c["data_type"].lower() for c in column_info if c["column_name"] == col_name), None)
+            if dtype is None:
+                continue
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5f26f7b and d859779.

📒 Files selected for processing (1)
  • dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (5)
dcs_core/integrations/databases/oracle.py (1)
  • quote_column (83-89)
dcs_core/integrations/databases/mysql.py (1)
  • quote_column (87-93)
dcs_core/integrations/databases/mssql.py (3)
  • quote_column (123-129)
  • fetchall (107-108)
  • _normalize_metrics (759-775)
dcs_core/integrations/databases/sybase.py (2)
  • quote_column (210-216)
  • fetchall (194-195)
dcs_core/core/datasource/sql_datasource.py (2)
  • quote_column (151-157)
  • fetchall (120-123)
🪛 GitHub Actions: CI
dcs_core/integrations/databases/postgres.py

[error] 380-380: Black formatting failed on this file (pre-commit hook): cannot format dcs_core/integrations/databases/postgres.py: cannot use --safe with this file; failed to parse source file AST: f-string expression part cannot include a backslash (line 380). This may be caused by running Black with an older Python version.

🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py

445-446: Possible SQL injection vector through string-based query construction

(S608)


463-464: try-except-pass detected, consider logging the exception

(S110)


463-463: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (2)
dcs_core/integrations/databases/postgres.py (2)

334-342: JSON/JSONB handling looks correct.

Casting JSON/JSONB columns to text for DISTINCT operations is the correct approach in PostgreSQL, as JSON values cannot be directly compared without conversion.
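Strictly speaking, json has no equality operator at all (jsonb does support =), so COUNT(DISTINCT ...) on a raw json column errors out, while casting both types to text keeps the semantics uniform. A minimal SQLAlchemy illustration — the connection URL, table, and column names are placeholders, not from this repo:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/db")  # placeholder URL
with engine.connect() as conn:
    # For a json-typed column this raises "could not identify an equality
    # operator for type json":
    #   conn.execute(text("SELECT COUNT(DISTINCT payload) FROM events"))
    # The cast used in this PR sidesteps it:
    distinct = conn.execute(
        text("SELECT COUNT(DISTINCT payload::text) FROM events")
    ).scalar()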


343-349: Metric calculations correctly use the JSON-aware distinct expression.

The updated queries properly apply the distinct_expr for both distinct counts and duplicate calculations.

Comment on lines 449 to 464
                try:
                    dist_result = self.connection.execute(text(dist_query)).fetchall()

                    # ---- UPDATED FORMAT ----
                    distribution = []
                    for r in dist_result:
                        val = _normalize_metrics(r[0])
                        distribution.append( {
                            "col_val": val,
                            "count": r[1],
                        })

                    metrics["distribution_graph"] = distribution

                except Exception:
                    pass

⚠️ Potential issue | 🟡 Minor

Add logging or use a more specific exception handler.

The silent exception handling swallows all errors, making it difficult to debug issues with distribution graph generation. While it's reasonable to not fail the entire metrics collection if one distribution fails, consider at least logging the error.

Apply this diff to add basic error logging:

+                import logging
+                logger = logging.getLogger(__name__)
+
                 try:
                     dist_result = self.connection.execute(text(dist_query)).fetchall()
 
                     # ---- UPDATED FORMAT ----
                     distribution = []
                     for r in dist_result:
                         val = _normalize_metrics(r[0])
                         distribution.append( {
                             "col_val": val,
                             "count": r[1],
                         })
 
                     metrics["distribution_graph"] = distribution
 
-                except Exception:
-                    pass
+                except Exception as e:
+                    logger.debug(f"Failed to generate distribution graph for column {col_name}: {e}")

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.6)

463-464: try-except-pass detected, consider logging the exception

(S110)


463-463: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In dcs_core/integrations/databases/postgres.py around lines 449 to 464, the
except block silently swallows all exceptions when building the
distribution_graph; change it to catch Exception as e and log the error instead
of passing (e.g., use self.logger.exception(...) if a logger exists on the class
or the module-level logging.exception(...)) so the error and stack trace are
recorded while still allowing metrics collection to continue.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
dcs_core/integrations/databases/postgres.py (1)

467-470: Replace print() with proper logging.

Using print() for error output is not suitable for production. This concern was raised in a previous review.

Apply this diff to use logging:

+import logging
+
+logger = logging.getLogger(__name__)
+
                 except Exception as e:
-                    print(
-                        f"Failed to generate distribution graph for column {col_name}: {e}"
-                    )
+                    logger.warning(
+                        "Failed to generate distribution graph for column %s: %s",
+                        col_name,
+                        e,
+                    )
🧹 Nitpick comments (2)
dcs_core/integrations/databases/postgres.py (2)

431-435: Consider adding a default value to next() for defensive coding.

If col_name is somehow not found in column_info, this will raise StopIteration. While unlikely since both come from the same source, a default would add robustness.

             dtype = next(
-                c["data_type"].lower()
-                for c in column_info
-                if c["column_name"] == col_name
+                (c["data_type"].lower()
+                 for c in column_info
+                 if c["column_name"] == col_name),
+                None
             )
+            if dtype is None:
+                continue

437-449: Performance consideration: Multiple queries for distribution graphs.

For tables with many low-cardinality columns, this executes a separate GROUP BY query per column. On large tables, this could add latency. Consider batching or documenting the trade-off.

If performance becomes an issue, columns could be batched using UNION ALL:

SELECT 'col_a' AS col, col_a::text AS val, COUNT(*) FROM t GROUP BY col_a
UNION ALL
SELECT 'col_b' AS col, col_b::text AS val, COUNT(*) FROM t GROUP BY col_b
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d859779 and a11a16f.

📒 Files selected for processing (1)
  • dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (5)
dcs_core/integrations/databases/mysql.py (1)
  • quote_column (87-93)
dcs_core/integrations/databases/oracle.py (1)
  • quote_column (83-89)
dcs_core/core/datasource/sql_datasource.py (2)
  • quote_column (151-157)
  • fetchall (120-123)
dcs_core/integrations/databases/mssql.py (2)
  • quote_column (123-129)
  • fetchall (107-108)
dcs_core/integrations/databases/sybase.py (2)
  • quote_column (210-216)
  • fetchall (194-195)
🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py

447-448: Possible SQL injection vector through string-based query construction

(S608)


467-467: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (2)
dcs_core/integrations/databases/postgres.py (2)

334-341: LGTM! Correct handling of JSON/JSONB for DISTINCT operations.

PostgreSQL doesn't support direct equality comparison for JSON/JSONB types, so casting to ::text is the appropriate solution for DISTINCT counting.


343-350: LGTM! Consistent use of distinct_expr for DISTINCT-based metrics.

The NULL check correctly continues to use the original quoted column since IS NULL works directly on JSON/JSONB types.

Comment on lines +446 to +449
dist_query = (
f"SELECT {group_expr}, COUNT(*) "
f"FROM {qualified_table} GROUP BY {group_expr} ORDER BY COUNT(*) DESC"
)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's locate and examine the postgres.py file around lines 446-449
cat -n dcs_core/integrations/databases/postgres.py | sed -n '430,470p'

Repository: datachecks/dcs-core

Length of output: 1865


🏁 Script executed:

#!/bin/bash
# Find the quote_column implementation in postgres.py
rg -n "def quote_column" dcs_core/integrations/databases/postgres.py -A 5

Repository: datachecks/dcs-core

Length of output: 260


🏁 Script executed:

#!/bin/bash
# Check for quote_column implementations across all datasources
rg -n "def quote_column" dcs_core/integrations/databases/ -A 3 --type py

Repository: datachecks/dcs-core

Length of output: 2268


🏁 Script executed:

#!/bin/bash
# Check for logging imports in postgres.py
rg -n "^import|^from" dcs_core/integrations/databases/postgres.py | head -20

Repository: datachecks/dcs-core

Length of output: 502


🏁 Script executed:

#!/bin/bash
# Find the function containing this code to understand column_info origin
rg -n "def.*distribution" dcs_core/integrations/databases/postgres.py -B 5 -A 40 | head -60

Repository: datachecks/dcs-core

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Get full quote_column implementation in postgres.py
sed -n '69,80p' dcs_core/integrations/databases/postgres.py

Repository: datachecks/dcs-core

Length of output: 393


🏁 Script executed:

#!/bin/bash
# Get quote_column implementations from other datasources to compare
echo "=== MySQL quote_column ===" && sed -n '87,95p' dcs_core/integrations/databases/mysql.py
echo ""
echo "=== Oracle quote_column ===" && sed -n '83,91p' dcs_core/integrations/databases/oracle.py

Repository: datachecks/dcs-core

Length of output: 605


🏁 Script executed:

#!/bin/bash
# Search for the function containing the distribution graph code
rg -n "def.*" dcs_core/integrations/databases/postgres.py | rg -B 5 -A 5 "430:"

Repository: datachecks/dcs-core

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Find what function contains line 430
awk '/^[[:space:]]*(def|class) / { func=$0; line=NR } NR==430 { print func " at line " line }' dcs_core/integrations/databases/postgres.py

Repository: datachecks/dcs-core

Length of output: 232


🏁 Script executed:

#!/bin/bash
# Better approach: find function by looking backwards from line 430
sed -n '1,430p' dcs_core/integrations/databases/postgres.py | tail -50 | grep -n "def "  # to see the enclosing function name

Repository: datachecks/dcs-core

Length of output: 116


🏁 Script executed:

#!/bin/bash
# Find the function containing line 430 by searching backwards
sed -n '400,430p' dcs_core/integrations/databases/postgres.py | grep -E "^\s*(def|class) " | tail -1

Repository: datachecks/dcs-core

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Alternative: look at more context before line 430
sed -n '350,435p' dcs_core/integrations/databases/postgres.py | head -100

Repository: datachecks/dcs-core

Length of output: 3081


🏁 Script executed:

#!/bin/bash
# Search for where column_info is used and its origin
rg -n "column_info" dcs_core/integrations/databases/postgres.py -B 3 -A 3 | head -40

Repository: datachecks/dcs-core

Length of output: 1403


🏁 Script executed:

#!/bin/bash
# Check if logging is used elsewhere in postgres.py for error handling
rg -n "logging\.|logger\." dcs_core/integrations/databases/postgres.py

Repository: datachecks/dcs-core

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Check for other exception handlers using print() in postgres.py
rg -n "except.*:|print\(" dcs_core/integrations/databases/postgres.py | grep -A 2 "except"

Repository: datachecks/dcs-core

Length of output: 280


🏁 Script executed:

#!/bin/bash
# Check how other datasources handle exceptions (logging vs print)
rg -n "except.*:|print\(|logging\." dcs_core/integrations/databases/mysql.py | head -20

Repository: datachecks/dcs-core

Length of output: 131


🏁 Script executed:

#!/bin/bash
# Check if logging is imported/used in the base SQLDataSource class
rg -n "import logging|logger" dcs_core/core/datasource/sql_datasource.py | head -10

Repository: datachecks/dcs-core

Length of output: 477


🏁 Script executed:

#!/bin/bash
# Check what the exception handlers in mysql.py do
sed -n '72,80p' dcs_core/integrations/databases/mysql.py
sed -n '423,430p' dcs_core/integrations/databases/mysql.py

Repository: datachecks/dcs-core

Length of output: 647


Use loguru logger instead of print() for exception handling; SQL injection risk is low but inherent to quote_column pattern.

The distribution graph query at lines 446-449 has two concerns:

  1. Exception handling (lines 467-470): Uses print() instead of the logging pattern established in the base SQLDataSource class, which imports and uses loguru.logger. Other datasources like mysql.py use logger.error() correctly. Replace with logger.error() for consistency.

  2. SQL injection via unescaped quotes: The quote_column() method returns f'"{column}"' without escaping embedded double-quotes. While col_name originates from database metadata (trusted source), PostgreSQL allows embedded quotes in identifier names. This is a pre-existing pattern across all datasources (mysql uses backticks, oracle uses double quotes—none escape). The risk is low given the trusted data source, but if hardening is desired, escape embedded quotes: f'"{column.replace(chr(34), chr(34)+chr(34))}"'.
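A sketch of the hardening suggested in point 2 — this is not the current dcs-core code, just the doubled-quote escaping applied to the pattern quote_column already uses:

def quote_column(column: str) -> str:
    # doubling embedded double-quotes makes unusual identifier names safe
    escaped = column.replace('"', '""')
    return f'"{escaped}"'

assert quote_column('weird"name') == '"weird""name"'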

🧰 Tools
🪛 Ruff (0.14.6)

447-448: Possible SQL injection vector through string-based query construction

(S608)

🤖 Prompt for AI Agents
In dcs_core/integrations/databases/postgres.py around lines 446-470, replace the
use of print() in the exception handler with the module's loguru logger
(logger.error(...)) and include the exception info in the log call to match the
SQLDataSource logging pattern; additionally, harden quote_column to escape
embedded double-quotes by doubling them (e.g., replace " with "" before wrapping
in quotes) so identifiers containing quotes are safely quoted.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
dcs_core/integrations/databases/postgres.py (1)

467-470: Replace print() with logger for consistency.

The codebase uses loguru.logger (imported in the base SQLDataSource class) for error handling, but this exception handler uses print(). This inconsistency was flagged in a previous review but remains unresolved.

Apply this diff to use the logger:

+                from loguru import logger
+
                 except Exception as e:
-                    print(
-                        f"Failed to generate distribution graph for column {col_name}: {e}"
-                    )
+                    logger.error(
+                        f"Failed to generate distribution graph for column {col_name}: {e}"
+                    )

Alternatively, if a logger instance is already available in the class, use self.logger instead.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a11a16f and 7350c04.

📒 Files selected for processing (1)
  • dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (4)
dcs_core/integrations/databases/mysql.py (1)
  • quote_column (87-93)
dcs_core/core/datasource/sql_datasource.py (2)
  • quote_column (151-157)
  • fetchall (120-123)
dcs_core/integrations/databases/oracle.py (1)
  • quote_column (83-89)
dcs_core/integrations/databases/sybase.py (2)
  • quote_column (210-216)
  • fetchall (194-195)
🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py

447-448: Possible SQL injection vector through string-based query construction

(S608)


467-467: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (4)
dcs_core/integrations/databases/postgres.py (4)

334-341: LGTM! JSON/JSONB handling for DISTINCT is correct.

The cast to ::text for JSON/JSONB types is necessary since PostgreSQL cannot directly use DISTINCT on JSON columns. The implementation is clean and well-commented.


344-346: LGTM! Correct use of distinct_expr for DISTINCT operations.

The metrics correctly use distinct_expr which handles JSON/JSONB casting, while the NULL check at line 349 appropriately uses the original quoted column without casting.


437-437: Confirm the distinct count threshold is intentional.

The condition distinct_count < 20 generates distribution graphs for columns with 0-19 distinct values. Verify this threshold aligns with the intended behavior (the PR title says "below 20" which matches, but worth confirming 20 exactly should be excluded).

If the intention is to include 20, change to:

-            if isinstance(distinct_count, (int, float)) and distinct_count < 20:
+            if isinstance(distinct_count, (int, float)) and distinct_count <= 20:

472-483: I'll help you verify the metrics structure change and its impact on consumers. Let me start by exploring the codebase to understand the scope of this change.

🏁 Script executed:

#!/bin/bash
# Find the method definition and understand the context
rg -n "def build_table_metrics_query" --type py -A 5 -B 2

🏁 Script executed:

#!/bin/bash
# Search for all calls to build_table_metrics_query
rg -n "build_table_metrics_query" --type py -C 3

🏁 Script executed:

#!/bin/bash
# Check how the metrics structure is accessed in the codebase
rg -n 'metrics\s*\[' --type py -C 2 | head -80

Comment on lines +431 to 435

            dtype = next(
                c["data_type"].lower()
                for c in column_info
                if c["column_name"] == col_name
            )

⚠️ Potential issue | 🟡 Minor

Add a default to prevent StopIteration exception.

The next() call without a default value will raise StopIteration if col_name is not found in column_info. While this scenario should not occur under normal circumstances, defensive coding warrants a fallback.

Apply this diff to add a default:

-            dtype = next(
-                c["data_type"].lower()
-                for c in column_info
-                if c["column_name"] == col_name
-            )
+            dtype = next(
+                (c["data_type"].lower()
+                for c in column_info
+                if c["column_name"] == col_name),
+                None
+            )
+            
+            if dtype is None:
+                continue  # Skip if column info not found
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

            dtype = next(
                c["data_type"].lower()
                for c in column_info
                if c["column_name"] == col_name
            )

            dtype = next(
                (c["data_type"].lower()
                 for c in column_info
                 if c["column_name"] == col_name),
                None
            )
            if dtype is None:
                continue  # Skip if column info not found
🤖 Prompt for AI Agents
In dcs_core/integrations/databases/postgres.py around lines 431 to 435, the
next() call used to find dtype will raise StopIteration if col_name is missing;
change the call to provide a default (e.g., next(..., None) or next(...,
"unknown")) and then handle that default: either raise a clear ValueError with
context if dtype is None/unknown, or set dtype to a safe fallback and proceed.
Ensure the chosen fallback is consistent with downstream expectations and
include a brief log/error message when the default branch is taken.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
dcs_core/integrations/databases/postgres.py (2)

436-436: Consider making the threshold configurable.

The hard-coded threshold of 20 for generating distribution graphs might benefit from being a configurable parameter (e.g., class attribute or method parameter), allowing callers to adjust based on their specific use cases.
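One possible shape for this suggestion; the attribute and method names below are illustrative, not part of the PR:

class PostgresDataSource:
    DISTRIBUTION_GRAPH_MAX_DISTINCT = 20  # class-level default, overridable

    def should_build_distribution(self, distinct_count) -> bool:
        return (
            isinstance(distinct_count, (int, float))
            and distinct_count < self.DISTRIBUTION_GRAPH_MAX_DISTINCT
        )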


426-481: Consider combining the three iterations for better performance.

The code iterates through column_wise three times:

  1. Building initial metrics (lines 415-424)
  2. Adding distribution graphs (lines 426-469)
  3. Formatting output (lines 470-481)

These could be combined into a single pass for better performance, especially for tables with many columns.

Example refactoring approach:

column_wise = []
for col in column_info:
    name = col["column_name"]
    dtype = col["data_type"].lower()
    col_metrics = {}
    
    # Collect metrics
    for key, value in row.items():
        if key.startswith(f"{name}_"):
            metric_name = key[len(name) + 1:]
            col_metrics[metric_name] = _normalize_metrics(value)
    
    # Add distribution graph if applicable
    distinct_count = col_metrics.get("distinct")
    if isinstance(distinct_count, (int, float)) and distinct_count < 20:
        quoted = self.quote_column(name)
        group_expr = f"{quoted}::text" if dtype in ("json", "jsonb") else quoted
        dist_query = (
            f"SELECT {group_expr}, COUNT(*) "
            f"FROM {qualified_table} GROUP BY {group_expr} ORDER BY COUNT(*) DESC"
        )
        try:
            dist_result = self.connection.execute(text(dist_query)).fetchall()
            distribution = [
                {"col_val": _normalize_metrics(r[0]), "count": r[1]}
                for r in dist_result
            ]
            col_metrics["distribution_graph"] = distribution
        except Exception as e:
            logger.error(f"Failed to generate distribution graph for column {name}: {e}")
    
    # Format immediately
    formatted_metrics = {
        "general_data": {k: v for k, v in col_metrics.items() if k != "distribution_graph"},
        "distribution_data": col_metrics.get("distribution_graph", []),
    }
    
    column_wise.append({"column_name": name, "metrics": formatted_metrics})

Note: This suggestion assumes the past review comments about logging and exception handling are addressed first.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7350c04 and 13ffe95.

📒 Files selected for processing (1)
  • dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (3)
dcs_core/integrations/databases/mysql.py (1)
  • quote_column (87-93)
dcs_core/integrations/databases/oracle.py (1)
  • quote_column (83-89)
dcs_core/core/datasource/sql_datasource.py (1)
  • quote_column (151-157)
🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py

446-447: Possible SQL injection vector through string-based query construction

(S608)


465-465: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (2)
dcs_core/integrations/databases/postgres.py (2)

334-342: LGTM! Correct handling of JSON/JSONB for DISTINCT operations.

The ::text cast for JSON/JSONB columns is the appropriate solution for PostgreSQL, which doesn't support direct DISTINCT comparisons on these types. This approach is correctly reused for both DISTINCT counting and GROUP BY operations later in the code.


344-347: LGTM! Consistent use of distinct_expr.

The DISTINCT and duplicate calculations now correctly reference distinct_expr, ensuring JSON/JSONB columns are handled appropriately. The NULL check at line 349 correctly continues to use the original column reference.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
dcs_core/integrations/databases/postgres.py (1)

447-466: Use the project logger instead of print and avoid a fully blind Exception catch.

Catching Exception and printing to stdout makes failures in distribution graph generation hard to trace and bypasses the central logging pipeline. This was already raised on a previous revision.

Consider:

  • Catching a narrower class of errors if possible (e.g., DB/SQLAlchemy-related), and
  • Logging via the same logger used elsewhere in your datasources (e.g., logger.exception(f"Failed to generate distribution graph for column {col_name}")) instead of print, so you keep stack traces and consistent log formatting.

This still keeps metrics collection resilient when a distribution query fails but makes debugging much easier.
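A sketch of what that could look like as a drop-in for the try/except shown earlier — self, text, dist_query, col_name, and _normalize_metrics come from the surrounding method; the loguru logger and SQLAlchemy's base error class are the assumed imports:

from loguru import logger
from sqlalchemy.exc import SQLAlchemyError

try:
    dist_result = self.connection.execute(text(dist_query)).fetchall()
    metrics["distribution_graph"] = [
        {"col_val": _normalize_metrics(r[0]), "count": r[1]} for r in dist_result
    ]
except SQLAlchemyError:
    # metrics collection stays resilient, but the stack trace is recorded
    logger.exception(f"Failed to generate distribution graph for column {col_name}")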

🧹 Nitpick comments (3)
dcs_core/integrations/databases/postgres.py (3)

380-409: Deduplicate _normalize_metrics with the shared helper in sql_datasource.

This nested _normalize_metrics implementation appears identical to the helper already defined in dcs_core/core/datasource/sql_datasource.py. Duplicating this logic in multiple places risks future drift if one copy is updated and the other isn’t.

Consider reusing the shared helper instead of redefining it here, e.g. by importing it from sql_datasource and removing this inner function. That also avoids recreating the function on every call to build_table_metrics_query.
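If that helper is (or is made) module-level in that file, the deduplication could be as small as the following hypothetical import — it only works if _normalize_metrics is actually defined at module scope there:

from dcs_core.core.datasource.sql_datasource import _normalize_metrics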


423-445: Distribution graph logic is sound; consider minor hardening and cost implications.

The distinct_count < 20 gate is a reasonable way to limit distribution graphs to low-cardinality columns, and the JSON/JSONB ::text handling keeps grouping consistent. Two small points:

  • The dist_query is built via f-strings using group_expr and qualified_table. Since both ultimately come from database metadata rather than user input, SQL injection risk is low, but if you want to harden further, you could make quote_column escape embedded double quotes (e.g., replacing " with "") so odd identifier names can’t break the query.
  • For very large tables, an extra GROUP BY per eligible column can still be expensive even with low distinct counts; if this ever shows up in hot paths, you might want a configurable limit (max rows scanned, max columns with distribution, or a toggle) to keep runtime predictable.

Neither is blocking, but worth considering for robustness and operability.


467-477: Restructuring metrics under general_data / distribution_data may break existing consumers.

Previously, col_data["metrics"] was a flat dict (e.g., metrics["distinct"], metrics["duplicate"], etc.). After this change it becomes:

{
    "general_data": { ...all metrics except distribution_graph... },
    "distribution_data": [...distribution_graph or []...],
}

This is a nicer shape for callers that know about it, but any existing code that accesses metrics["distinct"] or similar will now fail.

Please verify all call sites that consume the output of build_table_metrics_query and either update them or consider keeping backward compatibility (e.g., by leaving metrics flat and adding an additional, separate distribution_data key at the top level).
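One backward-compatible shape, sketched from the suggestion above (key names beyond metrics and distribution_graph are illustrative):

distribution = metrics.pop("distribution_graph", [])
col_entry = {
    "column_name": col_name,
    "metrics": metrics,                 # still flat: metrics["distinct"], etc.
    "distribution_data": distribution,  # additive top-level key
}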

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 13ffe95 and e2f9cf0.

📒 Files selected for processing (1)
  • dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (2)
dcs_core/integrations/databases/mysql.py (1)
  • quote_column (87-93)
dcs_core/core/datasource/sql_datasource.py (1)
  • quote_column (151-157)
🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py

443-444: Possible SQL injection vector through string-based query construction

(S608)


462-462: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (1)
dcs_core/integrations/databases/postgres.py (1)

331-344: JSON/JSONB-aware DISTINCT handling looks good.

Using quoted::text for JSON/JSONB and reusing distinct_expr for both distinct and duplicate counts is a clean way to make these types groupable without changing the external metric names. No issues from a correctness standpoint here.

@AnshumanX304 AnshumanX304 merged commit 98f63c0 into main Nov 28, 2025
5 of 7 checks passed