feat: Add distribution graph below 20 distinct count [#DCS-1244] #352
Conversation
…bution graph for low distinct counts
Walkthrough

Adds JSON/JSONB-aware DISTINCT handling and per-column distribution_graph generation in table metrics: casts JSON/JSONB to text for distinct/grouping, computes COUNT(DISTINCT ...) using the cast expression, runs small-cardinality distributions, and returns metrics split into general and distribution sections.
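As a rough sketch of the per-column shape that split implies (the specific metric keys and sample values here are illustrative, inferred from the review discussion, not copied from the code):

```python
# Illustrative only: one entry of the table-metrics result after this PR.
# The general_data/distribution_data split is the new part; "col_val" and
# "count" come from the diff, the general metric names are assumptions.
column_entry = {
    "column_name": "status",
    "metrics": {
        "general_data": {"distinct": 3, "duplicate": 120, "null": 0},
        "distribution_data": [
            {"col_val": "active", "count": 90},
            {"col_val": "inactive", "count": 25},
            {"col_val": "pending", "count": 8},
        ],
    },
}
```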
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 1
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dcs_core/integrations/databases/postgres.py (1)
380-380: Fix the f-string formatting error causing the pipeline failure.

The f-string expression contains a backslash inside `{",\n    ".join(...)}`, which is not allowed and is causing Black to fail. Apply this diff to fix the formatting error:

```diff
- query = f'SELECT\n    {",\n    ".join(query_parts)}\nFROM {qualified_table};'
+ separator = ",\n    "
+ query = f'SELECT\n    {separator.join(query_parts)}\nFROM {qualified_table};'
```
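For readers unfamiliar with the underlying limitation, here is a standalone sketch of the failure and the workaround; the sample query_parts values are invented. On Python versions before 3.12, a backslash anywhere inside an f-string expression is a syntax error, which is what trips Black's AST parse:

```python
query_parts = ["COUNT(*) AS row_count", 'COUNT(DISTINCT "id") AS id_distinct']
qualified_table = '"public"."users"'

# This form fails to parse on Python < 3.12 (backslash in the expression):
#   query = f'SELECT\n    {",\n    ".join(query_parts)}\nFROM {qualified_table};'

# Hoisting the separator out of the f-string expression sidesteps the limit.
separator = ",\n    "
query = f"SELECT\n    {separator.join(query_parts)}\nFROM {qualified_table};"
print(query)
```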
🧹 Nitpick comments (1)
dcs_core/integrations/databases/postgres.py (1)
433-433: Add a default value to the next() call for safety.

If the column is not found in column_info, this will raise StopIteration. While unlikely in practice, adding a default value would make the code more defensive.

```diff
- dtype = next(c["data_type"].lower() for c in column_info if c["column_name"] == col_name)
+ dtype = next((c["data_type"].lower() for c in column_info if c["column_name"] == col_name), None)
+ if dtype is None:
+     continue
```
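A minimal standalone illustration of the difference (the data is made up):

```python
column_info = [{"column_name": "id", "data_type": "INTEGER"}]

# Without a default, next() raises StopIteration on a miss:
#   next(c["data_type"].lower() for c in column_info
#        if c["column_name"] == "missing")

# With a default, the miss becomes an explicit None the caller can handle:
dtype = next(
    (c["data_type"].lower() for c in column_info
     if c["column_name"] == "missing"),
    None,
)
print(dtype)  # None
```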
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (5)
- dcs_core/integrations/databases/oracle.py (1): quote_column (83-89)
- dcs_core/integrations/databases/mysql.py (1): quote_column (87-93)
- dcs_core/integrations/databases/mssql.py (3): quote_column (123-129), fetchall (107-108), _normalize_metrics (759-775)
- dcs_core/integrations/databases/sybase.py (2): quote_column (210-216), fetchall (194-195)
- dcs_core/core/datasource/sql_datasource.py (2): quote_column (151-157), fetchall (120-123)
🪛 GitHub Actions: CI
dcs_core/integrations/databases/postgres.py
[error] 380-380: Black formatting failed on this file (pre-commit hook): cannot format dcs_core/integrations/databases/postgres.py: cannot use --safe with this file; failed to parse source file AST: f-string expression part cannot include a backslash (line 380). This may be caused by running Black with an older Python version.
🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py
445-446: Possible SQL injection vector through string-based query construction
(S608)
463-464: try-except-pass detected, consider logging the exception
(S110)
463-463: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (2)
dcs_core/integrations/databases/postgres.py (2)
334-342: JSON/JSONB handling looks correct. Casting JSON/JSONB columns to text for DISTINCT operations is the correct approach in PostgreSQL, as JSON values cannot be directly compared without conversion.
343-349: Metric calculations correctly use the JSON-aware distinct expression. The updated queries properly apply the distinct_expr for both distinct counts and duplicate calculations.
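For context, a minimal sketch of the pattern being approved (the helper name and table/column names are invented for illustration):

```python
def distinct_expr_for(quoted_column: str, data_type: str) -> str:
    # PostgreSQL's json type has no equality operator, so DISTINCT/GROUP BY
    # need a text cast; casting jsonb too keeps both types uniform.
    if data_type in ("json", "jsonb"):
        return f"{quoted_column}::text"
    return quoted_column


expr = distinct_expr_for('"payload"', "jsonb")
print(f'SELECT COUNT(DISTINCT {expr}) FROM "public"."events";')
```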
```python
try:
    dist_result = self.connection.execute(text(dist_query)).fetchall()

    # ---- UPDATED FORMAT ----
    distribution = []
    for r in dist_result:
        val = _normalize_metrics(r[0])
        distribution.append({
            "col_val": val,
            "count": r[1],
        })

    metrics["distribution_graph"] = distribution

except Exception:
    pass
```
Add logging or use a more specific exception handler.
The silent exception handling swallows all errors, making it difficult to debug issues with distribution graph generation. While it's reasonable to not fail the entire metrics collection if one distribution fails, consider at least logging the error.
Apply this diff to add basic error logging:
```diff
+import logging
+logger = logging.getLogger(__name__)
+
 try:
     dist_result = self.connection.execute(text(dist_query)).fetchall()

     # ---- UPDATED FORMAT ----
     distribution = []
     for r in dist_result:
         val = _normalize_metrics(r[0])
         distribution.append({
             "col_val": val,
             "count": r[1],
         })

     metrics["distribution_graph"] = distribution
-except Exception:
-    pass
+except Exception as e:
+    logger.debug(f"Failed to generate distribution graph for column {col_name}: {e}")
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.6)
463-464: try-except-pass detected, consider logging the exception
(S110)
463-463: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In dcs_core/integrations/databases/postgres.py around lines 449 to 464, the
except block silently swallows all exceptions when building the
distribution_graph; change it to catch Exception as e and log the error instead
of passing (e.g., use self.logger.exception(...) if a logger exists on the class
or the module-level logging.exception(...)) so the error and stack trace are
recorded while still allowing metrics collection to continue.
…on logging in postgres integration.
Actionable comments posted: 1
♻️ Duplicate comments (1)
dcs_core/integrations/databases/postgres.py (1)
467-470: Replace print() with proper logging.

Using print() for error output is not suitable for production. This concern was raised in a previous review. Apply this diff to use logging:

```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 except Exception as e:
-    print(
-        f"Failed to generate distribution graph for column {col_name}: {e}"
-    )
+    logger.warning(
+        "Failed to generate distribution graph for column %s: %s",
+        col_name,
+        e,
+    )
```
🧹 Nitpick comments (2)
dcs_core/integrations/databases/postgres.py (2)
431-435: Consider adding a default value to next() for defensive coding.

If col_name is somehow not found in column_info, this will raise StopIteration. While unlikely since both come from the same source, a default would add robustness.

```diff
 dtype = next(
-    c["data_type"].lower()
-    for c in column_info
-    if c["column_name"] == col_name
+    (c["data_type"].lower()
+     for c in column_info
+     if c["column_name"] == col_name),
+    None
 )
+if dtype is None:
+    continue
```
437-449: Performance consideration: multiple queries for distribution graphs.

For tables with many low-cardinality columns, this executes a separate GROUP BY query per column. On large tables, this could add latency. Consider batching or documenting the trade-off.

If performance becomes an issue, columns could be batched using UNION ALL:

```sql
SELECT 'col_a' AS col, col_a::text AS val, COUNT(*) FROM t GROUP BY col_a
UNION ALL
SELECT 'col_b' AS col, col_b::text AS val, COUNT(*) FROM t GROUP BY col_b
```
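If batching were pursued, one hypothetical way to assemble that combined query in Python (the helper name and bare identifier handling are illustrative; real code should reuse quote_column):

```python
def build_batched_distribution_query(qualified_table: str, columns: list[str]) -> str:
    # One UNION ALL branch per low-cardinality column, so a single round
    # trip returns every distribution; the column name travels in "col".
    branches = [
        f"SELECT '{col}' AS col, {col}::text AS val, COUNT(*) AS cnt "
        f"FROM {qualified_table} GROUP BY {col}"
        for col in columns
    ]
    return " UNION ALL ".join(branches)


print(build_batched_distribution_query('"public"."t"', ["col_a", "col_b"]))
```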
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (5)
- dcs_core/integrations/databases/mysql.py (1): quote_column (87-93)
- dcs_core/integrations/databases/oracle.py (1): quote_column (83-89)
- dcs_core/core/datasource/sql_datasource.py (2): quote_column (151-157), fetchall (120-123)
- dcs_core/integrations/databases/mssql.py (2): quote_column (123-129), fetchall (107-108)
- dcs_core/integrations/databases/sybase.py (2): quote_column (210-216), fetchall (194-195)
🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py
447-448: Possible SQL injection vector through string-based query construction
(S608)
467-467: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (2)
dcs_core/integrations/databases/postgres.py (2)
334-341: LGTM! Correct handling of JSON/JSONB for DISTINCT operations. PostgreSQL doesn't support direct equality comparison for JSON/JSONB types, so casting to ::text is the appropriate solution for DISTINCT counting.

343-350: LGTM! Consistent use of distinct_expr for DISTINCT-based metrics. The NULL check correctly continues to use the original quoted column since IS NULL works directly on JSON/JSONB types.
```python
dist_query = (
    f"SELECT {group_expr}, COUNT(*) "
    f"FROM {qualified_table} GROUP BY {group_expr} ORDER BY COUNT(*) DESC"
)
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# First, let's locate and examine the postgres.py file around lines 446-449
cat -n dcs_core/integrations/databases/postgres.py | sed -n '430,470p'
```

Repository: datachecks/dcs-core
Length of output: 1865

🏁 Script executed:

```bash
#!/bin/bash
# Find the quote_column implementation in postgres.py
rg -n "def quote_column" dcs_core/integrations/databases/postgres.py -A 5
```

Repository: datachecks/dcs-core
Length of output: 260

🏁 Script executed:

```bash
#!/bin/bash
# Check for quote_column implementations across all datasources
rg -n "def quote_column" dcs_core/integrations/databases/ -A 3 --type py
```

Repository: datachecks/dcs-core
Length of output: 2268

🏁 Script executed:

```bash
#!/bin/bash
# Check for logging imports in postgres.py
rg -n "^import|^from" dcs_core/integrations/databases/postgres.py | head -20
```

Repository: datachecks/dcs-core
Length of output: 502

🏁 Script executed:

```bash
#!/bin/bash
# Find the function containing this code to understand column_info origin
rg -n "def.*distribution" dcs_core/integrations/databases/postgres.py -B 5 -A 40 | head -60
```

Repository: datachecks/dcs-core
Length of output: 45

🏁 Script executed:

```bash
#!/bin/bash
# Get full quote_column implementation in postgres.py
sed -n '69,80p' dcs_core/integrations/databases/postgres.py
```

Repository: datachecks/dcs-core
Length of output: 393
🏁 Script executed:

```bash
#!/bin/bash
# Get quote_column implementations from other datasources to compare
echo "=== MySQL quote_column ===" && sed -n '87,95p' dcs_core/integrations/databases/mysql.py
echo ""
echo "=== Oracle quote_column ===" && sed -n '83,91p' dcs_core/integrations/databases/oracle.py
```

Repository: datachecks/dcs-core
Length of output: 605

🏁 Script executed:

```bash
#!/bin/bash
# Search for the function containing the distribution graph code
rg -n "def.*" dcs_core/integrations/databases/postgres.py | rg -B 5 -A 5 "430:"
```

Repository: datachecks/dcs-core
Length of output: 45

🏁 Script executed:

```bash
#!/bin/bash
# Find what function contains line 430
awk '/^[[:space:]]*(def|class) / { func=$0; line=NR } NR==430 { print func " at line " line }' dcs_core/integrations/databases/postgres.py
```

Repository: datachecks/dcs-core
Length of output: 232

🏁 Script executed:

```bash
#!/bin/bash
# Better approach: find function by looking backwards from line 430
sed -n '1,430p' dcs_core/integrations/databases/postgres.py | tail -50 | grep -n "def "
```

Repository: datachecks/dcs-core
Length of output: 116

🏁 Script executed:

```bash
#!/bin/bash
# Find the function containing line 430 by searching backwards
sed -n '400,430p' dcs_core/integrations/databases/postgres.py | grep -E "^\s*(def|class) " | tail -1
```

Repository: datachecks/dcs-core
Length of output: 45

🏁 Script executed:

```bash
#!/bin/bash
# Alternative: look at more context before line 430
sed -n '350,435p' dcs_core/integrations/databases/postgres.py | head -100
```

Repository: datachecks/dcs-core
Length of output: 3081
🏁 Script executed:

```bash
#!/bin/bash
# Search for where column_info is used and its origin
rg -n "column_info" dcs_core/integrations/databases/postgres.py -B 3 -A 3 | head -40
```

Repository: datachecks/dcs-core
Length of output: 1403

🏁 Script executed:

```bash
#!/bin/bash
# Check if logging is used elsewhere in postgres.py for error handling
rg -n "logging\.|logger\." dcs_core/integrations/databases/postgres.py
```

Repository: datachecks/dcs-core
Length of output: 45

🏁 Script executed:

```bash
#!/bin/bash
# Check for other exception handlers using print() in postgres.py
rg -n "except.*:|print\(" dcs_core/integrations/databases/postgres.py | grep -A 2 "except"
```

Repository: datachecks/dcs-core
Length of output: 280

🏁 Script executed:

```bash
#!/bin/bash
# Check how other datasources handle exceptions (logging vs print)
rg -n "except.*:|print\(|logging\." dcs_core/integrations/databases/mysql.py | head -20
```

Repository: datachecks/dcs-core
Length of output: 131

🏁 Script executed:

```bash
#!/bin/bash
# Check if logging is imported/used in the base SQLDataSource class
rg -n "import logging|logger" dcs_core/core/datasource/sql_datasource.py | head -10
```

Repository: datachecks/dcs-core
Length of output: 477

🏁 Script executed:

```bash
#!/bin/bash
# Check what the exception handlers in mysql.py do
sed -n '72,80p' dcs_core/integrations/databases/mysql.py
sed -n '423,430p' dcs_core/integrations/databases/mysql.py
```

Repository: datachecks/dcs-core
Length of output: 647
Use loguru logger instead of print() for exception handling; SQL injection risk is low but inherent to quote_column pattern.
The distribution graph query at lines 446-449 has two concerns:
- Exception handling (lines 467-470): Uses print() instead of the logging pattern established in the base SQLDataSource class, which imports and uses loguru.logger. Other datasources like mysql.py use logger.error() correctly. Replace with logger.error() for consistency.
- SQL injection via unescaped quotes: The quote_column() method returns f'"{column}"' without escaping embedded double quotes. While col_name originates from database metadata (a trusted source), PostgreSQL allows embedded quotes in identifier names. This is a pre-existing pattern across all datasources (mysql uses backticks, oracle uses double quotes; none escape). The risk is low given the trusted data source, but if hardening is desired, escape embedded quotes: f'"{column.replace(chr(34), chr(34)+chr(34))}"'. A sketch follows below.
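A minimal sketch of that hardening, written as a free function for illustration (the real quote_column is a datasource method):

```python
def quote_column(column: str) -> str:
    # Per PostgreSQL identifier rules, an embedded double quote is escaped
    # by doubling it, so odd names cannot terminate the quoted identifier.
    return '"' + column.replace('"', '""') + '"'


print(quote_column('weird"name'))  # -> "weird""name"
```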
🧰 Tools
🪛 Ruff (0.14.6)
447-448: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
In dcs_core/integrations/databases/postgres.py around lines 446-470, replace the
use of print() in the exception handler with the module's loguru logger
(logger.error(...)) and include the exception info in the log call to match the
SQLDataSource logging pattern; additionally, harden quote_column to escape
embedded double-quotes by doubling them (e.g., replace " with "" before wrapping
in quotes) so identifiers containing quotes are safely quoted.
…ion graph handling
Actionable comments posted: 1
♻️ Duplicate comments (1)
dcs_core/integrations/databases/postgres.py (1)
467-470: Replace print() with logger for consistency.

The codebase uses loguru.logger (imported in the base SQLDataSource class) for error handling, but this exception handler uses print(). This inconsistency was flagged in a previous review but remains unresolved.

Apply this diff to use the logger:

```diff
+from loguru import logger
+
 except Exception as e:
-    print(
-        f"Failed to generate distribution graph for column {col_name}: {e}"
-    )
+    logger.error(
+        f"Failed to generate distribution graph for column {col_name}: {e}"
+    )
```

Alternatively, if a logger instance is already available in the class, use self.logger instead.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (4)
- dcs_core/integrations/databases/mysql.py (1): quote_column (87-93)
- dcs_core/core/datasource/sql_datasource.py (2): quote_column (151-157), fetchall (120-123)
- dcs_core/integrations/databases/oracle.py (1): quote_column (83-89)
- dcs_core/integrations/databases/sybase.py (2): quote_column (210-216), fetchall (194-195)
🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py
447-448: Possible SQL injection vector through string-based query construction
(S608)
467-467: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (4)
dcs_core/integrations/databases/postgres.py (4)
334-341: LGTM! JSON/JSONB handling for DISTINCT is correct. The cast to ::text for JSON/JSONB types is necessary since PostgreSQL cannot directly use DISTINCT on JSON columns. The implementation is clean and well-commented.

344-346: LGTM! Correct use of distinct_expr for DISTINCT operations. The metrics correctly use distinct_expr, which handles JSON/JSONB casting, while the NULL check at line 349 appropriately uses the original quoted column without casting.
437-437: Confirm the distinct count threshold is intentional.

The condition distinct_count < 20 generates distribution graphs for columns with 0-19 distinct values. Verify this threshold aligns with the intended behavior (the PR title says "below 20", which matches, but it is worth confirming that exactly 20 should be excluded).

If the intention is to include 20, change to:

```diff
- if isinstance(distinct_count, (int, float)) and distinct_count < 20:
+ if isinstance(distinct_count, (int, float)) and distinct_count <= 20:
```
472-483: Verify the metrics structure change and its impact on consumers.

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Find the method definition and understand the context
rg -n "def build_table_metrics_query" --type py -A 5 -B 2
```

🏁 Script executed:

```bash
#!/bin/bash
# Search for all calls to build_table_metrics_query
rg -n "build_table_metrics_query" --type py -C 3
```

🏁 Script executed:

```bash
#!/bin/bash
# Check the metrics structure accesses in the codebase
rg -n 'metrics\s*\[' --type py -C 2 | head -80
```
```python
dtype = next(
    c["data_type"].lower()
    for c in column_info
    if c["column_name"] == col_name
)
```
Add a default to prevent a StopIteration exception.
The next() call without a default value will raise StopIteration if col_name is not found in column_info. While this scenario should not occur under normal circumstances, defensive coding warrants a fallback.
Apply this diff to add a default:

```diff
- dtype = next(
-     c["data_type"].lower()
-     for c in column_info
-     if c["column_name"] == col_name
- )
+ dtype = next(
+     (c["data_type"].lower()
+      for c in column_info
+      if c["column_name"] == col_name),
+     None
+ )
+
+ if dtype is None:
+     continue  # Skip if column info not found
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
dtype = next(
    (c["data_type"].lower()
     for c in column_info
     if c["column_name"] == col_name),
    None
)

if dtype is None:
    continue  # Skip if column info not found
```
🤖 Prompt for AI Agents
In dcs_core/integrations/databases/postgres.py around lines 431 to 435, the
next() call used to find dtype will raise StopIteration if col_name is missing;
change the call to provide a default (e.g., next(..., None) or next(...,
"unknown")) and then handle that default: either raise a clear ValueError with
context if dtype is None/unknown, or set dtype to a safe fallback and proceed.
Ensure the chosen fallback is consistent with downstream expectations and
include a brief log/error message when the default branch is taken.
Actionable comments posted: 0
🧹 Nitpick comments (2)
dcs_core/integrations/databases/postgres.py (2)
436-436: Consider making the threshold configurable.

The hard-coded threshold of 20 for generating distribution graphs might benefit from being a configurable parameter (e.g., a class attribute or method parameter), allowing callers to adjust based on their specific use cases.
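One hypothetical shape for that knob (the class and attribute names here are invented, not the project's actual API):

```python
class PostgresDataSourceSketch:
    # Hypothetical class-level setting; the real code keeps the literal 20
    # inline. Subclasses or callers could override this attribute.
    DISTRIBUTION_GRAPH_MAX_DISTINCT: int = 20

    def should_build_distribution(self, distinct_count) -> bool:
        return (
            isinstance(distinct_count, (int, float))
            and distinct_count < self.DISTRIBUTION_GRAPH_MAX_DISTINCT
        )


print(PostgresDataSourceSketch().should_build_distribution(19))  # True
```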
426-481: Consider combining the three iterations for better performance.

The code iterates through column_wise three times:
- Building initial metrics (lines 415-424)
- Adding distribution graphs (lines 426-469)
- Formatting output (lines 470-481)
These could be combined into a single pass for better performance, especially for tables with many columns.
Example refactoring approach:
```python
column_wise = []
for col in column_info:
    name = col["column_name"]
    dtype = col["data_type"].lower()
    col_metrics = {}

    # Collect metrics
    for key, value in row.items():
        if key.startswith(f"{name}_"):
            metric_name = key[len(name) + 1:]
            col_metrics[metric_name] = _normalize_metrics(value)

    # Add distribution graph if applicable
    distinct_count = col_metrics.get("distinct")
    if isinstance(distinct_count, (int, float)) and distinct_count < 20:
        quoted = self.quote_column(name)
        group_expr = f"{quoted}::text" if dtype in ("json", "jsonb") else quoted
        dist_query = (
            f"SELECT {group_expr}, COUNT(*) "
            f"FROM {qualified_table} GROUP BY {group_expr} ORDER BY COUNT(*) DESC"
        )
        try:
            dist_result = self.connection.execute(text(dist_query)).fetchall()
            distribution = [
                {"col_val": _normalize_metrics(r[0]), "count": r[1]}
                for r in dist_result
            ]
            col_metrics["distribution_graph"] = distribution
        except Exception as e:
            logger.error(f"Failed to generate distribution graph for column {name}: {e}")

    # Format immediately
    formatted_metrics = {
        "general_data": {k: v for k, v in col_metrics.items() if k != "distribution_graph"},
        "distribution_data": col_metrics.get("distribution_graph", []),
    }
    column_wise.append({"column_name": name, "metrics": formatted_metrics})
```

Note: This suggestion assumes the past review comments about logging and exception handling are addressed first.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (3)
- dcs_core/integrations/databases/mysql.py (1): quote_column (87-93)
- dcs_core/integrations/databases/oracle.py (1): quote_column (83-89)
- dcs_core/core/datasource/sql_datasource.py (1): quote_column (151-157)
🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py
446-447: Possible SQL injection vector through string-based query construction
(S608)
465-465: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (2)
dcs_core/integrations/databases/postgres.py (2)
334-342: LGTM! Correct handling of JSON/JSONB for DISTINCT operations. The ::text cast for JSON/JSONB columns is the appropriate solution for PostgreSQL, which doesn't support direct DISTINCT comparisons on these types. This approach is correctly reused for both DISTINCT counting and GROUP BY operations later in the code.

344-347: LGTM! Consistent use of distinct_expr. The DISTINCT and duplicate calculations now correctly reference distinct_expr, ensuring JSON/JSONB columns are handled appropriately. The NULL check at line 349 correctly continues to use the original column reference.
Actionable comments posted: 0
♻️ Duplicate comments (1)
dcs_core/integrations/databases/postgres.py (1)
447-466: Use the project logger instead of a bare Exception catch.

Catching Exception and printing to stdout makes failures in distribution graph generation hard to trace and bypasses the central logging pipeline. This was already raised on a previous revision.

Consider:

- Catching a narrower class of errors if possible (e.g., DB/SQLAlchemy-related), and
- Logging via the same logger used elsewhere in your datasources (e.g., logger.exception(f"Failed to generate distribution graph for column {col_name}")) instead of print(); a sketch follows this list.

This still keeps metrics collection resilient when a distribution query fails but makes debugging much easier.
🧹 Nitpick comments (3)
dcs_core/integrations/databases/postgres.py (3)
380-409: Deduplicate _normalize_metrics with the shared helper in sql_datasource.

This nested _normalize_metrics implementation appears identical to the helper already defined in dcs_core/core/datasource/sql_datasource.py. Duplicating this logic in multiple places risks future drift if one copy is updated and the other isn't.

Consider reusing the shared helper instead of redefining it here, e.g. by importing it from sql_datasource and removing this inner function. That also avoids recreating the function on every call to build_table_metrics_query.
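If the shared helper is (or is made) importable at module level, the dedup could reduce to a single import; whether sql_datasource actually exports it this way is an assumption:

```python
# Hypothetical import path based on the review; assumes _normalize_metrics
# is a module-level helper in sql_datasource rather than a nested function.
from dcs_core.core.datasource.sql_datasource import _normalize_metrics

print(_normalize_metrics(3.14159))
```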
423-445: Distribution graph logic is sound; consider minor hardening and cost implications.

The distinct_count < 20 gate is a reasonable way to limit distribution graphs to low-cardinality columns, and the JSON/JSONB ::text handling keeps grouping consistent. Two small points:

- The dist_query is built via f-strings using group_expr and qualified_table. Since both ultimately come from database metadata rather than user input, SQL injection risk is low, but if you want to harden further, you could make quote_column escape embedded double quotes (e.g., replacing " with "") so odd identifier names can't break the query.
- For very large tables, an extra GROUP BY per eligible column can still be expensive even with low distinct counts; if this ever shows up in hot paths, you might want a configurable limit (max rows scanned, max columns with distribution, or a toggle) to keep runtime predictable.

Neither is blocking, but both are worth considering for robustness and operability.
467-477: Restructuring metrics under general_data/distribution_data may break existing consumers.

Previously, col_data["metrics"] was a flat dict (e.g., metrics["distinct"], metrics["duplicate"], etc.). After this change it becomes:

```python
{
    "general_data": { ...all metrics except distribution_graph... },
    "distribution_data": [...],  # distribution_graph or []
}
```

This is a nicer shape for callers that know about it, but any existing code that accesses metrics["distinct"] or similar will now fail.

Please verify all call sites that consume the output of build_table_metrics_query and either update them or consider keeping backward compatibility (e.g., by leaving metrics flat and adding an additional, separate distribution_data key at the top level).
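Sketched with invented data, the backward-compatible alternative mentioned above could look like this:

```python
# Keep the flat metrics dict that existing consumers index into, and add
# distribution_data alongside it rather than nesting under general_data.
col_metrics = {"distinct": 3, "duplicate": 120, "null": 0}
distribution = [{"col_val": "active", "count": 90}]

entry = {
    "column_name": "status",
    "metrics": col_metrics,             # old callers: metrics["distinct"] still works
    "distribution_data": distribution,  # new callers read the graph here
}
print(entry["metrics"]["distinct"], len(entry["distribution_data"]))
```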
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (2)
- dcs_core/integrations/databases/mysql.py (1): quote_column (87-93)
- dcs_core/core/datasource/sql_datasource.py (1): quote_column (151-157)
🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py
443-444: Possible SQL injection vector through string-based query construction
(S608)
462-462: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (1)
dcs_core/integrations/databases/postgres.py (1)
331-344: JSON/JSONB-aware DISTINCT handling looks good. Using quoted::text for JSON/JSONB and reusing distinct_expr for both distinct and duplicate counts is a clean way to make these types groupable without changing the external metric names. No issues from a correctness standpoint here.
…bution graph for low distinct counts
Fixes/Implements
Description
Summary Goes here.
Type of change
Delete irrelevant options.
How Has This Been Tested?
Summary by CodeRabbit

Bug Fixes
- DISTINCT and duplicate metrics now handle JSON/JSONB columns by casting them to text before comparison.

New Features
- Per-column distribution graphs are generated for columns with fewer than 20 distinct values and returned alongside the general metrics.