feat: Add distribution graph below 20 distinct count [#DCS-1244] #352
Conversation
…bution graph for low distinct counts
Walkthrough

Adds JSON/JSONB-aware DISTINCT handling and per-column distribution_graph generation in table metrics: casts JSON/JSONB to text for distinct/grouping, computes COUNT(DISTINCT ...) using the cast expression, runs small-cardinality distributions, and returns metrics split into general and distribution sections.
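As a rough sketch of the per-column shape that split implies (the specific metric keys and sample values here are illustrative, inferred from the review discussion, not copied from the code):

```python
# Illustrative only: one entry of the table-metrics result after this PR.
# The general_data/distribution_data split is the new part; "col_val" and
# "count" come from the diff, the general metric names are assumptions.
column_entry = {
    "column_name": "status",
    "metrics": {
        "general_data": {"distinct": 3, "duplicate": 120, "null": 0},
        "distribution_data": [
            {"col_val": "active", "count": 90},
            {"col_val": "inactive", "count": 25},
            {"col_val": "pending", "count": 8},
        ],
    },
}
```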
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 1
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dcs_core/integrations/databases/postgres.py (1)
380-380: Fix the f-string formatting error causing the pipeline failure.

The f-string expression contains a backslash inside `{",\n    ".join(...)}`, which is not allowed and is causing Black to fail. Apply this diff to fix the formatting error:

```diff
- query = f'SELECT\n    {",\n    ".join(query_parts)}\nFROM {qualified_table};'
+ separator = ",\n    "
+ query = f'SELECT\n    {separator.join(query_parts)}\nFROM {qualified_table};'
```
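For readers unfamiliar with the underlying limitation, here is a standalone sketch of the failure and the workaround; the sample query_parts values are invented. On Python versions before 3.12, a backslash anywhere inside an f-string expression is a syntax error, which is what trips Black's AST parse:

```python
query_parts = ["COUNT(*) AS row_count", 'COUNT(DISTINCT "id") AS id_distinct']
qualified_table = '"public"."users"'

# This form fails to parse on Python < 3.12 (backslash in the expression):
#   query = f'SELECT\n    {",\n    ".join(query_parts)}\nFROM {qualified_table};'

# Hoisting the separator out of the f-string expression sidesteps the limit.
separator = ",\n    "
query = f"SELECT\n    {separator.join(query_parts)}\nFROM {qualified_table};"
print(query)
```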
🧹 Nitpick comments (1)
dcs_core/integrations/databases/postgres.py (1)
433-433: Add a default value to the next() call for safety.

If the column is not found in column_info, this will raise StopIteration. While unlikely in practice, adding a default value would make the code more defensive.

```diff
- dtype = next(c["data_type"].lower() for c in column_info if c["column_name"] == col_name)
+ dtype = next((c["data_type"].lower() for c in column_info if c["column_name"] == col_name), None)
+ if dtype is None:
+     continue
```
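A minimal standalone illustration of the difference (the data is made up):

```python
column_info = [{"column_name": "id", "data_type": "INTEGER"}]

# Without a default, next() raises StopIteration on a miss:
#   next(c["data_type"].lower() for c in column_info
#        if c["column_name"] == "missing")

# With a default, the miss becomes an explicit None the caller can handle:
dtype = next(
    (c["data_type"].lower() for c in column_info
     if c["column_name"] == "missing"),
    None,
)
print(dtype)  # None
```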
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (5)
- dcs_core/integrations/databases/oracle.py (1): quote_column (83-89)
- dcs_core/integrations/databases/mysql.py (1): quote_column (87-93)
- dcs_core/integrations/databases/mssql.py (3): quote_column (123-129), fetchall (107-108), _normalize_metrics (759-775)
- dcs_core/integrations/databases/sybase.py (2): quote_column (210-216), fetchall (194-195)
- dcs_core/core/datasource/sql_datasource.py (2): quote_column (151-157), fetchall (120-123)
🪛 GitHub Actions: CI
dcs_core/integrations/databases/postgres.py
[error] 380-380: Black formatting failed on this file (pre-commit hook): cannot format dcs_core/integrations/databases/postgres.py: cannot use --safe with this file; failed to parse source file AST: f-string expression part cannot include a backslash (line 380). This may be caused by running Black with an older Python version.
🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py
445-446: Possible SQL injection vector through string-based query construction
(S608)
463-464: try-except-pass detected, consider logging the exception
(S110)
463-463: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (2)
dcs_core/integrations/databases/postgres.py (2)
334-342: JSON/JSONB handling looks correct. Casting JSON/JSONB columns to text for DISTINCT operations is the correct approach in PostgreSQL, as JSON values cannot be directly compared without conversion.
343-349: Metric calculations correctly use the JSON-aware distinct expression. The updated queries properly apply the distinct_expr for both distinct counts and duplicate calculations.
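For context, a minimal sketch of the pattern being approved (the helper name and table/column names are invented for illustration):

```python
def distinct_expr_for(quoted_column: str, data_type: str) -> str:
    # PostgreSQL's json type has no equality operator, so DISTINCT/GROUP BY
    # need a text cast; casting jsonb too keeps both types uniform.
    if data_type in ("json", "jsonb"):
        return f"{quoted_column}::text"
    return quoted_column


expr = distinct_expr_for('"payload"', "jsonb")
print(f'SELECT COUNT(DISTINCT {expr}) FROM "public"."events";')
```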
```python
try:
    dist_result = self.connection.execute(text(dist_query)).fetchall()

    # ---- UPDATED FORMAT ----
    distribution = []
    for r in dist_result:
        val = _normalize_metrics(r[0])
        distribution.append({
            "col_val": val,
            "count": r[1],
        })

    metrics["distribution_graph"] = distribution

except Exception:
    pass
```
Add logging or use a more specific exception handler.
The silent exception handling swallows all errors, making it difficult to debug issues with distribution graph generation. While it's reasonable to not fail the entire metrics collection if one distribution fails, consider at least logging the error.
Apply this diff to add basic error logging:
```diff
+import logging
+logger = logging.getLogger(__name__)
+
 try:
     dist_result = self.connection.execute(text(dist_query)).fetchall()

     # ---- UPDATED FORMAT ----
     distribution = []
     for r in dist_result:
         val = _normalize_metrics(r[0])
         distribution.append({
             "col_val": val,
             "count": r[1],
         })

     metrics["distribution_graph"] = distribution
-except Exception:
-    pass
+except Exception as e:
+    logger.debug(f"Failed to generate distribution graph for column {col_name}: {e}")
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.6)
463-464: try-except-pass detected, consider logging the exception
(S110)
463-463: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In dcs_core/integrations/databases/postgres.py around lines 449 to 464, the
except block silently swallows all exceptions when building the
distribution_graph; change it to catch Exception as e and log the error instead
of passing (e.g., use self.logger.exception(...) if a logger exists on the class
or the module-level logging.exception(...)) so the error and stack trace are
recorded while still allowing metrics collection to continue.
…on logging in postgres integration.
Actionable comments posted: 1
♻️ Duplicate comments (1)
dcs_core/integrations/databases/postgres.py (1)
467-470: Replace print() with proper logging.

Using print() for error output is not suitable for production. This concern was raised in a previous review. Apply this diff to use logging:

```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 except Exception as e:
-    print(
-        f"Failed to generate distribution graph for column {col_name}: {e}"
-    )
+    logger.warning(
+        "Failed to generate distribution graph for column %s: %s",
+        col_name,
+        e,
+    )
```
🧹 Nitpick comments (2)
dcs_core/integrations/databases/postgres.py (2)
431-435: Consider adding a default value to next() for defensive coding.

If col_name is somehow not found in column_info, this will raise StopIteration. While unlikely since both come from the same source, a default would add robustness.

```diff
 dtype = next(
-    c["data_type"].lower()
-    for c in column_info
-    if c["column_name"] == col_name
+    (c["data_type"].lower()
+     for c in column_info
+     if c["column_name"] == col_name),
+    None
 )
+if dtype is None:
+    continue
```
437-449: Performance consideration: multiple queries for distribution graphs.

For tables with many low-cardinality columns, this executes a separate GROUP BY query per column. On large tables, this could add latency. Consider batching or documenting the trade-off.

If performance becomes an issue, columns could be batched using UNION ALL:

```sql
SELECT 'col_a' AS col, col_a::text AS val, COUNT(*) FROM t GROUP BY col_a
UNION ALL
SELECT 'col_b' AS col, col_b::text AS val, COUNT(*) FROM t GROUP BY col_b
```
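If batching were pursued, one hypothetical way to assemble that combined query in Python (the helper name and bare identifier handling are illustrative; real code should reuse quote_column):

```python
def build_batched_distribution_query(qualified_table: str, columns: list[str]) -> str:
    # One UNION ALL branch per low-cardinality column, so a single round
    # trip returns every distribution; the column name travels in "col".
    branches = [
        f"SELECT '{col}' AS col, {col}::text AS val, COUNT(*) AS cnt "
        f"FROM {qualified_table} GROUP BY {col}"
        for col in columns
    ]
    return " UNION ALL ".join(branches)


print(build_batched_distribution_query('"public"."t"', ["col_a", "col_b"]))
```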
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (5)
- dcs_core/integrations/databases/mysql.py (1): quote_column (87-93)
- dcs_core/integrations/databases/oracle.py (1): quote_column (83-89)
- dcs_core/core/datasource/sql_datasource.py (2): quote_column (151-157), fetchall (120-123)
- dcs_core/integrations/databases/mssql.py (2): quote_column (123-129), fetchall (107-108)
- dcs_core/integrations/databases/sybase.py (2): quote_column (210-216), fetchall (194-195)
🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py
447-448: Possible SQL injection vector through string-based query construction
(S608)
467-467: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (2)
dcs_core/integrations/databases/postgres.py (2)
334-341: LGTM! Correct handling of JSON/JSONB for DISTINCT operations. PostgreSQL doesn't support direct equality comparison for JSON/JSONB types, so casting to ::text is the appropriate solution for DISTINCT counting.

343-350: LGTM! Consistent use of distinct_expr for DISTINCT-based metrics. The NULL check correctly continues to use the original quoted column since IS NULL works directly on JSON/JSONB types.
```python
dist_query = (
    f"SELECT {group_expr}, COUNT(*) "
    f"FROM {qualified_table} GROUP BY {group_expr} ORDER BY COUNT(*) DESC"
)
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# First, let's locate and examine the postgres.py file around lines 446-449
cat -n dcs_core/integrations/databases/postgres.py | sed -n '430,470p'
```

Repository: datachecks/dcs-core
Length of output: 1865

🏁 Script executed:

```bash
#!/bin/bash
# Find the quote_column implementation in postgres.py
rg -n "def quote_column" dcs_core/integrations/databases/postgres.py -A 5
```

Repository: datachecks/dcs-core
Length of output: 260

🏁 Script executed:

```bash
#!/bin/bash
# Check for quote_column implementations across all datasources
rg -n "def quote_column" dcs_core/integrations/databases/ -A 3 --type py
```

Repository: datachecks/dcs-core
Length of output: 2268

🏁 Script executed:

```bash
#!/bin/bash
# Check for logging imports in postgres.py
rg -n "^import|^from" dcs_core/integrations/databases/postgres.py | head -20
```

Repository: datachecks/dcs-core
Length of output: 502

🏁 Script executed:

```bash
#!/bin/bash
# Find the function containing this code to understand column_info origin
rg -n "def.*distribution" dcs_core/integrations/databases/postgres.py -B 5 -A 40 | head -60
```

Repository: datachecks/dcs-core
Length of output: 45

🏁 Script executed:

```bash
#!/bin/bash
# Get full quote_column implementation in postgres.py
sed -n '69,80p' dcs_core/integrations/databases/postgres.py
```

Repository: datachecks/dcs-core
Length of output: 393
🏁 Script executed:

```bash
#!/bin/bash
# Get quote_column implementations from other datasources to compare
echo "=== MySQL quote_column ===" && sed -n '87,95p' dcs_core/integrations/databases/mysql.py
echo ""
echo "=== Oracle quote_column ===" && sed -n '83,91p' dcs_core/integrations/databases/oracle.py
```

Repository: datachecks/dcs-core
Length of output: 605

🏁 Script executed:

```bash
#!/bin/bash
# Search for the function containing the distribution graph code
rg -n "def.*" dcs_core/integrations/databases/postgres.py | rg -B 5 -A 5 "430:"
```

Repository: datachecks/dcs-core
Length of output: 45

🏁 Script executed:

```bash
#!/bin/bash
# Find what function contains line 430
awk '/^[[:space:]]*(def|class) / { func=$0; line=NR } NR==430 { print func " at line " line }' dcs_core/integrations/databases/postgres.py
```

Repository: datachecks/dcs-core
Length of output: 232

🏁 Script executed:

```bash
#!/bin/bash
# Better approach: find function by looking backwards from line 430
sed -n '1,430p' dcs_core/integrations/databases/postgres.py | tail -50 | grep -n "def "
```

Repository: datachecks/dcs-core
Length of output: 116

🏁 Script executed:

```bash
#!/bin/bash
# Find the function containing line 430 by searching backwards
sed -n '400,430p' dcs_core/integrations/databases/postgres.py | grep -E "^\s*(def|class) " | tail -1
```

Repository: datachecks/dcs-core
Length of output: 45

🏁 Script executed:

```bash
#!/bin/bash
# Alternative: look at more context before line 430
sed -n '350,435p' dcs_core/integrations/databases/postgres.py | head -100
```

Repository: datachecks/dcs-core
Length of output: 3081
🏁 Script executed:

```bash
#!/bin/bash
# Search for where column_info is used and its origin
rg -n "column_info" dcs_core/integrations/databases/postgres.py -B 3 -A 3 | head -40
```

Repository: datachecks/dcs-core
Length of output: 1403

🏁 Script executed:

```bash
#!/bin/bash
# Check if logging is used elsewhere in postgres.py for error handling
rg -n "logging\.|logger\." dcs_core/integrations/databases/postgres.py
```

Repository: datachecks/dcs-core
Length of output: 45

🏁 Script executed:

```bash
#!/bin/bash
# Check for other exception handlers using print() in postgres.py
rg -n "except.*:|print\(" dcs_core/integrations/databases/postgres.py | grep -A 2 "except"
```

Repository: datachecks/dcs-core
Length of output: 280

🏁 Script executed:

```bash
#!/bin/bash
# Check how other datasources handle exceptions (logging vs print)
rg -n "except.*:|print\(|logging\." dcs_core/integrations/databases/mysql.py | head -20
```

Repository: datachecks/dcs-core
Length of output: 131

🏁 Script executed:

```bash
#!/bin/bash
# Check if logging is imported/used in the base SQLDataSource class
rg -n "import logging|logger" dcs_core/core/datasource/sql_datasource.py | head -10
```

Repository: datachecks/dcs-core
Length of output: 477

🏁 Script executed:

```bash
#!/bin/bash
# Check what the exception handlers in mysql.py do
sed -n '72,80p' dcs_core/integrations/databases/mysql.py
sed -n '423,430p' dcs_core/integrations/databases/mysql.py
```

Repository: datachecks/dcs-core
Length of output: 647
Use loguru logger instead of print() for exception handling; SQL injection risk is low but inherent to quote_column pattern.
The distribution graph query at lines 446-449 has two concerns:
- Exception handling (lines 467-470): Uses print() instead of the logging pattern established in the base SQLDataSource class, which imports and uses loguru.logger. Other datasources like mysql.py use logger.error() correctly. Replace with logger.error() for consistency.
- SQL injection via unescaped quotes: The quote_column() method returns f'"{column}"' without escaping embedded double quotes. While col_name originates from database metadata (a trusted source), PostgreSQL allows embedded quotes in identifier names. This is a pre-existing pattern across all datasources (mysql uses backticks, oracle uses double quotes; none escape). The risk is low given the trusted data source, but if hardening is desired, escape embedded quotes: f'"{column.replace(chr(34), chr(34)+chr(34))}"'. A sketch follows below.
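A minimal sketch of that hardening, written as a free function for illustration (the real quote_column is a datasource method):

```python
def quote_column(column: str) -> str:
    # Per PostgreSQL identifier rules, an embedded double quote is escaped
    # by doubling it, so odd names cannot terminate the quoted identifier.
    return '"' + column.replace('"', '""') + '"'


print(quote_column('weird"name'))  # -> "weird""name"
```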
🧰 Tools
🪛 Ruff (0.14.6)
447-448: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
In dcs_core/integrations/databases/postgres.py around lines 446-470, replace the
use of print() in the exception handler with the module's loguru logger
(logger.error(...)) and include the exception info in the log call to match the
SQLDataSource logging pattern; additionally, harden quote_column to escape
embedded double-quotes by doubling them (e.g., replace " with "" before wrapping
in quotes) so identifiers containing quotes are safely quoted.
…ion graph handling
Actionable comments posted: 1
♻️ Duplicate comments (1)
dcs_core/integrations/databases/postgres.py (1)
467-470: Replace print() with logger for consistency.

The codebase uses loguru.logger (imported in the base SQLDataSource class) for error handling, but this exception handler uses print(). This inconsistency was flagged in a previous review but remains unresolved.

Apply this diff to use the logger:

```diff
+from loguru import logger
+
 except Exception as e:
-    print(
-        f"Failed to generate distribution graph for column {col_name}: {e}"
-    )
+    logger.error(
+        f"Failed to generate distribution graph for column {col_name}: {e}"
+    )
```

Alternatively, if a logger instance is already available in the class, use self.logger instead.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (4)
- dcs_core/integrations/databases/mysql.py (1): quote_column (87-93)
- dcs_core/core/datasource/sql_datasource.py (2): quote_column (151-157), fetchall (120-123)
- dcs_core/integrations/databases/oracle.py (1): quote_column (83-89)
- dcs_core/integrations/databases/sybase.py (2): quote_column (210-216), fetchall (194-195)
🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py
447-448: Possible SQL injection vector through string-based query construction
(S608)
467-467: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (4)
dcs_core/integrations/databases/postgres.py (4)
334-341: LGTM! JSON/JSONB handling for DISTINCT is correct. The cast to ::text for JSON/JSONB types is necessary since PostgreSQL cannot directly use DISTINCT on JSON columns. The implementation is clean and well-commented.

344-346: LGTM! Correct use of distinct_expr for DISTINCT operations. The metrics correctly use distinct_expr, which handles JSON/JSONB casting, while the NULL check at line 349 appropriately uses the original quoted column without casting.
437-437: Confirm the distinct count threshold is intentional.

The condition distinct_count < 20 generates distribution graphs for columns with 0-19 distinct values. Verify this threshold aligns with the intended behavior (the PR title says "below 20", which matches, but it is worth confirming that exactly 20 should be excluded).

If the intention is to include 20, change to:

```diff
- if isinstance(distinct_count, (int, float)) and distinct_count < 20:
+ if isinstance(distinct_count, (int, float)) and distinct_count <= 20:
```
472-483: Verify the metrics structure change and its impact on consumers.

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Find the method definition and understand the context
rg -n "def build_table_metrics_query" --type py -A 5 -B 2
```

🏁 Script executed:

```bash
#!/bin/bash
# Search for all calls to build_table_metrics_query
rg -n "build_table_metrics_query" --type py -C 3
```

🏁 Script executed:

```bash
#!/bin/bash
# Check the metrics structure accesses in the codebase
rg -n 'metrics\s*\[' --type py -C 2 | head -80
```
```python
dtype = next(
    c["data_type"].lower()
    for c in column_info
    if c["column_name"] == col_name
)
```
Add a default to prevent a StopIteration exception.
The next() call without a default value will raise StopIteration if col_name is not found in column_info. While this scenario should not occur under normal circumstances, defensive coding warrants a fallback.
Apply this diff to add a default:

```diff
- dtype = next(
-     c["data_type"].lower()
-     for c in column_info
-     if c["column_name"] == col_name
- )
+ dtype = next(
+     (c["data_type"].lower()
+      for c in column_info
+      if c["column_name"] == col_name),
+     None
+ )
+
+ if dtype is None:
+     continue  # Skip if column info not found
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
dtype = next(
    (c["data_type"].lower()
     for c in column_info
     if c["column_name"] == col_name),
    None
)

if dtype is None:
    continue  # Skip if column info not found
```
🤖 Prompt for AI Agents
In dcs_core/integrations/databases/postgres.py around lines 431 to 435, the
next() call used to find dtype will raise StopIteration if col_name is missing;
change the call to provide a default (e.g., next(..., None) or next(...,
"unknown")) and then handle that default: either raise a clear ValueError with
context if dtype is None/unknown, or set dtype to a safe fallback and proceed.
Ensure the chosen fallback is consistent with downstream expectations and
include a brief log/error message when the default branch is taken.
Actionable comments posted: 0
🧹 Nitpick comments (2)
dcs_core/integrations/databases/postgres.py (2)
436-436: Consider making the threshold configurable.

The hard-coded threshold of 20 for generating distribution graphs might benefit from being a configurable parameter (e.g., a class attribute or method parameter), allowing callers to adjust based on their specific use cases.
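One hypothetical shape for that knob (the class and attribute names here are invented, not the project's actual API):

```python
class PostgresDataSourceSketch:
    # Hypothetical class-level setting; the real code keeps the literal 20
    # inline. Subclasses or callers could override this attribute.
    DISTRIBUTION_GRAPH_MAX_DISTINCT: int = 20

    def should_build_distribution(self, distinct_count) -> bool:
        return (
            isinstance(distinct_count, (int, float))
            and distinct_count < self.DISTRIBUTION_GRAPH_MAX_DISTINCT
        )


print(PostgresDataSourceSketch().should_build_distribution(19))  # True
```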
426-481: Consider combining the three iterations for better performance.

The code iterates through column_wise three times:
- Building initial metrics (lines 415-424)
- Adding distribution graphs (lines 426-469)
- Formatting output (lines 470-481)
These could be combined into a single pass for better performance, especially for tables with many columns.
Example refactoring approach:
```python
column_wise = []
for col in column_info:
    name = col["column_name"]
    dtype = col["data_type"].lower()
    col_metrics = {}

    # Collect metrics
    for key, value in row.items():
        if key.startswith(f"{name}_"):
            metric_name = key[len(name) + 1:]
            col_metrics[metric_name] = _normalize_metrics(value)

    # Add distribution graph if applicable
    distinct_count = col_metrics.get("distinct")
    if isinstance(distinct_count, (int, float)) and distinct_count < 20:
        quoted = self.quote_column(name)
        group_expr = f"{quoted}::text" if dtype in ("json", "jsonb") else quoted
        dist_query = (
            f"SELECT {group_expr}, COUNT(*) "
            f"FROM {qualified_table} GROUP BY {group_expr} ORDER BY COUNT(*) DESC"
        )
        try:
            dist_result = self.connection.execute(text(dist_query)).fetchall()
            distribution = [
                {"col_val": _normalize_metrics(r[0]), "count": r[1]}
                for r in dist_result
            ]
            col_metrics["distribution_graph"] = distribution
        except Exception as e:
            logger.error(f"Failed to generate distribution graph for column {name}: {e}")

    # Format immediately
    formatted_metrics = {
        "general_data": {k: v for k, v in col_metrics.items() if k != "distribution_graph"},
        "distribution_data": col_metrics.get("distribution_graph", []),
    }
    column_wise.append({"column_name": name, "metrics": formatted_metrics})
```

Note: This suggestion assumes the past review comments about logging and exception handling are addressed first.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (3)
- dcs_core/integrations/databases/mysql.py (1): quote_column (87-93)
- dcs_core/integrations/databases/oracle.py (1): quote_column (83-89)
- dcs_core/core/datasource/sql_datasource.py (1): quote_column (151-157)
🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py
446-447: Possible SQL injection vector through string-based query construction
(S608)
465-465: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (2)
dcs_core/integrations/databases/postgres.py (2)
334-342: LGTM! Correct handling of JSON/JSONB for DISTINCT operations. The ::text cast for JSON/JSONB columns is the appropriate solution for PostgreSQL, which doesn't support direct DISTINCT comparisons on these types. This approach is correctly reused for both DISTINCT counting and GROUP BY operations later in the code.

344-347: LGTM! Consistent use of distinct_expr. The DISTINCT and duplicate calculations now correctly reference distinct_expr, ensuring JSON/JSONB columns are handled appropriately. The NULL check at line 349 correctly continues to use the original column reference.
Actionable comments posted: 0
♻️ Duplicate comments (1)
dcs_core/integrations/databases/postgres.py (1)
447-466: Use the project logger instead of a bare Exception catch.

Catching Exception and printing to stdout makes failures in distribution graph generation hard to trace and bypasses the central logging pipeline. This was already raised on a previous revision.

Consider:

- Catching a narrower class of errors if possible (e.g., DB/SQLAlchemy-related), and
- Logging via the same logger used elsewhere in your datasources (e.g., logger.exception(f"Failed to generate distribution graph for column {col_name}")) instead of print(); a sketch follows this list.

This still keeps metrics collection resilient when a distribution query fails but makes debugging much easier.
🧹 Nitpick comments (3)
dcs_core/integrations/databases/postgres.py (3)
380-409: Deduplicate _normalize_metrics with the shared helper in sql_datasource.

This nested _normalize_metrics implementation appears identical to the helper already defined in dcs_core/core/datasource/sql_datasource.py. Duplicating this logic in multiple places risks future drift if one copy is updated and the other isn't.

Consider reusing the shared helper instead of redefining it here, e.g. by importing it from sql_datasource and removing this inner function. That also avoids recreating the function on every call to build_table_metrics_query.
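If the shared helper is (or is made) importable at module level, the dedup could reduce to a single import; whether sql_datasource actually exports it this way is an assumption:

```python
# Hypothetical import path based on the review; assumes _normalize_metrics
# is a module-level helper in sql_datasource rather than a nested function.
from dcs_core.core.datasource.sql_datasource import _normalize_metrics

print(_normalize_metrics(3.14159))
```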
423-445: Distribution graph logic is sound; consider minor hardening and cost implications.

The distinct_count < 20 gate is a reasonable way to limit distribution graphs to low-cardinality columns, and the JSON/JSONB ::text handling keeps grouping consistent. Two small points:

- The dist_query is built via f-strings using group_expr and qualified_table. Since both ultimately come from database metadata rather than user input, SQL injection risk is low, but if you want to harden further, you could make quote_column escape embedded double quotes (e.g., replacing " with "") so odd identifier names can't break the query.
- For very large tables, an extra GROUP BY per eligible column can still be expensive even with low distinct counts; if this ever shows up in hot paths, you might want a configurable limit (max rows scanned, max columns with distribution, or a toggle) to keep runtime predictable.

Neither is blocking, but both are worth considering for robustness and operability.
467-477: Restructuring metrics under general_data/distribution_data may break existing consumers.

Previously, col_data["metrics"] was a flat dict (e.g., metrics["distinct"], metrics["duplicate"], etc.). After this change it becomes:

```python
{
    "general_data": { ...all metrics except distribution_graph... },
    "distribution_data": [...],  # distribution_graph or []
}
```

This is a nicer shape for callers that know about it, but any existing code that accesses metrics["distinct"] or similar will now fail.

Please verify all call sites that consume the output of build_table_metrics_query and either update them or consider keeping backward compatibility (e.g., by leaving metrics flat and adding an additional, separate distribution_data key at the top level).
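Sketched with invented data, the backward-compatible alternative mentioned above could look like this:

```python
# Keep the flat metrics dict that existing consumers index into, and add
# distribution_data alongside it rather than nesting under general_data.
col_metrics = {"distinct": 3, "duplicate": 120, "null": 0}
distribution = [{"col_val": "active", "count": 90}]

entry = {
    "column_name": "status",
    "metrics": col_metrics,             # old callers: metrics["distinct"] still works
    "distribution_data": distribution,  # new callers read the graph here
}
print(entry["metrics"]["distinct"], len(entry["distribution_data"]))
```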
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dcs_core/integrations/databases/postgres.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dcs_core/integrations/databases/postgres.py (2)
- dcs_core/integrations/databases/mysql.py (1): quote_column (87-93)
- dcs_core/core/datasource/sql_datasource.py (1): quote_column (151-157)
🪛 Ruff (0.14.6)
dcs_core/integrations/databases/postgres.py
443-444: Possible SQL injection vector through string-based query construction
(S608)
462-462: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (1)
dcs_core/integrations/databases/postgres.py (1)
331-344: JSON/JSONB-aware DISTINCT handling looks good. Using quoted::text for JSON/JSONB and reusing distinct_expr for both distinct and duplicate counts is a clean way to make these types groupable without changing the external metric names. No issues from a correctness standpoint here.
…bution graph for low distinct counts
Fixes/Implements
Description
Summary Goes here.
Type of change
Delete irrelevant options.
How Has This Been Tested?
Summary by CodeRabbit

Bug Fixes
- DISTINCT and duplicate metrics now handle JSON/JSONB columns by casting them to text before comparison.

New Features
- Per-column distribution graphs are generated for columns with fewer than 20 distinct values and returned alongside the general metrics.