feat: add build_table_metrics_query method for enhanced column metrics #340
Conversation
**Walkthrough**

Adds a new method to `PostgresDataSource` that builds and executes a single aggregated SQL query to compute per-column metrics for a table, then normalizes the single-row result into JSON-serializable Python values and returns a list of per-column dictionaries.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    actor Caller
    participant DS as PostgresDataSource
    participant PG as PostgreSQL
    Caller->>DS: build_table_metrics_query(table_name, column_info, additional_queries?)
    DS->>DS: Assemble per-column SQL fragments<br/>(distinct, duplicate, is_null,<br/>+ type-specific expressions)
    DS->>PG: Execute single aggregated SELECT
    PG-->>DS: Single-row aggregated result
    DS->>DS: Split row into per-column dicts<br/>Normalize Decimal/date/datetime/UUID/nested
    DS-->>Caller: return List[dict] of column metrics
    rect rgb(235,245,255)
        note right of DS: New: dynamic SQL assembly<br/>and JSON-safe normalization
    end
```
**Estimated code review effort:** 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 5
🧹 Nitpick comments (5)
dcs_core/integrations/databases/postgres.py (5)
**353-355: Avoid materializing the result; fetch one mapping row.**

Prefer `result.mappings().one()` (or `.first()` if an empty result is possible). Also aligns with the Ruff hint.

```diff
-        result = self.connection.execute(text(query))
-        row = dict(list(result)[0]._mapping)
+        result = self.connection.execute(text(query))
+        row = result.mappings().one()
```

Based on static analysis hints.
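The pattern can be shown with a small stdlib sketch (sqlite3 stands in for the PR's SQLAlchemy connection; the table and column names are made up): an aggregate `SELECT` with no `GROUP BY` returns exactly one row, so fetch it directly instead of listing the whole result.

```python
import sqlite3

# Illustrative only: sqlite3 in place of SQLAlchemy. An aggregate query
# returns a single row, so fetch it directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (2,)])
conn.row_factory = sqlite3.Row  # name-addressable rows, akin to Row._mapping

cur = conn.execute(
    'SELECT COUNT(DISTINCT x) AS "x_distinct", '
    'COUNT(*) - COUNT(DISTINCT x) AS "x_duplicate" FROM t'
)
row = dict(cur.fetchone())  # grab the single aggregate row
print(row)
```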
**332-351: Postgres type checks are incomplete/misaligned.**

- Numeric: include `real`, `double precision`. `int`, `float`, `double` aren't `information_schema.data_type` values.
- Text: include `character`, drop `string`.

```diff
-            if dtype in (
-                "int",
-                "integer",
-                "bigint",
-                "smallint",
-                "decimal",
-                "numeric",
-                "float",
-                "double",
-            ):
+            if dtype in (
+                "integer",
+                "bigint",
+                "smallint",
+                "decimal",
+                "numeric",
+                "real",
+                "double precision",
+            ):
                 query_parts.append(f'MIN("{name}") AS "{name}_min"')
                 query_parts.append(f'MAX("{name}") AS "{name}_max"')
                 query_parts.append(f'AVG("{name}") AS "{name}_average"')
-            elif dtype in ("varchar", "text", "char", "string", "character varying"):
+            elif dtype in ("text", "character", "character varying", "varchar", "char"):
                 query_parts.append(
                     f'MAX(CHAR_LENGTH("{name}")) AS "{name}_max_character_length"'
                 )
```
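A compact sketch of the corrected type dispatch, under the assumption that `dtype` comes from `information_schema.columns.data_type` as suggested above (the `metric_exprs` helper name is illustrative, not the PR's code):

```python
# Types as the review suggests information_schema reports them.
NUMERIC_TYPES = {
    "smallint", "integer", "bigint",
    "decimal", "numeric", "real", "double precision",
}
TEXT_TYPES = {"text", "character", "character varying", "varchar", "char"}

def metric_exprs(name, dtype):
    # Every column gets a distinct count; type-specific metrics follow.
    exprs = [f'COUNT(DISTINCT "{name}") AS "{name}_distinct"']
    if dtype in NUMERIC_TYPES:
        exprs += [
            f'MIN("{name}") AS "{name}_min"',
            f'MAX("{name}") AS "{name}_max"',
            f'AVG("{name}") AS "{name}_average"',
        ]
    elif dtype in TEXT_TYPES:
        exprs.append(f'MAX(CHAR_LENGTH("{name}")) AS "{name}_max_character_length"')
    return exprs

print(metric_exprs("price", "double precision"))
```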
**320-351: Use the `quote_column` helper consistently.**

You're manually quoting identifiers; prefer `self.quote_column(name)` to keep quoting logic centralized.

Example:

```diff
-            query_parts.append(f'COUNT(DISTINCT "{name}") AS "{name}_distinct"')
+            qcol = self.quote_column(name)
+            query_parts.append(f'COUNT(DISTINCT {qcol}) AS "{name}_distinct"')
```

Apply similarly for all occurrences.
**315-318: Method name vs. behavior mismatch.**

It both builds and executes the query. Consider renaming to `fetch_table_metrics` or updating the docstring to reflect execution.
**315-351: Operational note: heavy aggregates on wide/large tables.**

Multiple `COUNT(DISTINCT ...)` + `MIN`/`MAX`/`AVG` across many columns can be expensive. Consider:

- An optional sampling ratio (e.g., `TABLESAMPLE`) for exploratory metrics.
- A per-column feature flag to limit aggregates.
- Running in off-peak hours or with a statement timeout.
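The sampling idea could be sketched as follows; `TABLESAMPLE SYSTEM` is real Postgres syntax, but the `metrics_source` helper and its `sample_percent` parameter are hypothetical, not part of the PR:

```python
# Hypothetical sketch: wrap the table reference in TABLESAMPLE SYSTEM
# (Postgres syntax) when a sample percentage is given, so the aggregate
# query scans only a fraction of the table's pages.
def metrics_source(table, sample_percent=None):
    if sample_percent is not None:
        return f"{table} TABLESAMPLE SYSTEM ({sample_percent})"
    return table

print(metrics_source('"public"."orders"', 1.0))
print(metrics_source('"public"."orders"'))
```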
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`dcs_core/integrations/databases/postgres.py` (2 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
dcs_core/integrations/databases/postgres.py
354-354: Prefer next(iter(result)) over single element slice
Replace with next(iter(result))
(RUF015)
```python
if isinstance(value, (datetime, datetime.date)):
    return value.isoformat()

if isinstance(value, UUID):
    return str(value)
```
**Fix isinstance checks for datetime and UUID (runtime error + wrong type).**

- `datetime` in `isinstance(..., (datetime, datetime.date))` is a module, causing a `TypeError`.
- `UUID` from SQLAlchemy isn't the runtime type returned by drivers; use `uuid.UUID`.

Apply:

```diff
-        if isinstance(value, (datetime, datetime.date)):
+        if isinstance(value, (datetime.datetime, datetime.date)):
             return value.isoformat()
-        if isinstance(value, UUID):
+        if isinstance(value, PyUUID):
             return str(value)
```

And adjust imports per the separate comment.

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In dcs_core/integrations/databases/postgres.py around lines 374 to 378, the
isinstance checks are wrong: `datetime` is a module (causing TypeError) and
`UUID` is the SQLAlchemy type (not the runtime uuid type). Replace the tuple
check with (datetime.datetime, datetime.date) and use uuid.UUID for the UUID
check; update imports to ensure datetime (module) and uuid (module) are
available or import datetime.datetime and uuid.UUID as needed per the other
import comment.
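Both failure modes are easy to demonstrate with the standard library; the `normalize` helper below is an illustrative stand-in for the PR's `_normalize_metrics`, not its actual code:

```python
import datetime
import uuid
from decimal import Decimal

# Passing the datetime *module* to isinstance raises TypeError at runtime.
try:
    isinstance(1, datetime)
except TypeError as exc:
    print("bad check:", exc)

def normalize(value):
    # Corrected checks: the datetime.datetime/datetime.date classes, and
    # stdlib uuid.UUID (what drivers actually return), not SQLAlchemy's UUID.
    if isinstance(value, Decimal):
        return float(value)
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    if isinstance(value, uuid.UUID):
        return str(value)
    return value

print(normalize(datetime.date(2024, 1, 2)))  # '2024-01-02'
print(normalize(uuid.UUID(int=0)))
```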
… and improve column quoting
…ional_queries in build_table_metrics_query
Actionable comments posted: 0
♻️ Duplicate comments (4)
dcs_core/integrations/databases/postgres.py (4)
**19-19: Fix the UUID import: use stdlib `uuid.UUID`, not SQLAlchemy's `UUID` type.**

This is the same critical issue flagged in a previous review. The SQLAlchemy `UUID` is a type definition for columns, not the runtime value class returned by database drivers. Runtime UUID values from psycopg are `uuid.UUID` from the standard library.

Apply this diff:

```diff
-from sqlalchemy import UUID, create_engine, text
+from sqlalchemy import create_engine, text
+from uuid import UUID as PyUUID
```

Then update line 382 to use `PyUUID` instead of `UUID`.
**318-321: Guard against empty `column_info` to prevent invalid SQL generation.**

This critical issue was flagged in a previous review and remains unresolved. If `column_info` is empty, the generated SQL becomes `SELECT FROM ...`, which is syntactically invalid and will cause a runtime error.

Apply this diff:

```diff
     ) -> list[dict]:
         query_parts = []
+        if not column_info:
+            return []
         for col in column_info:
```
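A minimal sketch of the guard; the `build_query_parts` name and the shape of `column_info` (a list of dicts with a `"name"` key) are assumptions for illustration:

```python
# With no columns there is nothing to SELECT, so return early instead of
# emitting invalid "SELECT FROM ..." SQL.
def build_query_parts(column_info):
    if not column_info:
        return []
    return [
        f'COUNT(DISTINCT "{c["name"]}") AS "{c["name"]}_distinct"'
        for c in column_info
    ]

print(build_query_parts([]))
print(build_query_parts([{"name": "id"}]))
```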
**325-327: Fix the duplicate count formula: NULLs should not be counted as duplicates.**

This critical issue was flagged in a previous review and remains unresolved. The formula `COUNT(*) - COUNT(DISTINCT "{name}")` incorrectly treats NULLs as duplicates because `COUNT(*)` includes NULL rows while `COUNT(DISTINCT "{name}")` excludes them.

The correct formula excludes NULLs from both counts. Apply this diff:

```diff
-            query_parts.append(
-                f'COUNT(*) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
-            )
+            query_parts.append(
+                f'COUNT({self.quote_column(name)}) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
+            )
```
**379-383: Fix isinstance checks: `datetime` is a module, and `UUID` must be from stdlib.**

This critical issue was flagged in a previous review and remains unresolved. Two problems:

- Line 379: `datetime` is the module, not the class. Using it in `isinstance` will raise `TypeError: isinstance() arg 2 must be a type or tuple of types`.
- Line 382: `UUID` from SQLAlchemy is a column type class, not the runtime value type. Database drivers return `uuid.UUID` from the standard library.

Apply this diff (after fixing the import per the first comment):

```diff
-        if isinstance(value, (datetime, datetime.date)):
+        if isinstance(value, (datetime.datetime, datetime.date)):
             return value.isoformat()
-        if isinstance(value, UUID):
+        if isinstance(value, PyUUID):
             return str(value)
```
🧹 Nitpick comments (1)

`dcs_core/integrations/databases/postgres.py` (1)

**359-359: Prefer `next(iter(result))` for efficiency.**

Creating a full list from the result just to access the first element is inefficient. Use `next(iter(result))` instead.

Apply this diff:

```diff
         result = self.connection.execute(text(query))
-        row = dict(list(result)[0]._mapping)
+        row = dict(next(iter(result))._mapping)
```

Based on static analysis hints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`dcs_core/integrations/databases/postgres.py` (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
`dcs_core/integrations/databases/postgres.py` (5)

- `dcs_core/integrations/databases/mysql.py` (2): `quote_column` (86-92), `qualified_table_name` (76-84)
- `dcs_core/integrations/databases/oracle.py` (2): `quote_column` (80-86), `qualified_table_name` (70-78)
- `dcs_core/integrations/databases/mssql.py` (2): `quote_column` (120-126), `qualified_table_name` (110-118)
- `dcs_core/integrations/databases/sybase.py` (2): `quote_column` (210-216), `qualified_table_name` (200-208)
- `dcs_core/core/datasource/sql_datasource.py` (2): `quote_column` (148-154), `qualified_table_name` (130-138)
🪛 GitHub Actions: CI

`dcs_core/integrations/databases/postgres.py`

- [error] 1-1: Trailing whitespace detected and removed by the pre-commit hook; fix in file.
- [error] 356-356: Black formatting failed: cannot parse source due to an f-string backslash issue. Possible Python version compatibility problem.
🪛 Ruff (0.14.0)
dcs_core/integrations/databases/postgres.py
359-359: Prefer next(iter(result)) over single element slice
Replace with next(iter(result))
(RUF015)
Actionable comments posted: 0
♻️ Duplicate comments (3)
dcs_core/integrations/databases/postgres.py (3)
**316-322: Guard against empty `column_info` to prevent invalid SQL.**

If `column_info` is empty, the generated SQL becomes `SELECT FROM ...`, which is invalid and will cause a runtime error.

Apply this diff to add an early guard:

```diff
     ) -> list[dict]:
         query_parts = []
+        if not column_info:
+            return []
         for col in column_info:
```
**328-336: Duplicate count formula treats NULLs as duplicates.**

Line 332 uses `COUNT(*) - COUNT(DISTINCT ...)`, which incorrectly treats NULL values as duplicates. The correct formula excludes NULLs.

Apply this diff:

```diff
             query_parts.append(
-                f'COUNT(*) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
+                f'COUNT({self.quote_column(name)}) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
             )
```
**387-388: Fix isinstance check: `datetime` is a module, not a type.**

Line 387 uses `isinstance(value, (datetime, datetime.date))`, but `datetime` is imported as a module (line 15), not a class. This will raise `TypeError: isinstance() arg 2 must be a type or tuple of types` at runtime.

Apply this diff:

```diff
-        if isinstance(value, (datetime, datetime.date)):
+        if isinstance(value, (datetime.datetime, datetime.date)):
             return value.isoformat()
```
🧹 Nitpick comments (2)

`dcs_core/integrations/databases/postgres.py` (2)

**359-361: Rename the loop variable and document the SQL injection risk.**

The loop variable `queries` (plural) iterates over individual query strings, which is confusing. Additionally, `additional_queries` strings are concatenated directly without validation, creating a potential SQL injection risk if input is not sanitized by the caller.

Apply this diff to improve clarity:

```diff
         if additional_queries:
-            for queries in additional_queries:
-                query_parts.append(queries)
+            for query in additional_queries:
+                query_parts.append(query)
```

Consider adding a docstring warning that `additional_queries` must be pre-validated to prevent SQL injection.
**363-367: Use `next(iter(...))` instead of list slicing for the single-row result.**

Line 367 materializes the entire result set into a list before taking the first element, which is inefficient. Since the aggregate query returns exactly one row, use `next(iter(result))` instead. As per static analysis hints.

Apply this diff:

```diff
         result = self.connection.execute(text(query))
-        row = dict(list(result)[0]._mapping)
+        row = dict(next(iter(result))._mapping)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`dcs_core/integrations/databases/postgres.py` (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
`dcs_core/integrations/databases/postgres.py` (5)

- `dcs_core/integrations/databases/mysql.py` (2): `quote_column` (86-92), `qualified_table_name` (76-84)
- `dcs_core/integrations/databases/oracle.py` (2): `quote_column` (80-86), `qualified_table_name` (70-78)
- `dcs_core/integrations/databases/mssql.py` (2): `quote_column` (120-126), `qualified_table_name` (110-118)
- `dcs_core/integrations/databases/sybase.py` (2): `quote_column` (210-216), `qualified_table_name` (200-208)
- `dcs_core/core/datasource/sql_datasource.py` (2): `quote_column` (148-154), `qualified_table_name` (130-138)
🪛 Ruff (0.14.0)
dcs_core/integrations/databases/postgres.py
367-367: Prefer next(iter(result)) over single element slice
Replace with next(iter(result))
(RUF015)
🔇 Additional comments (5)
dcs_core/integrations/databases/postgres.py (5)
**15-18: LGTM!** The imports are correctly added to support the new functionality.

**338-352: LGTM!** The numeric type detection and metric calculations (min, max, average) are correctly implemented.

**354-357: LGTM!** The text type detection and max character length calculation are correctly implemented.

**369-398: Well-structured normalization logic with good defensive programming.** The `_normalize_metrics` helper correctly handles various data types and includes proper None checks, recursive handling of nested structures, and a safe fallback to string conversion. The only issue is the datetime isinstance check on line 387 (flagged separately).

**400-411: LGTM!** The result transformation logic correctly maps the flat row dictionary to a structured per-column format, applying normalization to each metric value.
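A sketch of the kind of recursive normalization described above, under stated assumptions (the `normalize_metric` name and exact branch order are illustrative; the PR's `_normalize_metrics` may differ in details):

```python
import datetime
from decimal import Decimal

def normalize_metric(value):
    # None passes through; Decimal becomes float; dates become ISO strings;
    # dicts/lists are walked recursively; unknown types fall back to str().
    if value is None:
        return None
    if isinstance(value, Decimal):
        return float(value)
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    if isinstance(value, dict):
        return {k: normalize_metric(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize_metric(v) for v in value]
    if isinstance(value, (int, float, str, bool)):
        return value
    return str(value)

print(normalize_metric({"avg": Decimal("3.5"), "seen": [datetime.date(2024, 1, 1)]}))
```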
…t datetime import
Actionable comments posted: 0
♻️ Duplicate comments (1)
dcs_core/integrations/databases/postgres.py (1)
**333-335: Duplicate count is wrong when NULLs exist.**

`COUNT(*) - COUNT(DISTINCT ...)` treats every NULL as a duplicate. Use `COUNT(column)` so NULLs are excluded.

Apply:

```diff
-            query_parts.append(
-                f'COUNT(*) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
-            )
+            query_parts.append(
+                f'COUNT({self.quote_column(name)}) - COUNT(DISTINCT {self.quote_column(name)}) AS "{name}_duplicate"'
+            )
```
🧹 Nitpick comments (1)
`dcs_core/integrations/databases/postgres.py` (1)

**368-369: Fetch the single row without materializing the whole result.**

`list(result)[0]` pulls the entire cursor into memory. Use `result.fetchone()` (or `result.mappings().first()`) to grab the only row more efficiently.

Consider:

```diff
-        result = self.connection.execute(text(query))
-        row = dict(list(result)[0]._mapping)
+        result = self.connection.execute(text(query))
+        fetched = result.fetchone()
+        row = dict(fetched._mapping) if fetched else {}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`dcs_core/integrations/databases/postgres.py` (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
`dcs_core/integrations/databases/postgres.py` (2)

- `dcs_core/integrations/databases/mysql.py` (2): `quote_column` (86-92), `qualified_table_name` (76-84)
- `dcs_core/core/datasource/sql_datasource.py` (2): `quote_column` (148-154), `qualified_table_name` (130-138)
🪛 Ruff (0.14.0)
dcs_core/integrations/databases/postgres.py
369-369: Prefer next(iter(result)) over single element slice
Replace with next(iter(result))
(RUF015)
User description
…s retrieval
PR Type
Enhancement
Description
- Add `build_table_metrics_query` method to compute column-level statistics
- Support numeric metrics (min, max, average) and text metrics (max character length)
- Include distinct count, duplicate count, and null count for all columns
- Implement value normalization for JSON serialization (Decimal, datetime, UUID)
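Under stated assumptions (sqlite3 standing in for the Postgres connection, a simplified `column_info` of `{"name": ...}` dicts, and pre-validated identifiers), the described flow can be sketched end to end: build one aggregated SELECT, execute it once, split the single row into per-column dicts.

```python
import sqlite3

def table_metrics(conn, table, column_info):
    # Illustrative sketch, not the PR's method. Identifiers are assumed
    # pre-validated; real code should quote via a helper.
    parts = []
    for col in column_info:
        name = col["name"]
        parts.append(f'COUNT(DISTINCT "{name}") AS "{name}_distinct"')
        parts.append(
            f'SUM(CASE WHEN "{name}" IS NULL THEN 1 ELSE 0 END) AS "{name}_is_null"'
        )
    if not parts:  # guard: no columns means no valid SELECT list
        return []
    conn.row_factory = sqlite3.Row
    row = dict(conn.execute(f'SELECT {", ".join(parts)} FROM {table}').fetchone())
    # Split the single aggregated row back into per-column dictionaries.
    return [
        {
            "column": col["name"],
            "distinct": row[f'{col["name"]}_distinct'],
            "is_null": row[f'{col["name"]}_is_null'],
        }
        for col in column_info
    ]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a@x"), (2, None), (2, "a@x")])
print(table_metrics(conn, "users", [{"name": "id"}, {"name": "email"}]))
```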
Diagram Walkthrough
File Walkthrough
postgres.py
Add column metrics query builder with type normalization

`dcs_core/integrations/databases/postgres.py`

- Add `build_table_metrics_query` method to generate and execute column statistics queries
- Compute distinct, duplicate, and null counts across all columns
- Compute min/max/average for numeric types and max character length for text types
- Add `_normalize_metrics` helper function to convert database types (Decimal, datetime, UUID) to JSON-serializable formats