feat: Add live PostgreSQL introspection connector (resolves #1093)#1103
feat: Add live PostgreSQL introspection connector (resolves #1093)#1103geektan123 wants to merge 2 commits into
Conversation
safishamsi
left a comment
There was a problem hiding this comment.
Thanks for this @geektan123 - really nice first cut, and exactly the thin-wrapper approach #1093 was after: reconstruct DDL from the catalog and hand it to extract_sql so the whole downstream pipeline stays unchanged. I traced it end-to-end and the core works - FK edges resolve correctly (by table name, so the (id INT) stub is harmless), credentials are correctly kept out of source_file, the catalog queries are static/non-injectable, and the tests genuinely exercise the reconstruction → real tree-sitter parse rather than over-mocking.
A few things to address before merge - two are silent-correctness bugs that the current tests don't catch (all fixtures use clean lowercase names + a single-column FK, so these paths are invisible):
Must-fix
1. Quote identifiers (silent table loss). Schema/table/column names are interpolated raw into the DDL:
ddl.append(f"CREATE TABLE {schema}.{name} (id INT);")A table named order (reserved word), user-data (hyphen), "MixedCase" (case-sensitive), or anything with a space/;/quote produces DDL that tree-sitter-sql misparses - and the table (plus any FK touching it) is silently dropped, no error. Postgres allows almost any character in a quoted identifier, so this will bite real schemas.
- Fix: emit double-quoted identifiers, e.g.
"{schema}"."{name}", doubling any embedded". Please double-check thattable_nidskeying inextract_sqlstays consistent once names are quoted (the.lower()keying may need to account for the quotes).
2. Composite (multi-column) foreign keys over-count. The FK query returns one row per key column, and the code emits a separate single-column ADD CONSTRAINT fk_{schema}_{table}_{col} per row. A 2-column composite FK becomes two constraints → two duplicate references edges between the same table pair (there's no edge dedup), inflating edge counts, and the per-column constraint naming is semantically wrong.
- Fix:
GROUP BYthe constraint and aggregate columns into oneFOREIGN KEY (a, b) REFERENCES t (c, d), named by the constraint rather than the column.
3. Tests for both of the above. Add a fixture with a reserved-word/quoted/special-char table name (assert its node + FK survive) and a composite (2-column) FK (assert exactly one references edge). These are the highest-severity paths and currently untested.
Should-fix (can be follow-up, but ideally here)
4. Graceful connection errors. psycopg.connect(dsn or "") throws a raw OperationalError (full stack trace) on a bad DSN / unreachable host / auth failure. Please catch it and surface a clean error: could not connect to PostgreSQL: ... with a non-zero exit, matching the CLI's existing error style. Be careful the error message doesn't echo the DSN/credentials.
5. Function-body dollar-quote collision. CREATE FUNCTION ... AS $$ {body} $$ breaks if routine_definition itself contains $$ → the function is dropped. Use a unique tag (e.g. $gfx$ ... $gfx$). Also routine_definition is often SQL-language (or NULL for C/internal functions), so hardcoding LANGUAGE plpgsql is sometimes wrong - consider reading external_language / routine_body from information_schema.routines.
Note (not blocking)
The SQL extractor only produces object-level nodes (tables/views/functions), not column nodes - so columns aren't represented in the graph today. #1093 mentioned "columns", so worth a one-line note in the docs that this maps table/view/function structure + FK relationships, not individual columns.
Overall this is close - the design is right and the mechanism is sound. Fix the two quoting/composite-FK bugs (+ tests) and it's good to merge. Thanks again for picking this up so fast!
…nnection error handling - Quote all PostgreSQL identifiers with _quote_ident() to handle reserved words, hyphens, mixed-case, and embedded double-quotes - GROUP BY constraint in FK query + ARRAY_AGG to emit one ALTER TABLE per composite FK (prevents duplicate references edges) - Catch psycopg.OperationalError and re-raise as sanitized ConnectionError (no credentials in message, no raw stack trace) - Use $gfx$ dollar-quote tag to avoid collision with $$ in function bodies - Read external_language from information_schema.routines instead of hardcoding plpgsql - Add tests for quoted/special-char table names and composite FKs - Add docs note that column-level detail is not represented in the graph
8eb5b48 to
d8f79c1
Compare
|
All five items from the review are addressed in d8f79c1. Also added the non-blocking docs note about column-level detail not being represented in the graph. |
… features) #1118 — prune stale AST nodes on full re-extraction (#1116) Stamps every AST-extracted node with _origin="ast" in extract(). On a full rebuild _rebuild_code drops any AST-marked node absent from the fresh output even when its source file survives, fixing stale symbols. Backward-compat: marker-less nodes from pre-1118 graphs survive one cycle then self-heal. #1110 — stop reading images and PDFs as garbage in headless extract Images route through per-backend vision payloads (base64/data-URI/bytes for claude/openai/bedrock); non-vision backends get _strip_pixels for graceful degradation. PDFs reuse pypdf. 5MB cap, 20-image chunk limit. #1159 — Salesforce Apex extractor (.cls, .trigger) Regex-based extractor: classes, interfaces, enums, methods, triggers, SOQL/DML edges. No new dependency. Dispatched as .cls and .trigger. #1107 — Azure OpenAI Service backend (--backend azure) Uses AzureOpenAI SDK client (from existing openai package). Auto-detects when AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT both set. Uses max_completion_tokens (not deprecated max_tokens). #1103 — live PostgreSQL introspection (--postgres DSN) graphify extract --postgres "postgresql://..." introspects tables, views, routines, and FK relations via information_schema (SERIALIZABLE READ ONLY). Credentials sanitized on error. New graphify[postgres] extra (psycopg3). Union-resolved llm.py conflict: Azure functions + bedrock images= param. Fixed test_image_vision.py mock to accept timeout= kwarg (our #1112). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Landed in 7467c1b. What it adds: 5 mocked tests pass (no real PostgreSQL instance needed). 1910 passed, 0 failures. |
This PR implements direct PostgreSQL database introspection support via
graphify extract --postgres "postgresql://...", closing #1093.What it does:
--postgresCLI argument (DSN string or PG* env vars) to extract the schema structure (tables, views, routines, foreign key relations) directly from a running PG instance.pg_dumpround-trip.extract_sqlby converting catalog queries into standard SQL DDL strings in memory, meaning the downstream mapping behaves identically.tests/test_pg_introspect.py) that run without needing a live PG instance.README.mdandARCHITECTURE.md.