Skip to content

feat: Add live PostgreSQL introspection connector (resolves #1093)#1103

Closed
geektan123 wants to merge 2 commits into
Graphify-Labs:v8from
geektan123:feat/postgres-live-introspection
Closed

feat: Add live PostgreSQL introspection connector (resolves #1093)#1103
geektan123 wants to merge 2 commits into
Graphify-Labs:v8from
geektan123:feat/postgres-live-introspection

Conversation

@geektan123

Copy link
Copy Markdown
Contributor

This PR implements direct PostgreSQL database introspection support via graphify extract --postgres "postgresql://...", closing #1093.

What it does:

  • Adds a --postgres CLI argument (DSN string or PG* env vars) to extract the schema structure (tables, views, routines, foreign key relations) directly from a running PG instance.
  • Avoids the manual pg_dump round-trip.
  • Reuses extract_sql by converting catalog queries into standard SQL DDL strings in memory, meaning the downstream mapping behaves identically.
  • Correctly skips LLM API requirements when only introspecting a database schema.
  • Added comprehensive mocking unit tests (tests/test_pg_introspect.py) that run without needing a live PG instance.
  • Fully documented in README.md and ARCHITECTURE.md.

@safishamsi safishamsi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @geektan123 - really nice first cut, and exactly the thin-wrapper approach #1093 was after: reconstruct DDL from the catalog and hand it to extract_sql so the whole downstream pipeline stays unchanged. I traced it end-to-end and the core works - FK edges resolve correctly (by table name, so the (id INT) stub is harmless), credentials are correctly kept out of source_file, the catalog queries are static/non-injectable, and the tests genuinely exercise the reconstruction → real tree-sitter parse rather than over-mocking.

A few things to address before merge - two are silent-correctness bugs that the current tests don't catch (all fixtures use clean lowercase names + a single-column FK, so these paths are invisible):

Must-fix

1. Quote identifiers (silent table loss). Schema/table/column names are interpolated raw into the DDL:

ddl.append(f"CREATE TABLE {schema}.{name} (id INT);")

A table named order (reserved word), user-data (hyphen), "MixedCase" (case-sensitive), or anything with a space/;/quote produces DDL that tree-sitter-sql misparses - and the table (plus any FK touching it) is silently dropped, no error. Postgres allows almost any character in a quoted identifier, so this will bite real schemas.

  • Fix: emit double-quoted identifiers, e.g. "{schema}"."{name}", doubling any embedded ". Please double-check that table_nids keying in extract_sql stays consistent once names are quoted (the .lower() keying may need to account for the quotes).

2. Composite (multi-column) foreign keys over-count. The FK query returns one row per key column, and the code emits a separate single-column ADD CONSTRAINT fk_{schema}_{table}_{col} per row. A 2-column composite FK becomes two constraints → two duplicate references edges between the same table pair (there's no edge dedup), inflating edge counts, and the per-column constraint naming is semantically wrong.

  • Fix: GROUP BY the constraint and aggregate columns into one FOREIGN KEY (a, b) REFERENCES t (c, d), named by the constraint rather than the column.

3. Tests for both of the above. Add a fixture with a reserved-word/quoted/special-char table name (assert its node + FK survive) and a composite (2-column) FK (assert exactly one references edge). These are the highest-severity paths and currently untested.

Should-fix (can be follow-up, but ideally here)

4. Graceful connection errors. psycopg.connect(dsn or "") throws a raw OperationalError (full stack trace) on a bad DSN / unreachable host / auth failure. Please catch it and surface a clean error: could not connect to PostgreSQL: ... with a non-zero exit, matching the CLI's existing error style. Be careful the error message doesn't echo the DSN/credentials.

5. Function-body dollar-quote collision. CREATE FUNCTION ... AS $$ {body} $$ breaks if routine_definition itself contains $$ → the function is dropped. Use a unique tag (e.g. $gfx$ ... $gfx$). Also routine_definition is often SQL-language (or NULL for C/internal functions), so hardcoding LANGUAGE plpgsql is sometimes wrong - consider reading external_language / routine_body from information_schema.routines.

Note (not blocking)

The SQL extractor only produces object-level nodes (tables/views/functions), not column nodes - so columns aren't represented in the graph today. #1093 mentioned "columns", so worth a one-line note in the docs that this maps table/view/function structure + FK relationships, not individual columns.

Overall this is close - the design is right and the mechanism is sound. Fix the two quoting/composite-FK bugs (+ tests) and it's good to merge. Thanks again for picking this up so fast!

Tanay added 2 commits June 2, 2026 11:05
…nnection error handling

- Quote all PostgreSQL identifiers with _quote_ident() to handle reserved
  words, hyphens, mixed-case, and embedded double-quotes
- GROUP BY constraint in FK query + ARRAY_AGG to emit one ALTER TABLE per
  composite FK (prevents duplicate references edges)
- Catch psycopg.OperationalError and re-raise as sanitized ConnectionError
  (no credentials in message, no raw stack trace)
- Use $gfx$ dollar-quote tag to avoid collision with $$ in function bodies
- Read external_language from information_schema.routines instead of
  hardcoding plpgsql
- Add tests for quoted/special-char table names and composite FKs
- Add docs note that column-level detail is not represented in the graph
@geektan123 geektan123 force-pushed the feat/postgres-live-introspection branch from 8eb5b48 to d8f79c1 Compare June 2, 2026 05:39
@geektan123

Copy link
Copy Markdown
Contributor Author

All five items from the review are addressed in d8f79c1. Also added the non-blocking docs note about column-level detail not being represented in the graph.

@geektan123 geektan123 requested a review from safishamsi June 2, 2026 05:43
safishamsi added a commit that referenced this pull request Jun 7, 2026
… features)

#1118 — prune stale AST nodes on full re-extraction (#1116)
Stamps every AST-extracted node with _origin="ast" in extract(). On a
full rebuild _rebuild_code drops any AST-marked node absent from the
fresh output even when its source file survives, fixing stale symbols.
Backward-compat: marker-less nodes from pre-1118 graphs survive one
cycle then self-heal.

#1110 — stop reading images and PDFs as garbage in headless extract
Images route through per-backend vision payloads (base64/data-URI/bytes
for claude/openai/bedrock); non-vision backends get _strip_pixels for
graceful degradation. PDFs reuse pypdf. 5MB cap, 20-image chunk limit.

#1159 — Salesforce Apex extractor (.cls, .trigger)
Regex-based extractor: classes, interfaces, enums, methods, triggers,
SOQL/DML edges. No new dependency. Dispatched as .cls and .trigger.

#1107 — Azure OpenAI Service backend (--backend azure)
Uses AzureOpenAI SDK client (from existing openai package). Auto-detects
when AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT both set. Uses
max_completion_tokens (not deprecated max_tokens).

#1103 — live PostgreSQL introspection (--postgres DSN)
graphify extract --postgres "postgresql://..." introspects tables, views,
routines, and FK relations via information_schema (SERIALIZABLE READ ONLY).
Credentials sanitized on error. New graphify[postgres] extra (psycopg3).

Union-resolved llm.py conflict: Azure functions + bedrock images= param.
Fixed test_image_vision.py mock to accept timeout= kwarg (our #1112).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@safishamsi

Copy link
Copy Markdown
Collaborator

Landed in 7467c1b.

What it adds: graphify extract --postgres "postgresql://..." introspects a live PostgreSQL database directly — no pg_dump roundtrip needed. Extracts tables, views, routines, and foreign key relations via information_schema in a SERIALIZABLE READ ONLY transaction. Credentials are sanitized from error messages. New graphify[postgres] optional extra (psycopg3).

5 mocked tests pass (no real PostgreSQL instance needed). 1910 passed, 0 failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants