feat: add pascal optional extra for tree-sitter-pascal #1616

Open
vinicius-l-machado wants to merge 1 commit into Graphify-Labs:v8 from vinicius-l-machado:feat/pascal-delphi-extractor

Conversation

@vinicius-l-machado

`extract_pascal()` already imports `tree-sitter-pascal` for AST-quality extraction and falls back to a regex extractor when the grammar is absent (#781). However, the grammar was not declared anywhere in the package metadata, so it was never installed and the AST path never ran out of the box.

Declare a `pascal` extra (and add it to `all`) so users can opt into the AST extractor with `uv tool install "graphifyy[pascal]"`. `tree-sitter-pascal` publishes prebuilt wheels for every platform (Windows/macOS/Linux), so unlike the `dm` extra it needs no C toolchain.

On a mid-size Delphi codebase the AST path yields notably more accurate relationship edges than the regex fallback (calls and inherits both up ~25%). The README extras table and `uv.lock` are updated accordingly.
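
For readers unfamiliar with the optional-grammar pattern described above, here is a minimal sketch of how an extractor might pick between an AST path and a regex fallback. It is illustrative only, not the code in this PR: the module name `tree_sitter_pascal`, the helper names, and the regex are all assumptions.

```python
import importlib.util
import re

# Hypothetical regex for the fallback path: matches "procedure Name" or
# "function Name" at the start of a line (Pascal keywords are case-insensitive).
_ROUTINE_RE = re.compile(
    r"^\s*(?:procedure|function)\s+([\w.]+)",
    re.IGNORECASE | re.MULTILINE,
)


def has_grammar(module_name: str = "tree_sitter_pascal") -> bool:
    """Return True if the (assumed) grammar module is importable."""
    return importlib.util.find_spec(module_name) is not None


def choose_extractor(module_name: str = "tree_sitter_pascal") -> str:
    """Pick the AST path when the grammar is installed, else the regex path."""
    return "ast" if has_grammar(module_name) else "regex"


def extract_routines_regex(source: str) -> list[str]:
    """Fallback: list routine names found by the regex above."""
    return _ROUTINE_RE.findall(source)
```

Without the extra installed, `choose_extractor()` in this sketch always returns `"regex"`, which mirrors the out-of-the-box behavior the description says this PR is meant to fix.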

nokternol added a commit to nokternol/graphify that referenced this pull request Jul 4, 2026
…iling (Graphify-Labs#1616)

`graphify explain "<phrase>"` treats its whole argument as one string that
must match a single node's label exactly, as a prefix, or as a substring.
A genuine natural-language phrase (e.g. "critic score aggregation")
therefore returns "No node matching found" even when every individual word
exists on a real, relevant node, because no node label literally contains
the entire multi-word phrase. This silently dead-ends on exactly the query
shape `explain` is otherwise suggested for, with no fallback and no signal
that anything went wrong (worse than noise: a hard, silent zero).

When the tiered lookup finds nothing and the phrase has more than one token,
`explain` now falls back to the same per-token bag-of-words scoring `query`
already uses (`_score_nodes`) and lists the top candidates by term overlap,
in the same numbered-candidate format the existing ambiguity guard
(Graphify-Labs#1613) uses, instead of a bare dead end. A genuine single-word
miss is unaffected: the fallback is gated on token count, since a one-word
probe would score identically to the substring tier already tried and has
nothing new to find.
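
A rough sketch of the lookup order described above, for illustration. This is not the project's `_score_nodes` (its signature is not shown here); the function names, the plain substring-per-token scoring, and the `limit` of 10 are assumptions.

```python
def tiered_lookup(phrase: str, labels: list[str]) -> list[str]:
    """Try exact, then prefix, then substring match on the whole phrase."""
    p = phrase.lower()
    tiers = (
        lambda lab: lab == p,
        lambda lab: lab.startswith(p),
        lambda lab: p in lab,
    )
    for matches in tiers:
        hits = [lab for lab in labels if matches(lab.lower())]
        if hits:
            return hits
    return []


def term_overlap_candidates(phrase: str, labels: list[str], limit: int = 10) -> list[str]:
    """Score each label by how many query tokens it contains (bag of words)."""
    tokens = phrase.lower().split()
    scored = []
    for lab in labels:
        score = sum(1 for tok in tokens if tok in lab.lower())
        if score:
            scored.append((score, lab))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [lab for _, lab in scored[:limit]]


def explain_candidates(phrase: str, labels: list[str]) -> list[str]:
    """Tiered lookup first; per-token fallback only for multi-word phrases."""
    hits = tiered_lookup(phrase, labels)
    if hits:
        return hits
    if len(phrase.split()) > 1:
        return term_overlap_candidates(phrase, labels)
    return []
```

The token-count gate in `explain_candidates` is what keeps a single-word miss on the original path: with one token, the overlap score would only repeat the substring tier.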

Regression tests: multi-word phrase with real term overlap surfaces
candidates and excludes unrelated nodes; multi-word phrase with zero overlap
still gets the honest original message; single-word miss is byte-identical
to prior behavior. Full suite (2766 tests, 1 pre-existing unrelated failure)
and ruff pass. Verified live against a real repo's graph.json: both
previously-zero `explain` queries now surface their real target
(`ratingsAggregation.ts`, `backdrops.handler.ts`) instead of nothing.
nokternol added a commit to nokternol/graphify that referenced this pull request Jul 4, 2026
…ify-Labs#1618)

Graphify-Labs#1616's term-overlap fallback (added earlier in this same
session) fixed `explain` hard-failing to zero on multi-word natural-language
phrases, but has its own failure mode: when a query's only shared vocabulary
with the corpus is one generic word, every node containing that word ties at
the weakest possible bonus tier, and the fallback presents an arbitrary
top-10 slice of that tie as though it were a considered answer. Live repro:
"server startup error handling" matched 1,765 of this repo's 3,491 nodes
(51%), because "server" is also this repo's top-level backend directory
name. The real target was buried past rank 800, tied with 1,627 other nodes
at the exact same floor score. That is not a useful answer; it is close to a
coin flip dressed up as one.

Fix: after scoring, if the candidate count exceeds both an absolute floor
(50) and 15% of the graph's total node count, treat it as a noise flood and
fall back to the same honest zero-match message a genuine miss gets, instead
of printing a misleadingly specific candidate list. The floor keeps this
from firing on small graphs/fixtures, where even "most of the graph matched"
can be a small, legitimate list. Genuine large-but-real candidate lists
(e.g. 31 candidates on this repo's ~3,491-node graph, an earlier fix's
verified-good case) stay well under the threshold and are unaffected.
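
The threshold check described above can be sketched as follows. The function name and parameter names are hypothetical; only the floor of 50 and the 15% ratio come from the commit message.

```python
def is_noise_flood(
    candidate_count: int,
    total_nodes: int,
    floor: int = 50,
    ratio: float = 0.15,
) -> bool:
    """Return True when a candidate list looks like a generic-token flood.

    Both conditions must hold: more candidates than the absolute floor,
    and more candidates than `ratio` of the graph's total node count.
    """
    return candidate_count > floor and candidate_count > ratio * total_nodes


# Cases quoted in this commit message:
print(is_noise_flood(1765, 3491))  # live repro, treated as a flood
print(is_noise_flood(31, 3491))    # verified-good list, left alone
print(is_noise_flood(60, 61))      # regression test: flood
print(is_noise_flood(20, 21))      # regression test: below the floor
```

Requiring both conditions is what lets the 20-of-21 case through: it matches nearly the whole graph, but stays under the absolute floor.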

Regression tests: a 60-of-61-node noise flood on one generic token now gets
the honest no-match message; a 20-of-21-node case (below this graph size's
threshold) still shows its candidate list normally, confirming the guard is
for degenerate floods specifically, not just "more than 10 results." Full
suite (2769 tests, all passing this run — the one known pre-existing
test-order flake did not trigger) and ruff pass. Verified live: the exact
1,765-candidate flood from earlier now returns the honest no-match message;
smaller legitimate fallbacks (critic score aggregation, backdrop image
selection) are unaffected.