feat(extract): add Pascal/Delphi and Lazarus IDE support#781
safishamsi
left a comment
Great PR overall — follows extractor conventions, solid test coverage (28 tests), and the inherits-edge fix is a genuine improvement. Just one thing needed before merge.
Required: add .lpr extension
The PR title promises Lazarus IDE support but .lpr (the Lazarus program file — the entry point of every Lazarus project, identical Pascal syntax to .dpr) is missing from both CODE_EXTENSIONS in detect.py and _DISPATCH in extract.py. This is a one-line fix in each file, mapping .lpr to extract_pascal.
.lpi (Lazarus project info, XML format) can be skipped or added in a follow-up — that's a separate parser.
Minor (non-blocking)
- `rstrip("()")` strips any combination of `(` and `)` chars rather than the literal suffix `"()"`; `removesuffix("()")` would be more precise (though benign in practice since labels are always `"Name()"`).
- Module-level `_pascal_unit_cache` / `_pascal_class_stem_cache` persist across `extract()` calls in the same process, so they could leak stale data if files move between runs. Low priority but worth a note.
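To make the `rstrip` point concrete, here is a small standalone comparison. It does not use any code from the PR; the `odd` label is a made-up input chosen only to show where the two methods diverge (`str.removesuffix` requires Python 3.9+).

```python
# rstrip() treats its argument as a set of characters to strip from the right.
label = "Name()"
same_with_rstrip = label.rstrip("()")          # "Name"
same_with_suffix = label.removesuffix("()")    # "Name"

# The two differ once the text before the suffix also ends in "(" or ")".
odd = "Wrap(x)()"                              # hypothetical label, not from the PR
rstrip_result = odd.rstrip("()")               # strips ")", "(", ")" in turn
suffix_result = odd.removesuffix("()")         # strips only the final "()"

print(rstrip_result)   # Wrap(x
print(suffix_result)   # Wrap(x)
```

For labels that always look like `"Name()"` the results are identical, which is why the comment is marked non-blocking.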
Once .lpr is added this is a clean merge.
Adds full AST extraction for Pascal and Delphi source files using tree-sitter-pascal (https://github.com/Isopod/tree-sitter-pascal).

Supported file extensions: .pas, .pp, .dpr, .dpk, .inc

Extracted nodes:
- File node (the .pas file itself)
- unit / program / library declarations
- class, interface, and helper type declarations
- procedure and function implementations

Extracted edges:
- file --contains--> module
- module --imports--> dependency (via uses clause, resolved to path-based IDs)
- class --inherits--> base class / interface
- class/module --contains/method--> procedure or function
- procedure --calls--> procedure (in-file call resolution)

Key design: uses clause targets are resolved to path-based node IDs by scanning all Pascal files under the project root (_pascal_project_root + _pascal_resolve_unit helpers). This avoids dangling import edges that result from resolving bare unit names like "SysUtils" to IDs that never match any file node.

Bare procedure calls (e.g. `Reset;` without parentheses) are detected by inspecting statement nodes whose sole named child is an identifier, in addition to the standard exprCall nodes used for calls with arguments.

Requires: pip install tree-sitter-pascal (https://github.com/Isopod/tree-sitter-pascal)
If not installed, extract_pascal returns {"nodes":[], "edges":[], "error": ...} so the rest of the pipeline is unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
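The uses-clause resolution described above can be sketched roughly as follows. This is not the PR's code: `build_unit_index` and `resolve_unit` are hypothetical names, and both the ID format (lower-cased relative path without suffix) and the bare-name fallback for external units are assumptions made for illustration.

```python
from pathlib import Path

PASCAL_SUFFIXES = {".pas", ".pp", ".dpr", ".dpk", ".inc"}

def build_unit_index(project_root: Path) -> dict[str, Path]:
    """Map lower-cased file stems to paths for every Pascal file under the root."""
    index: dict[str, Path] = {}
    for path in sorted(project_root.rglob("*")):
        if path.is_file() and path.suffix.lower() in PASCAL_SUFFIXES:
            # Pascal unit names are case-insensitive, so index by lower-cased stem.
            index.setdefault(path.stem.lower(), path)
    return index

def resolve_unit(name: str, index: dict[str, Path], project_root: Path) -> str:
    """Return a path-based ID for a project unit (assumed ID format)."""
    path = index.get(name.lower())
    if path is None:
        # Assumption: units outside the project (e.g. SysUtils) keep a bare-name ID.
        return name.lower()
    return path.relative_to(project_root).with_suffix("").as_posix().lower()
```

Building the index once per project root and reusing it is what makes the lookup cheap; it is also the kind of module-level state that can go stale if files move between runs.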
Adds two new extractors for Lazarus IDE-specific file formats:

extract_lazarus_form(): .lfm (Lazarus Form files)
.lfm files are text-based UI component trees. The extractor parses `object Name: TClassName ... end` blocks to build a containment graph of form components, and captures `OnXxx = HandlerName` event bindings as `references` edges (context: "event") linking each component to its handler procedure.

extract_lazarus_package(): .lpk (Lazarus Package files)
.lpk files are XML package definitions. The extractor reads the package name, required package dependencies (→ imports edges), and listed unit files (→ contains edges). Unit names are resolved to path-based node IDs via _pascal_resolve_unit so they connect to the same nodes produced by extract_pascal on .pas files.

Both extensions added to CODE_EXTENSIONS in detect.py and to _DISPATCH. 13 new tests in test_pascal.py cover both extractors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
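As a rough illustration of the `.lfm` parsing idea, the sketch below walks `object ... end` blocks with a stack and records containment and event edges. It is an assumption-laden simplification, not the PR's implementation: `parse_form` and the tuple-based edge shape are hypothetical, and it does not handle `inherited` objects or `item ... end` collection entries.

```python
import re

OBJECT_RE = re.compile(r"^\s*object\s+(\w+)\s*:\s*(\w+)", re.IGNORECASE)
EVENT_RE = re.compile(r"^\s*(On\w+)\s*=\s*(\w+)\s*$")
END_RE = re.compile(r"^\s*end\s*$", re.IGNORECASE)

def parse_form(text):
    """Return (nodes, edges) for a text form file; edges are (source, relation, target)."""
    nodes, edges, stack = [], [], []
    for line in text.splitlines():
        match = OBJECT_RE.match(line)
        if match:
            name, class_name = match.groups()
            nodes.append({"id": name, "class": class_name})
            if stack:
                # The enclosing component contains the new one.
                edges.append((stack[-1], "contains", name))
            stack.append(name)
            continue
        if END_RE.match(line) and stack:
            stack.pop()
            continue
        match = EVENT_RE.match(line)
        if match and stack:
            # OnXxx = HandlerName binds the current component to its handler.
            edges.append((stack[-1], "references", match.group(2)))
    return nodes, edges

sample = """object Form1: TForm1
  Caption = 'Demo'
  OnCreate = FormCreate
  object Button1: TButton
    OnClick = Button1Click
  end
end
"""
nodes, edges = parse_form(sample)
```

A fuller version would attach the `"event"` context to each `references` edge, as the commit message describes.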
…classes

The declType/typeref handler built inherits edge targets with _make_id(_read(child)), which is just the bare class name. But class nodes use _make_id(stem, type_name), so targets never matched, making the entire class hierarchy invisible in the graph.

Add _pascal_class_stem_cache and _pascal_resolve_class(): strips the conventional T/I prefix, locates the defining file by stem lookup (same cache mechanism as _pascal_resolve_unit), and returns the correct _make_id(file_stem, class_name) ID. RTL/unresolvable bases (e.g. TObject) fall back to _make_id(bare_name) with an explicit stub node, following the same pattern as the Python extractor.

Also remove the `break` that stopped after the first typeref, so all parents are captured (e.g. class(TBase, IInterface)).

Extend test_pascal_no_dangling_edges to also assert that within-file edge targets (contains, method, inherits, calls) resolve to real nodes.
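The resolution rule above might look something like the following sketch. `make_id` and `resolve_class` are hypothetical stand-ins for `_make_id` and `_pascal_resolve_class`; the ID format, the shape of the stem index, and the check that the prefix is followed by an uppercase letter are all assumptions.

```python
def make_id(*parts: str) -> str:
    """Hypothetical stand-in for the project's _make_id helper."""
    return "_".join(part.lower() for part in parts)

def resolve_class(type_name: str, stem_index: dict[str, str]) -> tuple[str, bool]:
    """Return (node_id, resolved) for one base-class reference."""
    # Strip the conventional T/I prefix (assumed: only when an uppercase letter follows).
    has_prefix = type_name[:1] in ("T", "I") and type_name[1:2].isupper()
    stem = type_name[1:] if has_prefix else type_name
    file_stem = stem_index.get(stem.lower())
    if file_stem is None:
        # Unresolvable base such as TObject: bare-name ID; the caller adds a stub node.
        return make_id(type_name), False
    return make_id(file_stem, type_name), True

# Hypothetical index: file stem "animal" defines class TAnimal.
index = {"animal": "animal"}
# Every parent is resolved, not just the first one.
bases = [resolve_class(base, index) for base in ("TAnimal", "IWalker", "TObject")]
```

Returning the `resolved` flag alongside the ID lets the caller decide when a stub node is needed.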
Adds extract_delphi_form() for Delphi Form files (.dfm), which use the same `object Name: TClassName ... end` text syntax as Lazarus .lfm files.

Binary .dfm files (FF 0A magic header) are skipped gracefully with an informative error message so the pipeline is unaffected. Text .dfm files are parsed identically to .lfm: component containment (`contains` edges) and event handler references (`references`, context "event").

Adds .dfm to _DISPATCH and CODE_EXTENSIONS. 10 new tests in test_pascal.py, including a regression test for the binary-format detection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
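A minimal sketch of the binary-format guard, assuming the check is a simple prefix test on the bytes named in the commit message. `read_text_dfm`, the error wording, and the returned `"text"` key are hypothetical; only the `FF 0A` header and the empty nodes/edges result shape come from the description above.

```python
from pathlib import Path

BINARY_DFM_MAGIC = b"\xff\x0a"  # header bytes as stated in the commit message

def read_text_dfm(path: Path) -> dict:
    """Return the form text, or an empty result with an error for binary files."""
    raw = path.read_bytes()
    if raw.startswith(BINARY_DFM_MAGIC):
        # Skip gracefully so the rest of the pipeline keeps running.
        return {
            "nodes": [],
            "edges": [],
            "error": f"{path.name}: binary .dfm is not supported",
        }
    return {"text": raw.decode("utf-8", errors="replace")}
```

Reading bytes first and decoding only after the header check avoids a decode error on binary input.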
Address review feedback from @safishamsi:
- Add .lpr (Lazarus program file, identical syntax to .dpr) to _DISPATCH in extract.py and CODE_EXTENSIONS in detect.py so Lazarus project entry points are indexed. Completes the promised Lazarus IDE support.
- Replace rstrip("()") with removesuffix("()") in the call-resolution dict comprehension for precise suffix removal (rstrip strips individual characters, not the literal string "()").
- Add .lpr assertions to test_pascal_dispatch_registered and test_pascal_detect_extensions_registered.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
7b7aa10 to a3f1e53
Merged - thank you for your first OSS contribution! Pascal/Delphi/Lazarus support is now in v7. Great to hear graphify is useful for navigating your Delphi project.
Summary
Adds full knowledge-graph extraction for Pascal/Delphi codebases and
Lazarus IDE project files. Developed with Claude Sonnet 4.6.
1 - Pascal/Delphi extractor (`extract_pascal`)
New extractor for `.pas`, `.pp`, `.dpr`, `.dpk`, `.inc` files via `tree-sitter-pascal`.
Extracted nodes: file, unit/program/library, class/interface/helper,
procedure/function implementations.
Extracted edges:
- `file → contains → module`
- `module → imports → dependency` (uses clause, resolved to path-based IDs)
- `class → inherits → base class / interface`
- `class/module → contains/method → procedure or function`
- `procedure → calls → procedure` (in-file call resolution, including bare `Reset;`-style calls without parentheses)

Key design (import resolution): Uses clause targets are resolved to
path-based node IDs by scanning all Pascal files under the project root
(`_pascal_project_root` + `_pascal_resolve_unit` helpers with caching).
Without this, bare unit names like `SysUtils` resolve to IDs that never
match any file node, making the entire import graph invisible.
Optional dependency: `pip install tree-sitter-pascal`
If not installed, `extract_pascal` returns `{"nodes":[], "edges":[], "error":...}`
and the rest of the pipeline is unaffected.
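The optional-dependency behaviour can be sketched with a guarded import. `load_optional` is a hypothetical helper, and the error message text is an assumption; only the shape of the returned dictionary follows the description above.

```python
import importlib

def load_optional(module_name: str):
    """Return (module, None) on success or (None, error_result) if the import fails."""
    try:
        return importlib.import_module(module_name), None
    except ImportError as exc:
        # Same shape as a normal extractor result, so callers need no special case.
        error_result = {
            "nodes": [],
            "edges": [],
            "error": f"{module_name} not installed: {exc}",
        }
        return None, error_result

# Hypothetical usage with a module name that is not installed.
module, error = load_optional("definitely_not_installed_pkg_xyz")
```

Because the failure is reported as data rather than raised, a project without the grammar installed still gets results from every other extractor.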
2 - Inherits-edge fix
The original `declType`/`typeref` handler built `inherits` edge targets
with `_make_id(_read(child))`, which is just the bare class name. But class nodes
use `_make_id(stem, type_name)`, so targets never matched, making the
entire class hierarchy invisible in the graph.
Fix adds `_pascal_class_stem_cache` and `_pascal_resolve_class()`: strips
the conventional `T`/`I` prefix, locates the defining file by stem lookup
(same cache mechanism as `_pascal_resolve_unit`), and returns the correct
`_make_id(file_stem, class_name)` ID. RTL/unresolvable bases (e.g. `TObject`)
fall back to `_make_id(bare_name)` with an explicit stub node,
following the same pattern as the Python extractor.
Also removes the `break` that stopped after the first `typeref`, so all
parents of multi-inheritance declarations are captured.
3 - Lazarus form extractor (`extract_lazarus_form`, `.lfm`)
`.lfm` files are text-based UI component trees. The extractor parses
`object Name: TClassName ... end` blocks into a containment graph and
captures `OnXxx = HandlerName` event bindings as `references` edges
(context: `"event"`) linking each component to its handler procedure.
4 - Lazarus package extractor (`extract_lazarus_package`, `.lpk`)
`.lpk` files are XML package definitions. The extractor reads the package
name, required package dependencies (`→ imports` edges), and listed unit
files (`→ contains` edges). Unit names are resolved to path-based node IDs
via
`_pascal_resolve_unit` so they link to the same nodes produced by
`extract_pascal` on `.pas` files.
Files changed
- `graphify/extract.py`: `extract_pascal`, `extract_lazarus_form`, `extract_lazarus_package`, resolution helpers
- `graphify/detect.py`: `.pas .pp .dpr .dpk .inc .lfm .lpk` to `CODE_EXTENSIONS`
- `tests/test_pascal.py`
- `tests/fixtures/sample.pas`
- `tests/fixtures/sample.lfm`
- `tests/fixtures/sample.lpk`
Prerequisites
pip install tree-sitter-pascal
Wheel for Windows (cp38 abi3): available at
https://github.com/Isopod/tree-sitter-pascal
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>