Split extract.py (11.6k lines, 45 language extractors) into per-language modules #1212

Description

@nucleusjay

Summary

graphify/extract.py is 11,657 lines long and holds all 45 language extractors in one file. graphify/__main__.py follows the same pattern at a smaller scale: a single 2,500-line main() with 40 command branches in one if/elif chain.

This is the largest maintenance risk in the repo. Reviewing a one-language change means loading the entire file, individual extractors are hard to test in isolation, and a contributor adding a 46th language has to navigate the whole file to find the right place.

Proposed shape

graphify/
  extractors/
    __init__.py       # registry: LANGUAGE_EXTRACTORS = {"python": PythonExtractor, ...}
    base.py           # shared Extractor protocol + common helpers
    python.py
    javascript.py
    typescript.py
    ...
    rust.py
  extract.py          # thin orchestrator: pick extractor by language, run it
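
A minimal sketch of how the registry and thin orchestrator could fit together. Only LANGUAGE_EXTRACTORS and PythonExtractor come from the layout above; the Extractor protocol's extract(source) signature, the return type, and the toy parsing logic are assumptions for illustration, not the current extract.py API.

```python
from __future__ import annotations

from typing import Protocol


class Extractor(Protocol):
    """Hypothetical shared interface (base.py); the real signature may differ."""

    def extract(self, source: str) -> list[str]: ...


class PythonExtractor:
    """Toy stand-in for extractors/python.py."""

    def extract(self, source: str) -> list[str]:
        # Illustrative only: collect the names of top-level `def` statements.
        names = []
        for line in source.splitlines():
            if line.startswith("def "):
                names.append(line[4:].split("(")[0].strip())
        return names


# extractors/__init__.py: one entry per language module.
LANGUAGE_EXTRACTORS: dict[str, type[Extractor]] = {"python": PythonExtractor}


def extract(language: str, source: str) -> list[str]:
    """Thin orchestrator: pick the extractor by language, then run it."""
    try:
        extractor_cls = LANGUAGE_EXTRACTORS[language]
    except KeyError:
        raise ValueError(f"no extractor registered for {language!r}") from None
    return extractor_cls().extract(source)
```

With this shape, adding a language means adding one module and one registry entry, and each extractor can be imported and tested on its own.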

Same pattern for __main__.py:

graphify/
  commands/
    __init__.py       # COMMANDS = {"merge-chunks": merge_chunks, ...}
    merge_chunks.py
    merge_semantic.py
    ...
  __main__.py         # thin dispatcher
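
A rough sketch of the dispatcher side, assuming each command is a function that takes the remaining arguments and returns an exit code. COMMANDS and merge_chunks are named in the layout above; the "merge-semantic" command string, the handler signature, and the exit codes are assumptions.

```python
from __future__ import annotations

import sys
from typing import Callable


def merge_chunks(args: list[str]) -> int:
    # Placeholder body; the real command would live in commands/merge_chunks.py.
    print(f"merge-chunks: {len(args)} argument(s)")
    return 0


def merge_semantic(args: list[str]) -> int:
    # Placeholder body; the real command would live in commands/merge_semantic.py.
    print(f"merge-semantic: {len(args)} argument(s)")
    return 0


# commands/__init__.py: command name -> handler (names partly hypothetical).
COMMANDS: dict[str, Callable[[list[str]], int]] = {
    "merge-chunks": merge_chunks,
    "merge-semantic": merge_semantic,
}


def main(argv: list[str]) -> int:
    """Thin dispatcher replacing the if/elif chain."""
    if not argv or argv[0] not in COMMANDS:
        known = ", ".join(sorted(COMMANDS))
        print(f"unknown or missing command; expected one of: {known}", file=sys.stderr)
        return 2
    return COMMANDS[argv[0]](argv[1:])


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

Each handler stays small enough to review and test in isolation, and an unknown command fails in one place instead of falling through a long chain.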

Why this is worth doing now

  • The hand-rolled if/elif dispatch in __main__.py already produced a bug fixed in #1207, "Fix two latent bugs: merge-chunks output and manifest data loss" (a typo'd len()) -- the kind of bug that's harder to spot in a 2.5k-line function than in a 30-line one.
  • File-level git blame and history are close to useless on extract.py, because changes to any language land in the same file.
  • A per-language file makes it obvious which languages are well-tested and which aren't.

Risk

Big diff. Not a one-PR job. Suggest doing it language-by-language: extract one extractor, prove the registry works, then move the others in batches.
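
One way the language-by-language migration could work, sketched under assumptions: the orchestrator consults the registry first and falls back to the existing monolithic code path for languages that have not been moved yet. The names MIGRATED, _legacy_extract, and _extract_python are hypothetical, and the return values are placeholders.

```python
from __future__ import annotations

from typing import Callable

ExtractFn = Callable[[str], list[str]]


def _legacy_extract(language: str, source: str) -> list[str]:
    # Stand-in for the existing monolithic code path in extract.py.
    return [f"legacy:{language}"]


def _extract_python(source: str) -> list[str]:
    # Stand-in for the first extractor moved into its own module.
    return ["registry:python"]


# Grows by one entry per migrated language; empty before the first PR.
MIGRATED: dict[str, ExtractFn] = {"python": _extract_python}


def extract(language: str, source: str) -> list[str]:
    """Prefer a migrated extractor; otherwise use the legacy path unchanged."""
    fn = MIGRATED.get(language)
    if fn is not None:
        return fn(source)
    return _legacy_extract(language, source)
```

Once every language has an entry, the fallback and the legacy function can be deleted in a final small PR.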

Scope check

Worth confirming with a maintainer before anyone starts -- if there's a deliberate reason for the monolithic shape (e.g. cold-start time, single-file installability), close this and document the reason.

Surfaced during an external code review pass.
