extract: AttributeError ('list' object has no attribute 'get') at merge when a semantic chunk fails — partial results discarded, semantic cache write also fails #1631

Description

@ssazy

Summary

graphify extract crashes with AttributeError: 'list' object has no attribute 'get' at the merge step when a semantic chunk fails and partial results are returned, so the results of all successful chunks are discarded. The semantic cache write fails with the same error just before the crash.

Environment

  • graphifyy 0.9.5, source build from v8 @ cf4b4ef85a72c407b5e1cb5e0678faa0497a2747
  • Python 3.12 (venv), Ubuntu 24.04
  • Backend: --backend ollama --model hermes3:8b --token-budget 4000, GRAPHIFY_OLLAMA_NUM_CTX=8192, GRAPHIFY_MAX_OUTPUT_TOKENS=3072
  • Corpus: ~119 markdown docs + 23 code files → 34 chunks

What happened

33/34 chunks succeeded; 1 chunk failed (request timeout → bisect exhausted). Then:

[graphify] WARNING: 1/34 semantic chunk(s) failed — see errors above. Partial results returned.
[graphify extract] warning: could not write semantic cache: 'list' object has no attribute 'get'
Traceback (most recent call last):
  File "~/graphify-trial/venv/bin/graphify", line 6, in <module>
    sys.exit(main())
  File ".../site-packages/graphify/__main__.py", line 4860, in main
    e.get("source_file", "") for e in sem_result.get("edges", [])
AttributeError: 'list' object has no attribute 'get'

No graph.json is written; all successful extraction work is lost. Because the semantic cache write fails too, a re-run re-extracts everything.

Root cause (from reading the source)

sem_result["edges"] can contain a list entry instead of a dict — a malformed LLM response (JSON array where an edge object belongs) that slips past response validation, apparently on the failed-chunk/partial path. The _sem_extracted comprehension at __main__.py:4860 then calls .get() on it. The semantic-cache writer iterates the same entries and fails the same way (caught, warning only).
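For illustration, a minimal sketch that reproduces the same error outside graphify. The shape of sem_result and the helper name collect_source_files are assumptions made for this example (only the .get("source_file", "") pattern is taken from the traceback); the real structures in __main__.py may differ.

```python
# Hypothetical merged semantic result. One edge entry is a list instead of
# a dict, which is the malformed case described above.
sem_result = {
    "nodes": [{"id": "a", "source_file": "docs/a.md"}],
    "edges": [
        {"source": "a", "target": "b", "source_file": "docs/a.md"},
        ["a", "b", "related_to"],  # JSON array where an edge object belongs
    ],
}


def collect_source_files(result):
    # Same access pattern as the line quoted in the traceback:
    # .get() is called on every edge entry without a type check.
    return {e.get("source_file", "") for e in result.get("edges", [])}


try:
    collect_source_files(sem_result)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

A single non-dict entry is enough to abort the whole comprehension, which would explain why 33 good chunks are lost along with the bad one.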

Suggested fix

Normalize/sanitize collected semantic results before cache-write/merge (or harden per-entry validation where fresh chunk results are extended), e.g.:

for k in ("nodes", "edges", "hyperedges"):
    sem_result[k] = [x for x in sem_result.get(k, []) if isinstance(x, dict)]

Workaround we're running

The 4-line sanitize above, inserted just before the # Merge AST + semantic ... block. With it, the same corpus completes: partial results flow through, graph.json is written, and the 1 failed chunk re-queues incrementally as designed (#933 comment behavior).
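If it helps, here is a slightly more defensive variant of the same idea as a standalone sketch. The function name sanitize_semantic_result and the dropped-entry counts are hypothetical additions, not existing graphify code; the counts are there only so that discarded entries could be reported in a warning rather than silently lost.

```python
def sanitize_semantic_result(result, keys=("nodes", "edges", "hyperedges")):
    """Return (cleaned, dropped): a copy of result with only dict entries
    kept under each key, plus how many entries were dropped per key."""
    cleaned = dict(result)  # shallow copy; the input is left untouched
    dropped = {}
    for key in keys:
        entries = result.get(key) or []
        if not isinstance(entries, list):
            # Assumption: a non-list value under these keys is unusable.
            entries = []
        kept = [x for x in entries if isinstance(x, dict)]
        cleaned[key] = kept
        dropped[key] = len(entries) - len(kept)
    return cleaned, dropped


# Usage with a hypothetical partial result containing one malformed edge.
raw = {
    "nodes": [{"id": "a"}],
    "edges": [{"source": "a", "target": "b"}, ["a", "b", "related_to"]],
}
clean, dropped = sanitize_semantic_result(raw)
print(clean["edges"])
print(dropped)
```

Returning a copy rather than mutating in place is a design choice for the sketch only; the in-place loop above is equally valid at the call site before the cache write and merge.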

Happy to provide more detail. Thanks for graphify — the local-first design (tree-sitter + ollama backend) is exactly why we adopted it.
