fix(health): surface unreadable source metas as lint findings by plind-junior · Pull Request #341 · vouchdev/vouch

plind-junior · 2026-07-02T07:40:36Z

93f5836 made storage.list_sources() skip artifacts it cannot parse so one corrupt file no longer crashes bulk listings, and pinned utf-8 so a non-utf-8 locale no longer mis-decodes healthy files into mojibake at read time. the skip leaves a gap for files that really are corrupt on disk: a hand edit or an external writer under a mismatched locale can land a raw control character pyyaml rejects, and such a meta then vanishes from the lint sweep with only a log warning while any claim citing the source is misreported as broken_citation.

lint now loads source metas per-file like _load_claims_for_lint does for claims: a meta that fails to parse becomes an unreadable_source error finding with a repair hint, and the source id (its directory name) still counts as present so citation checks point at the actual problem instead of a phantom missing source.
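
For readers without the diff open, here is a minimal sketch of the per-file collection described above. It is not the code from `src/vouch/health.py`: the `Finding` field names, the `collect_sources` name, and the `strict_parse` stand-in (used in place of the real YAML loader so the example needs only the standard library) are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class Finding:
    # Hypothetical stand-in for vouch's Finding; field names are assumed.
    severity: str
    code: str
    message: str
    ids: list[str] = field(default_factory=list)


def strict_parse(text: str) -> dict[str, str]:
    """Stand-in for the real YAML loader: rejects raw control characters."""
    for ch in text:
        if ord(ch) < 0x20 and ch not in "\t\n\r":
            raise ValueError(f"control character {ch!r} not allowed")
    pairs = (line.split(":", 1) for line in text.splitlines() if ":" in line)
    return {key.strip(): value.strip() for key, value in pairs}


def collect_sources(
    root: Path, parse: Callable[[str], object] = strict_parse
) -> tuple[set[str], list[Finding]]:
    """Load each sources/*/meta.yaml on its own so one bad file cannot hide."""
    present: set[str] = set()
    findings: list[Finding] = []
    for meta in sorted(root.glob("sources/*/meta.yaml")):
        sid = meta.parent.name  # the source id is the directory name
        present.add(sid)  # still counts as present, even if unreadable
        try:
            parse(meta.read_text(encoding="utf-8"))
        except Exception as e:
            findings.append(
                Finding(
                    "error",
                    "unreadable_source",
                    f"source {sid} ({meta}) could not be loaded: {e}",
                    [sid],
                )
            )
    return present, findings
```

The point of the design is that the id is added to `present` before parsing is attempted, so a parse failure produces a finding of its own without also making the source look missing.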

What changed

Why

What might break

VEP

Tests

make check passes locally (lint + mypy + pytest)
New / changed behaviour has a test
CHANGELOG.md updated under ## [Unreleased]

Summary by CodeRabbit

Bug Fixes
- Improved source citation checking so sources with unreadable metadata are now reported clearly instead of being silently ignored.
- Validation now distinguishes between a missing source and a source whose metadata exists but cannot be parsed, helping keep integrity checks accurate.
- Lint results now surface these unreadable-source issues directly and avoid false citation errors for sources that are still present.

coderabbitai · 2026-07-02T07:40:56Z

Walkthrough

The lint() function in health.py now uses a new helper, _collect_sources_for_lint(), to gather present source IDs from sources/*/meta.yaml, adding unreadable_source error findings for meta files that fail to parse while still treating those sources as present for citation checks. A regression test validates this behavior.

Changes

Unreadable Source Meta Lint Handling

Layer / File(s)	Summary
Collect sources and surface unreadable_source findings `src/vouch/health.py`	Adds `Source` import and `_collect_sources_for_lint()` helper that scans source meta files, records present sources, emits `unreadable_source` error findings for unparseable meta.yaml, and merges these into `lint()`'s findings and `sources_present` set.
Regression test for unreadable source meta `tests/test_health.py`	Adds a test corrupting a source's meta.yaml, asserting lint reports `unreadable_source`, marks report `ok is False`, and does not report `broken_citation` for the still-present source.

Estimated code review effort: 2 (Simple) | ~10 minutes

Sequence Diagram(s)

sequenceDiagram
    participant Test as test_health.py
    participant Lint as health.lint()
    participant Collector as _collect_sources_for_lint()
    participant FS as sources/*/meta.yaml

    Test->>Lint: run lint on corrupted repo
    Lint->>Collector: gather sources_present + findings
    Collector->>FS: read each meta.yaml
    FS-->>Collector: parse failure for corrupted file
    Collector-->>Lint: sources_present (incl. corrupted id), unreadable_source finding
    Lint->>Lint: merge findings, run citation checks using sources_present
    Lint-->>Test: report with unreadable_source finding, ok=False
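
The merge step in the diagram can be sketched as follows. This is an illustration under assumptions, not the project's code: the `Finding` fields, the `check_citations` and `lint` signatures, the shape of `claims` (claim id mapped to cited source ids), and the rule that `ok` is false whenever any error finding exists are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Finding:
    # Hypothetical stand-in for vouch's Finding; field names are assumed.
    severity: str
    code: str
    message: str
    ids: list[str] = field(default_factory=list)


def check_citations(
    claims: dict[str, list[str]], sources_present: set[str]
) -> list[Finding]:
    """Flag a citation only when the cited source id is not present at all."""
    findings: list[Finding] = []
    for claim_id, cited in sorted(claims.items()):
        for sid in cited:
            if sid not in sources_present:
                findings.append(
                    Finding(
                        "error",
                        "broken_citation",
                        f"claim {claim_id} cites missing source {sid}",
                        [claim_id],
                    )
                )
    return findings


def lint(
    claims: dict[str, list[str]],
    sources_present: set[str],
    source_findings: list[Finding],
) -> tuple[bool, list[Finding]]:
    """Merge source findings with citation findings and derive ok."""
    findings = [*source_findings, *check_citations(claims, sources_present)]
    ok = not any(f.severity == "error" for f in findings)  # assumed rule
    return ok, findings
```

With an unreadable source `s1` that is still in `sources_present`, a claim citing `s1` yields no `broken_citation`, yet the report is not ok because the `unreadable_source` finding is an error.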

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly matches the main change: health lint now surfaces unreadable source meta files as findings.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

🧹 Nitpick comments (1)

src/vouch/health.py (1)
124-159: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider distinguishing parse vs. validation failures, like claim loading does.

_load_claims_for_lint splits failures into invalid_claim (Pydantic ValidationError) and unreadable_claim (other exceptions), but _collect_sources_for_lint folds both YAML parse errors and Source validation errors into a single unreadable_source code. This is a minor asymmetry in the codebase's error-classification convention; not incorrect, since the repair hint (fix the file or re-register) applies either way, but worth aligning for consistency if you want callers to differentiate malformed-YAML vs semantically-invalid-but-parseable metadata.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/health.py` around lines 124 - 159, align _collect_sources_for_lint
with _load_claims_for_lint by separating YAML parse failures from Source model
validation failures. In the _collect_sources_for_lint loop, keep the existing
Source.model_validate path but catch the YAML loading step and the validation
step separately, emitting distinct Finding codes (for example unreadable_source
vs invalid_source) so callers can tell malformed meta.yaml apart from
semantically invalid metadata. Use the existing _load_claims_for_lint and
Source.model_validate patterns as the reference for the split.
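
A rough sketch of the split this nitpick suggests, for illustration only. The merged PR keeps a single `unreadable_source` code; `invalid_source` is the reviewer's example name, and `ValidationError`, `parse_meta`, `validate_source`, and `classify_meta` below are hypothetical stand-ins rather than the Pydantic or PyYAML APIs.

```python
from typing import Callable, Optional


class ValidationError(ValueError):
    """Hypothetical stand-in for the model validation error."""


def parse_meta(text: str) -> dict:
    """Stand-in parser: rejects raw control characters, reads key: value lines."""
    for ch in text:
        if ord(ch) < 0x20 and ch not in "\t\n\r":
            raise ValueError(f"control character {ch!r} not allowed")
    pairs = (line.split(":", 1) for line in text.splitlines() if ":" in line)
    return {key.strip(): value.strip() for key, value in pairs}


def validate_source(data: dict) -> None:
    """Stand-in validator: assumes a source meta must carry an id field."""
    if "id" not in data:
        raise ValidationError("missing required field: id")


def classify_meta(
    text: str,
    parse: Callable[[str], dict] = parse_meta,
    validate: Callable[[dict], None] = validate_source,
) -> Optional[str]:
    """Return a finding code for a meta file's text, or None if it loads."""
    try:
        data = parse(text)  # step 1: can the file be parsed at all?
    except Exception:
        return "unreadable_source"
    try:
        validate(data)  # step 2: parsed, but does it fit the model?
    except ValidationError:
        return "invalid_source"
    return None
```

Keeping the two steps in separate `try` blocks means a failure is attributed to the stage that raised it, which is what would let callers tell malformed files from parseable but invalid metadata.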

📥 Commits

Reviewing files that changed from the base of the PR and between b671bf1 and c0d74f0.

📒 Files selected for processing (2)

src/vouch/health.py
tests/test_health.py

Copilot

Pull request overview

This PR updates the health/lint sweep to detect and report unreadable sources/<id>/meta.yaml files as explicit lint findings, while still treating the source ID (directory name) as “present” so citation checks don’t incorrectly report broken_citation for a source that exists but can’t be parsed.

Changes:

Add _collect_sources_for_lint() to scan source directories one-by-one, surfacing YAML/validation/read errors as unreadable_source findings.
Update health.lint() to use the new source collection logic for citation integrity checks.
Add a regression test ensuring corrupt source metas become unreadable_source and do not trigger broken_citation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
`src/vouch/health.py`	Adds per-source meta loading for lint and reports unreadable metas as `unreadable_source` while preserving source presence for citation checks.
`tests/test_health.py`	Adds a regression test covering corrupt `meta.yaml` handling and ensuring citation reporting remains accurate.

+        except Exception as e:
+            findings.append(
+                Finding(
+                    "error",
+                    "unreadable_source",
+                    f"source {sid} ({meta}) could not be loaded: {e} — "
+                    "fix the meta.yaml by hand, or delete the source "
+                    "directory and re-register it",
+                    [sid],
+                )
+            )


Copilot AI review requested due to automatic review settings July 2, 2026 07:40

github-actions Bot added the `tests` (tests and fixtures) and `size: S` (50-199 changed non-doc lines) labels Jul 2, 2026

Copilot started reviewing on behalf of plind-junior July 2, 2026 07:41

plind-junior changed the base branch from main to test July 2, 2026 07:42

plind-junior merged commit cdf3d14 into test Jul 2, 2026
14 of 15 checks passed

coderabbitai Bot reviewed Jul 2, 2026

Copilot AI reviewed Jul 2, 2026

This was referenced Jul 2, 2026

feat(context): hybrid RRF retrieval backend fusing embedding and fts5 #343

Open

feat(timeline): kb.timeline — entity chronological trajectory (#313) #354

Closed

plind-junior merged 1 commit into test from fix/lint-unreadable-source
