fix(health): surface unreadable source metas as lint findings#341
Conversation
93f5836 made storage.list_sources() skip artifacts it cannot parse so one corrupt file no longer crashes bulk listings, and pinned utf-8 so a non-utf-8 locale no longer mis-decodes healthy files into mojibake at read time. the skip leaves a gap for files that really are corrupt on disk: a hand edit or an external writer under a mismatched locale can land a raw control character pyyaml rejects, and such a meta then vanishes from the lint sweep with only a log warning while any claim citing the source is misreported as broken_citation. lint now loads source metas per-file like _load_claims_for_lint does for claims: a meta that fails to parse becomes an unreadable_source error finding with a repair hint, and the source id (its directory name) still counts as present so citation checks point at the actual problem instead of a phantom missing source.
📝 WalkthroughWalkthroughThe lint() function in health.py now uses a new helper, ChangesUnreadable Source Meta Lint Handling
Estimated code review effort: 2 (Simple) | ~10 minutes Sequence Diagram(s)sequenceDiagram
participant Test as test_health.py
participant Lint as health.lint()
participant Collector as _collect_sources_for_lint()
participant FS as sources/*/meta.yaml
Test->>Lint: run lint on corrupted repo
Lint->>Collector: gather sources_present + findings
Collector->>FS: read each meta.yaml
FS-->>Collector: parse failure for corrupted file
Collector-->>Lint: sources_present (incl. corrupted id), unreadable_source finding
Lint->>Lint: merge findings, run citation checks using sources_present
Lint-->>Test: report with unreadable_source finding, ok=False
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/vouch/health.py (1)
124-159: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueConsider distinguishing parse vs. validation failures, like claim loading does.
_load_claims_for_lintsplits failures intoinvalid_claim(PydanticValidationError) andunreadable_claim(other exceptions), but_collect_sources_for_lintfolds both YAML parse errors andSourcevalidation errors into a singleunreadable_sourcecode. This is a minor asymmetry in the codebase's error-classification convention; not incorrect, since the repair hint (fix the file or re-register) applies either way, but worth aligning for consistency if you want callers to differentiate malformed-YAML vs semantically-invalid-but-parseable metadata.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vouch/health.py` around lines 124 - 159, _align _collect_sources_for_lint with _load_claims_for_lint by separating YAML parse failures from Source model validation failures. In the _collect_sources_for_lint loop, keep the existing Source.model_validate path but catch the YAML loading step and the validation step separately, emitting distinct Finding codes (for example unreadable_source vs invalid_source) so callers can tell malformed meta.yaml apart from semantically invalid metadata. Use the existing _load_claims_for_lint and Source.model_validate patterns as the reference for the split._
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/vouch/health.py`:
- Around line 124-159: _align _collect_sources_for_lint with
_load_claims_for_lint by separating YAML parse failures from Source model
validation failures. In the _collect_sources_for_lint loop, keep the existing
Source.model_validate path but catch the YAML loading step and the validation
step separately, emitting distinct Finding codes (for example unreadable_source
vs invalid_source) so callers can tell malformed meta.yaml apart from
semantically invalid metadata. Use the existing _load_claims_for_lint and
Source.model_validate patterns as the reference for the split._
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: cc6b599e-7e4b-4a46-bea7-f2a549bfe773
📒 Files selected for processing (2)
src/vouch/health.pytests/test_health.py
There was a problem hiding this comment.
Pull request overview
This PR updates the health/lint sweep to detect and report unreadable sources/<id>/meta.yaml files as explicit lint findings, while still treating the source ID (directory name) as “present” so citation checks don’t incorrectly report broken_citation for a source that exists but can’t be parsed.
Changes:
- Add
_collect_sources_for_lint()to scan source directories one-by-one, surfacing YAML/validation/read errors asunreadable_sourcefindings. - Update
health.lint()to use the new source collection logic for citation integrity checks. - Add a regression test ensuring corrupt source metas become
unreadable_sourceand do not triggerbroken_citation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/vouch/health.py |
Adds per-source meta loading for lint and reports unreadable metas as unreadable_source while preserving source presence for citation checks. |
tests/test_health.py |
Adds a regression test covering corrupt meta.yaml handling and ensuring citation reporting remains accurate. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| except Exception as e: | ||
| findings.append( | ||
| Finding( | ||
| "error", | ||
| "unreadable_source", | ||
| f"source {sid} ({meta}) could not be loaded: {e} — " | ||
| "fix the meta.yaml by hand, or delete the source " | ||
| "directory and re-register it", | ||
| [sid], | ||
| ) | ||
| ) |
93f5836 made storage.list_sources() skip artifacts it cannot parse so one corrupt file no longer crashes bulk listings, and pinned utf-8 so a non-utf-8 locale no longer mis-decodes healthy files into mojibake at read time. the skip leaves a gap for files that really are corrupt on disk: a hand edit or an external writer under a mismatched locale can land a raw control character pyyaml rejects, and such a meta then vanishes from the lint sweep with only a log warning while any claim citing the source is misreported as broken_citation.
lint now loads source metas per-file like _load_claims_for_lint does for claims: a meta that fails to parse becomes an unreadable_source error finding with a repair hint, and the source id (its directory name) still counts as present so citation checks point at the actual problem instead of a phantom missing source.
What changed
Why
What might break
VEP
Tests
make checkpasses locally (lint + mypy + pytest)CHANGELOG.mdupdated under## [Unreleased]Summary by CodeRabbit