fix(health): surface unreadable source metas as lint findings#341

Merged
plind-junior merged 1 commit into test from fix/lint-unreadable-source on Jul 2, 2026

Conversation

@plind-junior

@plind-junior plind-junior commented Jul 2, 2026

Collaborator

93f5836 made storage.list_sources() skip artifacts it cannot parse, so one corrupt file no longer crashes bulk listings, and pinned UTF-8 so a non-UTF-8 locale no longer mis-decodes healthy files into mojibake at read time. The skip leaves a gap for files that really are corrupt on disk: a hand edit or an external writer under a mismatched locale can land a raw control character that PyYAML rejects. Such a meta then vanishes from the lint sweep with only a log warning, while any claim citing the source is misreported as broken_citation.
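To illustrate the mojibake half of that description, here is a minimal sketch of why pinning the encoding matters. It is not taken from the commit: the file name, the sample text, and the choice of cp1252 as the stand-in for a non-UTF-8 locale default are all assumptions made for the example.

```python
import tempfile
from pathlib import Path

# Hypothetical meta content with one non-ASCII character.
text = "title: café\n"

with tempfile.TemporaryDirectory() as tmp:
    meta = Path(tmp) / "meta.yaml"
    meta.write_bytes(text.encode("utf-8"))

    # Pinned decoding: the result does not depend on the process locale.
    pinned = meta.read_text(encoding="utf-8")

    # Simulates an unpinned read under a cp1252 locale (assumed for illustration):
    # the two UTF-8 bytes of "é" are decoded as two separate characters.
    legacy = meta.read_text(encoding="cp1252")

print(repr(pinned))  # 'title: café\n'
print(repr(legacy))  # 'title: cafÃ©\n'
```

The file on disk is healthy in both reads; only the decoding differs, which is the case the pinned encoding addresses. A file that is genuinely corrupt on disk fails either way, which is the gap this PR covers.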

Lint now loads source metas per file, as _load_claims_for_lint does for claims. A meta that fails to parse becomes an unreadable_source error finding with a repair hint, and the source id (its directory name) still counts as present, so citation checks point at the actual problem instead of a phantom missing source.
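The per-file approach described above could look roughly like the following. This is a sketch under stated assumptions, not the code in src/vouch/health.py: the Finding field names, the function name collect_sources_for_lint, and the injectable parse argument are hypothetical, and json.loads stands in for the real YAML loader and Source validation so the example runs with the standard library only.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, List, Set, Tuple


@dataclass
class Finding:
    # Field names are assumed; the PR only shows positional arguments.
    severity: str
    code: str
    message: str
    ids: List[str] = field(default_factory=list)


def collect_sources_for_lint(
    root: Path, parse: Callable[[str], object] = json.loads
) -> Tuple[Set[str], List[Finding]]:
    """Scan sources/<id>/meta.yaml one file at a time."""
    present: Set[str] = set()
    findings: List[Finding] = []
    for meta in sorted((root / "sources").glob("*/meta.yaml")):
        sid = meta.parent.name
        # The meta file exists, so the id counts as present even if it is corrupt.
        present.add(sid)
        try:
            parse(meta.read_text(encoding="utf-8"))
        except Exception as e:  # any load failure becomes a finding, not a crash
            findings.append(
                Finding(
                    "error",
                    "unreadable_source",
                    f"source {sid} ({meta}) could not be loaded: {e}",
                    [sid],
                )
            )
    return present, findings


# Usage: one healthy meta and one containing a raw control character.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    bodies = {"good-src": '{"title": "ok"}', "bad-src": '{"title": \x00'}
    for sid, body in bodies.items():
        (root / "sources" / sid).mkdir(parents=True)
        (root / "sources" / sid / "meta.yaml").write_text(body, encoding="utf-8")
    present, findings = collect_sources_for_lint(root)
```

Adding the id to the present set before attempting the parse is the design choice that keeps a corrupt meta from turning into a missing source.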

What changed

Why

What might break

VEP

Tests

  • make check passes locally (lint + mypy + pytest)
  • New / changed behaviour has a test
  • CHANGELOG.md updated under ## [Unreleased]

Summary by CodeRabbit

  • Bug Fixes
    • Improved source citation checking so sources with unreadable metadata are now reported clearly instead of being silently ignored.
    • Validation now distinguishes between a missing source and a source whose metadata exists but cannot be parsed, helping keep integrity checks accurate.
    • Lint results now surface these unreadable-source issues directly and avoid false citation errors for sources that are still present.

Copilot AI review requested due to automatic review settings July 2, 2026 07:40
@github-actions github-actions Bot added the labels "tests" (tests and fixtures) and "size: S" (50-199 changed non-doc lines) Jul 2, 2026
@coderabbitai

coderabbitai Bot commented Jul 2, 2026


📝 Walkthrough


The lint() function in health.py now uses a new helper, _collect_sources_for_lint(), to gather present source IDs from sources/*/meta.yaml, adding unreadable_source error findings for meta files that fail to parse while still treating those sources as present for citation checks. A regression test validates this behavior.
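The effect on citation checks can be sketched as follows. The check_citations helper, the claim and source ids, and the Finding field names are hypothetical and only illustrate the idea; the actual check inside lint() may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Finding:
    # Field names are assumed for illustration.
    severity: str
    code: str
    message: str
    ids: List[str] = field(default_factory=list)


def check_citations(
    claims: Dict[str, List[str]], sources_present: Set[str]
) -> List[Finding]:
    """Report every cited source id that is not in sources_present."""
    findings: List[Finding] = []
    for claim_id, cited in sorted(claims.items()):
        for sid in cited:
            if sid not in sources_present:
                findings.append(
                    Finding(
                        "error",
                        "broken_citation",
                        f"claim {claim_id} cites missing source {sid}",
                        [claim_id, sid],
                    )
                )
    return findings


claims = {"claim-1": ["src-a"], "claim-2": ["src-b"]}

# Previous behaviour: src-b has a corrupt meta, is skipped, and drops out of the set.
before = check_citations(claims, {"src-a"})

# Behaviour after this PR: the directory name still counts as present.
after = check_citations(claims, {"src-a", "src-b"})

print([f.code for f in before])  # ['broken_citation']
print(after)                     # []
```

With the corrupt source kept in the set, the only finding left for src-b is the unreadable_source error raised by the collector, which names the real problem.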

Changes

Unreadable Source Meta Lint Handling

| Layer / File(s) | Summary |
|---|---|
| Collect sources and surface unreadable_source findings (src/vouch/health.py) | Adds a Source import and a _collect_sources_for_lint() helper that scans source meta files, records present sources, emits unreadable_source error findings for unparseable meta.yaml files, and merges these into lint()'s findings and sources_present set. |
| Regression test for unreadable source meta (tests/test_health.py) | Adds a test that corrupts a source's meta.yaml and asserts that lint reports unreadable_source, that the report's ok is False, and that broken_citation is not reported for the still-present source. |

Estimated code review effort: 2 (Simple) | ~10 minutes

Sequence Diagram(s)

sequenceDiagram
    participant Test as test_health.py
    participant Lint as health.lint()
    participant Collector as _collect_sources_for_lint()
    participant FS as sources/*/meta.yaml

    Test->>Lint: run lint on corrupted repo
    Lint->>Collector: gather sources_present + findings
    Collector->>FS: read each meta.yaml
    FS-->>Collector: parse failure for corrupted file
    Collector-->>Lint: sources_present (incl. corrupted id), unreadable_source finding
    Lint->>Lint: merge findings, run citation checks using sources_present
    Lint-->>Test: report with unreadable_source finding, ok=False

Poem

  • A rabbit hopped through meta files at dawn,
  • Found one gnarled and torn, its YAML withdrawn,
  • "Unreadable!" it cried, but marked it present still,
  • So citations stayed calm upon the hill,
  • One finding added, the lint report drawn.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly matches the main change: health lint now surfaces unreadable source meta files as findings. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@plind-junior plind-junior changed the base branch from main to test July 2, 2026 07:42
@plind-junior plind-junior merged commit cdf3d14 into test Jul 2, 2026
14 of 15 checks passed

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
src/vouch/health.py (1)

124-159: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider distinguishing parse vs. validation failures, like claim loading does.

_load_claims_for_lint splits failures into invalid_claim (Pydantic ValidationError) and unreadable_claim (other exceptions), but _collect_sources_for_lint folds both YAML parse errors and Source validation errors into a single unreadable_source code. This is a minor asymmetry in the codebase's error-classification convention; not incorrect, since the repair hint (fix the file or re-register) applies either way, but worth aligning for consistency if you want callers to differentiate malformed-YAML vs semantically-invalid-but-parseable metadata.
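The split suggested here might be sketched like this. Everything in the example is an assumption made for illustration: json.loads stands in for the YAML loader, SourceValidationError and validate_source stand in for Pydantic's ValidationError and Source.model_validate, the required "title" field is invented, and invalid_source is only the example code name proposed in the review, not one the codebase defines.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Finding:
    # Field names are assumed for illustration.
    severity: str
    code: str
    message: str
    ids: List[str] = field(default_factory=list)


class SourceValidationError(ValueError):
    """Hypothetical stand-in for a model validation error."""


def validate_source(raw: object) -> dict:
    """Hypothetical stand-in for model validation; requires a 'title' key."""
    if not isinstance(raw, dict) or "title" not in raw:
        raise SourceValidationError("missing required field: title")
    return raw


def classify_meta(sid: str, text: str) -> Optional[Finding]:
    """Return None for a healthy meta, else a finding naming the failure kind."""
    try:
        raw = json.loads(text)  # parse step
    except Exception as e:
        return Finding(
            "error", "unreadable_source", f"source {sid} could not be parsed: {e}", [sid]
        )
    try:
        validate_source(raw)  # validation step, caught separately
    except SourceValidationError as e:
        return Finding(
            "error", "invalid_source", f"source {sid} failed validation: {e}", [sid]
        )
    return None


healthy = classify_meta("src-a", '{"title": "ok"}')
malformed = classify_meta("src-b", "{")
incomplete = classify_meta("src-c", "{}")
```

Here malformed carries the unreadable_source code and incomplete carries invalid_source, so a caller could tell malformed files apart from parseable but semantically invalid metadata.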

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vouch/health.py` around lines 124 - 159, align _collect_sources_for_lint
with _load_claims_for_lint by separating YAML parse failures from Source model
validation failures. In the _collect_sources_for_lint loop, keep the existing
Source.model_validate path but catch the YAML loading step and the validation
step separately, emitting distinct Finding codes (for example unreadable_source
vs invalid_source) so callers can tell malformed meta.yaml apart from
semantically invalid metadata. Use the existing _load_claims_for_lint and
Source.model_validate patterns as the reference for the split.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: cc6b599e-7e4b-4a46-bea7-f2a549bfe773

📥 Commits

Reviewing files that changed from the base of the PR and between b671bf1 and c0d74f0.

📒 Files selected for processing (2)
  • src/vouch/health.py
  • tests/test_health.py

Copilot AI left a comment


Pull request overview

This PR updates the health/lint sweep to detect and report unreadable sources/<id>/meta.yaml files as explicit lint findings, while still treating the source ID (directory name) as “present” so citation checks don’t incorrectly report broken_citation for a source that exists but can’t be parsed.

Changes:

  • Add _collect_sources_for_lint() to scan source directories one-by-one, surfacing YAML/validation/read errors as unreadable_source findings.
  • Update health.lint() to use the new source collection logic for citation integrity checks.
  • Add a regression test ensuring corrupt source metas become unreadable_source and do not trigger broken_citation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| src/vouch/health.py | Adds per-source meta loading for lint and reports unreadable metas as unreadable_source while preserving source presence for citation checks. |
| tests/test_health.py | Adds a regression test covering corrupt meta.yaml handling and ensuring citation reporting remains accurate. |


Comment thread src/vouch/health.py
Comment on lines +148 to +158
except Exception as e:
    findings.append(
        Finding(
            "error",
            "unreadable_source",
            f"source {sid} ({meta}) could not be loaded: {e} — "
            "fix the meta.yaml by hand, or delete the source "
            "directory and re-register it",
            [sid],
        )
    )

Labels

size: S 50-199 changed non-doc lines tests tests and fixtures


2 participants