v2.1.0 Candidate by caviri · Pull Request #42 · Imaging-Plaza/git-metadata-extractor

caviri · 2026-05-22T08:25:01Z

No description provided.

- Added new entries to .gitignore for development and internal files.
- Updated `pyproject.toml` to include `logfire` and modified dependency specifications.
- Enhanced `uv.lock` with new package versions and added `babel`, `backrefs`, and `ghp-import` packages.

Extracts and stores extra DSpace fields on OrganizationRecord so the v2 org_resolver can match labs by acronym instead of free-text search.
- acronym (from `oairecerif.acronym`), infoscience_code, unit_code, parent_acronym, director_name, org_type_dspace
- in-place schema migration adds the six columns

…ub org classifier
- org_resolver refiner pass: resolves unmatched orgs against a pre-fetched federated bundle (Infoscience DuckDB + GitHub + ROR + RAG) and reclassifies lab-flavoured Persons to Organizations via org:unitOf
- rule_based agents stamp rich provider metadata as `_`-prefixed internal fields (GitHub/ORCID/ROR/Infoscience profile data); stripped before SHACL/JSON-LD, surfaced via `?include_internal_fields=true`
- GitHub URL classifier probes the REST API to disambiguate user-vs-organization for single-segment URLs
- repo fanout caps for big orgs/users with a references-only fallback
- README fetched via REST (real content, not the 138-byte tagline)
- pulse:owns preserves well-formed external repo IRIs through reconciliation and dangling-ref pruning

- promote 6 API-verified institutional orgs into the seed: epfl-vita, EPFLiGHT, eth-sri, disco-eth, liri-uzh, irc-hslu
- add CERN-related discovery search terms to the switzerland scope

EXPERIMENTAL — proof-of-concept third v2 runtime alongside `llm` and `rule_based`. Delegates JSON-LD production to a terminal agent (pi.dev) operating on a per-repo temporary working directory. Includes per-entity agent prompts (.pi/agents/), the terminal + subagent runners, and the pi.dev extension. Not wired into the default pipeline — committed for review and further experimentation only.

Resolves the 10 v2-ci-gates failures introduced by this branch:
- agent schemas (organization/person/repository) now allow `_`-prefixed internal metadata fields via `patternProperties`; they are a deliberate feature, stripped before SHACL/JSON-LD
- fanout tests opt into eager owned-repo expansion with `V2_EXPAND_OWNED_REPOS=true` (expansion is now opt-in, off by default)
- `_DummyORCIDProvider` test double implements the `search_persons` abstract method
- `test_config` uses a genuinely invalid runtime value (`hybrid` is now a valid `AgentRuntime`)
- URL-classifier tests pin `_probe_account_type` so they are deterministic and offline; adds coverage for the org-probe path
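The `patternProperties` change in the first bullet can be pictured with a small schema fragment. This is only a sketch: the `name` property and the `PERSON_SCHEMA` and `unexpected_keys` names are hypothetical, and the helper emulates just the key-admission part of JSON Schema rather than calling a real validator.

```python
import re

# Hypothetical fragment of an agent schema: declared properties are closed
# (additionalProperties: false), but any key starting with "_" is admitted.
PERSON_SCHEMA = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "patternProperties": {"^_": {}},
    "additionalProperties": False,
}


def unexpected_keys(instance: dict, schema: dict) -> list[str]:
    """Return keys that neither `properties` nor `patternProperties` accepts.

    Only key admission is emulated here; a real JSON Schema validator would
    also check each value against the matching subschema.
    """
    declared = set(schema.get("properties", {}))
    patterns = [re.compile(p) for p in schema.get("patternProperties", {})]
    return [
        key
        for key in instance
        if key not in declared and not any(p.search(key) for p in patterns)
    ]
```

With this fragment, `{"name": "A", "_bio": "x"}` has no unexpected keys, while an undeclared key without the `_` prefix is still rejected.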

- new docs/communities-index.md: architecture, config, CLI, schema
- zenodo-index.md: scope now resolves slugs from the communities index by parent_org; document community_ids / primary_community_id columns
- rag-indices.md: note the communities support index
- mkdocs nav entry
- .env.example: document V2_EXPAND_OWNED_REPOS

`v2-models-check` flagged drift — the agent schemas gained `patternProperties` for `^_` internal fields. Only the recorded schema fingerprints change; the generated model code is unaffected.

The v1 regression tests were moved into `tests/v1/` but the workflow still referenced the old top-level `tests/test_*.py` paths, so the step errored with "file or directory not found" (exit 4). This was masked until now by the earlier v2 pytest failures.

The v1 modules read `GITHUB_TOKEN` from the environment at import time; without it pytest collection errors with `KeyError: 'GITHUB_TOKEN'`. Inject the built-in Actions token. Surfaced once the path fix let the step actually collect.

`test_v1_root_response_shape_is_unchanged` pinned the literal version `v2.0.1`; the package is now `v3.0.0`. A shape test should assert the stable title prefix, not a version that goes stale every release.

…sweep CERN, languagenet (EPFL CS-552), usisoftware-org (USI Lugano). The wider discover-orgs sweep produced 64 regex-verified candidates, but on inspection 61 were fuzzy-match false-positives (foreign universities, unrelated companies, personal accounts) — only these 3 are genuine.

v2 hybrid pipeline: refiners, communities index, index enrichment

…o excluded_entities
- `include_internal_fields` added to `V2ExtractRequest` (POST /v2/extract) and wired into `_run_extract_job`; it was missing from the POST path, only the GET endpoint supported it
- all five `V2ExtractRequest` fields now carry OpenAPI/Swagger descriptions
- `build_jsonld_output` now strips `_`-prefixed internal fields from `excluded_entities` too (previously only `@graph` honoured the flag), so `include_internal_fields=false` yields zero `_` fields anywhere and `true` keeps them everywhere: a true all-or-nothing toggle
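The all-or-nothing behaviour described for `build_jsonld_output` might look roughly like the sketch below. The function names are hypothetical, and the recursive descent into nested values is an assumption; the source only states that both `@graph` and `excluded_entities` honour the flag.

```python
from typing import Any


def strip_internal_fields(node: Any) -> Any:
    """Recursively drop every dict key that starts with "_"."""
    if isinstance(node, dict):
        return {
            key: strip_internal_fields(value)
            for key, value in node.items()
            if not key.startswith("_")
        }
    if isinstance(node, list):
        return [strip_internal_fields(item) for item in node]
    return node


def apply_internal_fields_flag(document: dict, include_internal_fields: bool) -> dict:
    """Apply the toggle to both `@graph` and `excluded_entities`."""
    if include_internal_fields:
        return document
    cleaned = dict(document)  # shallow copy so the caller's document is untouched
    for section in ("@graph", "excluded_entities"):
        if section in cleaned:
            cleaned[section] = strip_internal_fields(cleaned[section])
    return cleaned
```

Handling both sections in one loop keeps them from drifting apart again, which is the failure mode the last bullet describes.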

… orchestrator
`V2_EXPAND_OWNED_REPOS=false` was only checked in the orchestrator's materialisation step. But `context_gather` already iterated every owned repo of a user/org and ran V1 gimie on each, so submitting a user/org URL still triggered 30-90 sequential gimie sub-extractions regardless of the flag, blocking the POST handler for minutes (no job_id returned).
- `should_expand_owned_repos()` is now the single source of truth, defined in `context_gather`; the orchestrator imports it (no more divergence)
- the per-repo gimie loops in the user and organization branches of `gather_context` are gated by it: when off (default), owned repos are kept only as `pulse:owns` references and no gimie runs
- regression test asserts `get_repository` is never called with the flag off; existing eager-path tests opt in with `V2_EXPAND_OWNED_REPOS=true`

Verified: `github.com/sdsc-ordes` (88 repos) now returns job_id in 1.5s and completes in 30ms with zero per-repo gimie calls.
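A minimal sketch of the gating described above, under stated assumptions: the set of accepted truthy strings is a guess, and `gather_owned_repos` and `fetch_repository` are hypothetical stand-ins for the per-repo gimie loop, not the project's actual functions.

```python
import os

# Assumption: which strings count as "on" is not stated in the source.
_TRUTHY = {"1", "true", "yes", "on"}


def should_expand_owned_repos() -> bool:
    """Single source of truth: expansion is off unless explicitly enabled."""
    value = os.environ.get("V2_EXPAND_OWNED_REPOS", "false")
    return value.strip().lower() in _TRUTHY


def gather_owned_repos(repo_urls, fetch_repository):
    """Return (references, expanded).

    References are always kept; `fetch_repository` (the expensive per-repo
    call) runs only when the flag is on.
    """
    references = list(repo_urls)
    expanded = []
    if should_expand_owned_repos():
        expanded = [fetch_repository(url) for url in references]
    return references, expanded
```

Because both callers would consult the same function, the flag cannot be honoured in one stage and ignored in another.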

…d-internal-fields fix(v2): owned-repo gimie iteration + include_internal_fields on POST

`tools/image/Dockerfile` copied `src`, `pyproject.toml` and the gunicorn conf but never `config/`, so `/app/config` did not exist in the built image. Every RAG index provider then failed `load_config()` on the missing `config/index/*.yaml` and silently disabled itself — disciplines, Infoscience/OpenAlex/ROR/etc. enrichment all went dark. The 14 index YAMLs are git-tracked code-config; deploy-specific values (Qdrant URL, tokens, scope) remain env-overridable via INDEX_QDRANT_URL, RCP_TOKEN, INDEX_*_SCOPE — so baking the configs in does not reduce redeploy flexibility.

fix(image): copy config/ into the deployment image

`clear()` ran `DELETE FROM responses` with no WHERE clause, which triggers SQLite's truncate optimization — under it `cursor.rowcount` returns 0 regardless of how many rows were actually removed. So `POST /v2/cache/clear` always reported "Cleared 0 entries" even when it had wiped the whole cache, making it look like the clear was a no-op. Count the rows before the delete so the returned number is truthful.
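The count-before-delete fix can be sketched with the standard `sqlite3` module. The `clear_responses` name is hypothetical, and only the `responses` table name comes from the commit message.

```python
import sqlite3


def clear_responses(conn: sqlite3.Connection) -> int:
    """Delete every cached response and return how many rows were removed.

    The count is taken before the DELETE so the result does not depend on
    what the cursor reports as rowcount.
    """
    with conn:  # commits on success, rolls back on error
        (count,) = conn.execute("SELECT COUNT(*) FROM responses").fetchone()
        conn.execute("DELETE FROM responses")
    return count
```

Running the count and the delete inside one `with conn:` block keeps them in the same transaction, so the returned number matches what was actually removed.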

fix(v2 cache): report real deleted count from ProviderCache.clear()

`persist_record()` wrote every community a record lists into the `record_communities` link table, but only the scope bootstrap pass ever populated the `communities` master table. Records routinely reference communities outside the crawl scope, so those ids ended up in `record_communities` with no matching master row — observed 152 orphans against 3 master rows. Adds `ZenodoStore.ensure_community()` (insert-if-absent) and calls it for every referenced community in `persist_record`. `ON CONFLICT DO NOTHING` means a stub never overwrites richer metadata written by the bootstrap's `upsert_community`, regardless of ingest order.
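The insert-if-absent behaviour can be illustrated as follows. This sketch assumes a SQLite backend and a hypothetical two-table schema (`SCHEMA`); the real `ZenodoStore` columns and storage engine may differ.

```python
import sqlite3

# Hypothetical minimal schema, for illustration only.
SCHEMA = """
CREATE TABLE communities (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE record_communities (
    record_id TEXT NOT NULL,
    community_id TEXT NOT NULL,
    PRIMARY KEY (record_id, community_id)
);
"""


def ensure_community(conn: sqlite3.Connection, community_id: str) -> None:
    """Insert a stub master row; leave any existing row untouched."""
    conn.execute(
        "INSERT INTO communities (id) VALUES (?) ON CONFLICT(id) DO NOTHING",
        (community_id,),
    )


def persist_record_communities(conn, record_id, community_ids) -> None:
    """Link a record to its communities, creating missing master rows first."""
    with conn:
        for community_id in community_ids:
            ensure_community(conn, community_id)
            conn.execute(
                "INSERT OR IGNORE INTO record_communities (record_id, community_id) "
                "VALUES (?, ?)",
                (record_id, community_id),
            )
```

Because a conflict is ignored rather than resolved by overwriting, a stub written during ingest never replaces a richer row, whichever pass runs first.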

fix(zenodo): stop orphaning communities referenced by record_communities

Companion to the orphan fix: `ensure_community` stops *new* orphans, but existing deployments already have community ids in `record_communities` with no master row. The `query` subcommand is read-only (SELECT/WITH only — INSERT is a forbidden keyword), so a one-off SQL fix is not possible through it. `python -m src.index.zenodo backfill-communities` selects every `record_communities.community_id` absent from the `communities` master and inserts a stub row for each — instant, no network, idempotent.
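What the backfill does can be sketched in a few lines. As before, this assumes a SQLite backend and a hypothetical minimal schema; `backfill_communities` is an illustrative name, not the subcommand's actual implementation.

```python
import sqlite3

# Hypothetical minimal schema, for illustration only.
SCHEMA = """
CREATE TABLE communities (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE record_communities (record_id TEXT, community_id TEXT);
"""


def backfill_communities(conn: sqlite3.Connection) -> int:
    """Insert a stub master row for every orphaned community id.

    Returns the number of stubs created; a second run finds nothing to do.
    """
    with conn:
        missing = conn.execute(
            "SELECT DISTINCT rc.community_id FROM record_communities AS rc "
            "WHERE NOT EXISTS ("
            "  SELECT 1 FROM communities AS c WHERE c.id = rc.community_id"
            ")"
        ).fetchall()
        conn.executemany("INSERT INTO communities (id) VALUES (?)", missing)
    return len(missing)
```

The operation needs no network access and is idempotent, since it only ever inserts ids that have no master row yet.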

feat(zenodo): add `backfill-communities` CLI subcommand

`include_internal_fields=true` previously kept the raw `_`-prefixed keys (`_avatar_url`, `_bio`, …) verbatim. Those are undefined JSON-LD terms, so any RDF conversion silently drops them — the payload was JSON-visible but not loadable into a triplestore. Now each `_x` key is renamed to `gme-internal:x` and the `gme-internal` prefix (`https://openpulse.science/git-metadata-extractor#`) is registered in `@context`, so the document expands to real IRI triples. Applies to both `@graph` nodes and `excluded_entities`. This is a separate auxiliary vocabulary — it does not touch the Open Pulse ontology. Such output is valid RDF but intentionally not conformant to the closed Open Pulse SHACL shapes (a triplestore load needs no SHACL conformance). Adds `docs/gme-internal.ttl` documenting the 60 terms.
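The renaming step might look like the sketch below. The function names are hypothetical, the recursion into nested values is an assumption, and the code assumes an object-valued `@context`; only the prefix, its IRI and the two affected sections come from the description above.

```python
from typing import Any

GME_INTERNAL_PREFIX = "gme-internal"
GME_INTERNAL_IRI = "https://openpulse.science/git-metadata-extractor#"


def namespace_internal_keys(node: Any) -> Any:
    """Recursively rename "_x" keys to "gme-internal:x"."""
    if isinstance(node, dict):
        return {
            (f"{GME_INTERNAL_PREFIX}:{key[1:]}" if key.startswith("_") else key):
                namespace_internal_keys(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [namespace_internal_keys(item) for item in node]
    return node


def emit_internal_fields(document: dict) -> dict:
    """Register the prefix in @context and rename keys in both sections."""
    out = dict(document)
    context = dict(out.get("@context", {}))  # assumes an object-valued @context
    context[GME_INTERNAL_PREFIX] = GME_INTERNAL_IRI
    out["@context"] = context
    for section in ("@graph", "excluded_entities"):
        if section in out:
            out[section] = namespace_internal_keys(out[section])
    return out
```

Once the prefix is in `@context`, a key such as `gme-internal:bio` is a compact IRI that expands to `https://openpulse.science/git-metadata-extractor#bio`, so an RDF conversion keeps it instead of dropping it.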

feat(v2): emit internal fields under the gme-internal RDF namespace

`infer_github_handle_parents` fuzzy-searched ROR and attached the top-5 token-overlap hits per github org, stamping the highest-scoring one as the `org:unitOf` parent. In a real deployment this scattered 5 unrelated ROR orgs per handle and picked the wrong parent 9/10 times (`epfl-lasa` -> NCAR, `imaging-plaza` -> "Plaza Community Services").
- New `ror_parent` refiner: an LLM agent picks the single ROR org that is genuinely the parent, applying geography and coincidental-token rules, or declines. It can only return a candidate ROR id verbatim, never an invented one. Mirrors the repository-refiner pattern.
- `infer_github_handle_parents` is now async + agent-driven: it builds a recall-friendly candidate shortlist (also mining institution hints from the org's description / homepage), hands it to the selector, and inserts ONLY the chosen parent, never sibling matches. Declines leave the org standalone: a wrong parent is worse than none. rule_based runtime keeps a strict deterministic fallback (LLM-free).
- The org agent's `_select_ror_match` now requires a distinctive (non-generic) shared token, so coincidental collisions ("Center for Digital Trust" <-> "RISM Digital Center") return null and route to the selector instead of being finalised.
- Drop stale `pulse:discipline` text from the organization refiner; there is no such field on `org:Organization` in Open Pulse v2.1.2.

Live 20-extraction run: wrong ROR 9/10 -> 0/10.
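The distinctive-token rule can be illustrated with a small overlap check. This is a sketch only: the `GENERIC_TOKENS` stop-list and the function names are assumptions, not the contents of `_select_ror_match`.

```python
import re

# Assumed stop-list: tokens too common in institution names to count as evidence.
GENERIC_TOKENS = {
    "and", "center", "centre", "community", "digital", "for", "institute",
    "lab", "laboratory", "of", "research", "science", "sciences", "services",
    "the", "university",
}


def _tokens(name: str) -> set[str]:
    """Lowercase alphanumeric tokens of a name."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def has_distinctive_overlap(name_a: str, name_b: str) -> bool:
    """True only if the names share at least one non-generic token."""
    shared = _tokens(name_a) & _tokens(name_b)
    return bool(shared - GENERIC_TOKENS)
```

"Center for Digital Trust" and "RISM Digital Center" share only `center` and `digital`, both generic under this stop-list, so the pair is not accepted as a match and would be left to the selector.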

The rule-based person agent only looked up an ORCID when one was already in the context or the GitHub profile — almost never true — so persons came back with no ORCID at all (0/10 in a deployment test). When no hint is available but the person is confidently anchored to an Infoscience identity, search ORCID by the (now confirmed) name. `_pick_best_orcid_search_hit` accepts only an UNAMBIGUOUS result: a full multi-token name match resolving to exactly one ORCID iD, with affiliation corroboration as the tie-breaker. Without an Infoscience match we do not guess, to avoid attaching a stranger's ORCID — the same precision-first stance as `_pick_best_infoscience_match`. Live run: ORCID resolution 0/10 -> 5/10 (every Infoscience-matched person), with no false positives on the github-only users.
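The precision-first selection rule can be sketched as below. The hit structure (`name`, `orcid`, `affiliation` keys), the substring-based affiliation check and the function names are assumptions for illustration; they are not taken from `_pick_best_orcid_search_hit`.

```python
import re
from typing import Optional


def _name_tokens(name: str) -> frozenset:
    """Case-folded alphabetic tokens of a personal name, order-insensitive."""
    return frozenset(re.findall(r"[^\W\d_]+", name.casefold()))


def pick_unambiguous_orcid(
    confirmed_name: str,
    hits: list[dict],
    affiliation: Optional[str] = None,
) -> Optional[str]:
    """Return an ORCID iD only when the search result is unambiguous."""
    wanted = _name_tokens(confirmed_name)
    if len(wanted) < 2:
        return None  # a single-token name is never treated as a full match
    matches = [hit for hit in hits if _name_tokens(hit["name"]) == wanted]
    ids = {hit["orcid"] for hit in matches}
    if len(ids) == 1:
        return ids.pop()
    if affiliation:
        # Tie-breaker: keep only hits whose affiliation mentions the known one.
        corroborated = {
            hit["orcid"]
            for hit in matches
            if affiliation.casefold() in (hit.get("affiliation") or "").casefold()
        }
        if len(corroborated) == 1:
            return corroborated.pop()
    return None  # do not guess
```

Returning `None` whenever more than one iD survives mirrors the stance in the text: a missing ORCID is preferable to a stranger's.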

A deployment report claimed Infoscience-anchored persons lose their entire `pulse:owns` list (a "fusion bug"). Driving the real downstream stage functions — in the order `/v2/extract` runs them for a USER root in hybrid runtime — proves the claim wrong: `pulse:owns` is preserved identically whether the person's `@id` is an Infoscience or a github URL, and the rule-based person agent emits the same owns list either way. A live 20-extraction run confirmed it. These tests lock that in: if a future change lets the person's canonical `@id` scheme leak into `pulse:owns` handling, they fail.

GitHub profile READMEs (`<user>/<user>`'s README and `<org>/.github`'s `profile/README.md`) are the richest free-text "what is this account" text GitHub exposes, and routinely name the parent institution. They were never fetched: only repository extractions pulled a README, so an org/user extraction had nothing but the short API `description`.
- New `GitHubProvider.get_profile_readme(owner, *, is_organization)`: base no-op default + a real implementation (user → `<owner>/<owner>` README; org → `GET /repos/{owner}/.github/readme/profile`). Cached, graceful on the common 404 (most accounts have no profile README).
- The rule-based org and person agents fetch it directly, alongside their existing `get_organization` / `get_user` calls, and stamp it as the internal `_profile_readme` field.
- The ROR parent selector now receives it: `_org_context_for_selector` carries `profile_readme`, and the selector prompt calls out the description / profile_readme as decisive evidence for the parent.

Verified against a real org: `get_profile_readme` returns the 1047-char Imaging-Plaza profile README, and `OrganizationAgentV2.run()` stamps it. Full v2 suite: 761 passed.
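The endpoint selection and the graceful 404 handling can be sketched without any network code. Here `http_get` is a hypothetical injected transport returning `(status, body)`; the real provider's HTTP client, caching and response decoding are not shown and may differ.

```python
from typing import Callable, Optional, Tuple


def profile_readme_endpoint(owner: str, *, is_organization: bool) -> str:
    """Return the GitHub REST path that serves the account's profile README."""
    if is_organization:
        # Organization profiles live in <org>/.github under profile/README.md.
        return f"/repos/{owner}/.github/readme/profile"
    # User profiles live in the special <user>/<user> repository.
    return f"/repos/{owner}/{owner}/readme"


def get_profile_readme(
    owner: str,
    *,
    is_organization: bool,
    http_get: Callable[[str], Tuple[int, str]],  # hypothetical transport
) -> Optional[str]:
    """Fetch the profile README, treating the common 404 as "no README"."""
    path = profile_readme_endpoint(owner, is_organization=is_organization)
    status, body = http_get(path)
    if status == 404:
        return None
    if status != 200:
        raise RuntimeError(f"unexpected status {status} for {path}")
    return body
```

Injecting the transport keeps the sketch testable offline, in the same spirit as the pinned `_probe_account_type` tests mentioned earlier.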

feat(v2): resolve ROR parents with an LLM selector agent

caviri added 30 commits February 23, 2026 20:25

feat(v2): implement P0-01 strict schema promotion

4e7c2c2

feat(v2): implement P0-02 agent schema promotion

e60a3ca

feat(v2): implement P0-03 test infrastructure

8fa9881

chore: Remove organization enrichment tests from the test suite

5f3d34c

feat(v2): implement P0-04 strict schema validation tests

f7cd949

test(v2): implement P0-05 agent schema valid-fixture checks

35f1751

test(v2): add P0-06 strict negative schema validation

911c3fd

feat(v2): add mock github provider fixtures and interface

7e6f4cb

feat(v2): add mock infoscience and ror providers

26be017

feat(v2): add deterministic mock dataset generator

9c8a15e

chore(v2): normalize mock generator constants to ascii

3fb2fe8

feat(v2): add mock ORCID provider for P0-08

9da90b3

feat(v2): add cross-reference validation for P0-12

5b97d65

test(v2): add red-phase golden tests for P0-13 and P0-14

2c20d23

docs(v2): advance entry task and log P0-08/P0-12/P0-14

c910e35

feat(v2): scaffold phase-1 package skeleton

48100e5

docs(v2): advance task pointer and record p1-01 validation

ee27544

feat(v2): add config module and github url classifier

1135813

docs(v2): advance phase-1 task tracker and changelog

fcd687d

feat(v2): add P1-05 response contracts

2a74c71

feat(v2): add P1-06 error models

a0a9dc1

feat(v2): implement P1-07 and P1-08 stub endpoints

54b68b3

docs(v2): advance entry task and log P1-05 to P1-08

ed2de49

feat(v2): mount v2 router in main api

62f6f6e

feat(v2): add v2 health check endpoint

0f208ad

docs(v2): advance entry task and log P1-09 P1-10

e4a8ea7

feat(v2): implement phase-2 provider interfaces and agent wrappers

56dfc87

docs(v2): advance phase entry task and log phase-2 testing

d80dda5

feat(v2): implement phase 2 tasks p2-05 through p2-09

44d5ebd

caviri added 30 commits May 22, 2026 06:34

feat(huggingface): expand Swiss org seed + CERN discovery tokens

23442e5

- promote 6 API-verified institutional orgs into the seed: epfl-vita, EPFLiGHT, eth-sri, disco-eth, liri-uzh, irc-hslu
- add CERN-related discovery search terms to the switzerland scope

fix(v2): regenerate agent models after schema patternProperties change

cc7e05e

`v2-models-check` flagged drift — the agent schemas gained `patternProperties` for `^_` internal fields. Only the recorded schema fingerprints change; the generated model code is unaffected.

fix(v1 tests): make root-response parity check version-agnostic

437dbc7

`test_v1_root_response_shape_is_unchanged` pinned the literal version `v2.0.1`; the package is now `v3.0.0`. A shape test should assert the stable title prefix, not a version that goes stale every release.

Merge pull request #41 from Imaging-Plaza/feat/scout-usage-limits-bump

e4dc139

v2 hybrid pipeline: refiners, communities index, index enrichment

Merge pull request #43 from Imaging-Plaza/fix/v2-owned-repo-fanout-an…

f46b811

…d-internal-fields fix(v2): owned-repo gimie iteration + include_internal_fields on POST

Merge pull request #45 from Imaging-Plaza/fix/image-copy-config-dir

9467985

fix(image): copy config/ into the deployment image

Merge pull request #46 from Imaging-Plaza/fix/provider-cache-clear-count

94e4fb0

fix(v2 cache): report real deleted count from ProviderCache.clear()

Merge pull request #47 from Imaging-Plaza/fix/zenodo-orphan-communities

415b147

fix(zenodo): stop orphaning communities referenced by record_communities

Merge pull request #48 from Imaging-Plaza/fix/zenodo-orphan-communities

5e3272b

feat(zenodo): add `backfill-communities` CLI subcommand

Merge pull request #49 from Imaging-Plaza/feat/gme-internal-namespace

b7f7023

feat(v2): emit internal fields under the gme-internal RDF namespace

Merge pull request #50 from Imaging-Plaza/fix/users-orgs-extractor

607595e

feat(v2): resolve ROR parents with an LLM selector agent

caviri wants to merge 222 commits into main from develop

caviri commented May 22, 2026
