v2.1.0 Candidate #42

Open
caviri wants to merge 222 commits into main from develop

caviri commented May 22, 2026

No description provided.

- Added new entries to .gitignore for development and internal files.
- Updated `pyproject.toml` to include `logfire` and modified dependency specifications.
- Enhanced `uv.lock` with new package versions and added `babel`, `backrefs`, and `ghp-import` packages.
caviri added 30 commits May 22, 2026 06:34
Extracts and stores extra DSpace fields on OrganizationRecord so the v2
org_resolver can match labs by acronym instead of free-text search.

- acronym (from `oairecerif.acronym`), infoscience_code, unit_code,
  parent_acronym, director_name, org_type_dspace
- in-place schema migration adds the six columns
…ub org classifier

- org_resolver refiner pass: resolves unmatched orgs against a
  pre-fetched federated bundle (Infoscience DuckDB + GitHub + ROR + RAG)
  and reclassifies lab-flavoured Persons to Organizations via org:unitOf
- rule_based agents stamp rich provider metadata as `_`-prefixed internal
  fields (GitHub/ORCID/ROR/Infoscience profile data); stripped before
  SHACL/JSON-LD, surfaced via `?include_internal_fields=true`
- GitHub URL classifier probes the REST API to disambiguate
  user-vs-organization for single-segment URLs
- repo fanout caps for big orgs/users with a references-only fallback
- README fetched via REST (real content, not the 138-byte tagline)
- pulse:owns preserves well-formed external repo IRIs through
  reconciliation and dangling-ref pruning
- promote 6 API-verified institutional orgs into the seed: epfl-vita,
  EPFLiGHT, eth-sri, disco-eth, liri-uzh, irc-hslu
- add CERN-related discovery search terms to the switzerland scope
EXPERIMENTAL — proof-of-concept third v2 runtime alongside `llm` and
`rule_based`. Delegates JSON-LD production to a terminal agent (pi.dev)
operating on a per-repo temporary working directory.

Includes per-entity agent prompts (.pi/agents/), the terminal + subagent
runners, and the pi.dev extension. Not wired into the default pipeline —
committed for review and further experimentation only.
Resolves the 10 v2-ci-gates failures introduced by this branch:

- agent schemas (organization/person/repository) now allow `_`-prefixed
  internal metadata fields via `patternProperties` — they are a
  deliberate feature, stripped before SHACL/JSON-LD
- fanout tests opt into eager owned-repo expansion with
  `V2_EXPAND_OWNED_REPOS=true` (expansion is now opt-in by default)
- `_DummyORCIDProvider` test double implements the `search_persons`
  abstract method
- `test_config` uses a genuinely invalid runtime value (`hybrid` is now
  a valid `AgentRuntime`)
- URL-classifier tests pin `_probe_account_type` so they are
  deterministic and offline; adds coverage for the org-probe path
- new docs/communities-index.md — architecture, config, CLI, schema
- zenodo-index.md: scope now resolves slugs from the communities index
  by parent_org; document community_ids / primary_community_id columns
- rag-indices.md: note the communities support index
- mkdocs nav entry
- .env.example: document V2_EXPAND_OWNED_REPOS
`v2-models-check` flagged drift — the agent schemas gained
`patternProperties` for `^_` internal fields. Only the recorded schema
fingerprints change; the generated model code is unaffected.
The v1 regression tests were moved into `tests/v1/` but the workflow
still referenced the old top-level `tests/test_*.py` paths, so the step
errored with "file or directory not found" (exit 4). This was masked
until now by the earlier v2 pytest failures.
The v1 modules read `GITHUB_TOKEN` from the environment at import time;
without it pytest collection errors with `KeyError: 'GITHUB_TOKEN'`.
Inject the built-in Actions token. Surfaced once the path fix let the
step actually collect.
`test_v1_root_response_shape_is_unchanged` pinned the literal version
`v2.0.1`; the package is now `v3.0.0`. A shape test should assert the
stable title prefix, not a version that goes stale every release.
…sweep

CERN, languagenet (EPFL CS-552), usisoftware-org (USI Lugano). The wider
discover-orgs sweep produced 64 regex-verified candidates, but on
inspection 61 were fuzzy-match false-positives (foreign universities,
unrelated companies, personal accounts) — only these 3 are genuine.
v2 hybrid pipeline: refiners, communities index, index enrichment
…o excluded_entities

- `include_internal_fields` added to `V2ExtractRequest` (POST /v2/extract)
  and wired into `_run_extract_job` — it was missing from the POST path,
  only the GET endpoint supported it
- all five `V2ExtractRequest` fields now carry OpenAPI/Swagger descriptions
- `build_jsonld_output` now strips `_`-prefixed internal fields from
  `excluded_entities` too (previously only `@graph` honoured the flag),
  so `include_internal_fields=false` yields zero `_` fields anywhere and
  `true` keeps them everywhere — a true all-or-nothing toggle
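
A minimal sketch of the all-or-nothing toggle described above, assuming the output document is a plain dict with `@graph` and `excluded_entities` sections. The helper names `strip_internal` and `apply_internal_fields_toggle` are hypothetical and are not the project's `build_jsonld_output`:

```python
from typing import Any


def strip_internal(node: Any) -> Any:
    """Recursively drop dict keys that start with "_" (internal metadata)."""
    if isinstance(node, dict):
        return {
            key: strip_internal(value)
            for key, value in node.items()
            if not key.startswith("_")
        }
    if isinstance(node, list):
        return [strip_internal(item) for item in node]
    return node


def apply_internal_fields_toggle(document: dict, include_internal_fields: bool) -> dict:
    """Hypothetical helper: treat @graph and excluded_entities identically."""
    if include_internal_fields:
        return document  # keep internal fields everywhere
    out = dict(document)  # shallow copy; the input is left untouched
    for section in ("@graph", "excluded_entities"):
        if section in out:
            out[section] = strip_internal(out[section])
    return out
```

Looping over both sections in one place is what keeps the flag from applying to `@graph` but not to `excluded_entities`.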
… orchestrator

`V2_EXPAND_OWNED_REPOS=false` was only checked in the orchestrator's
materialisation step. But `context_gather` already iterated every owned
repo of a user/org and ran V1 gimie on each — so submitting a user/org
URL still triggered 30-90 sequential gimie sub-extractions regardless of
the flag, blocking the POST handler for minutes (no job_id returned).

- `should_expand_owned_repos()` is now the single source of truth, defined
  in `context_gather`; the orchestrator imports it (no more divergence)
- the per-repo gimie loops in the user and organization branches of
  `gather_context` are gated by it — when off (default), owned repos are
  kept only as `pulse:owns` references and no gimie runs
- regression test asserts `get_repository` is never called with the flag
  off; existing eager-path tests opt in with `V2_EXPAND_OWNED_REPOS=true`
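
The gating described in the list above could look roughly like this. The accepted truthy spellings, the `gather_owned_repos` helper, and the `fetch_repo` callable are assumptions made for illustration; only the `V2_EXPAND_OWNED_REPOS` variable and the `should_expand_owned_repos` name come from the commit message:

```python
import os
from typing import Callable, Iterable

# Assumption: which spellings count as "on" is not stated in the commit.
_TRUTHY = {"1", "true", "yes", "on"}


def should_expand_owned_repos() -> bool:
    """Single place that reads the flag; expansion is off unless opted in."""
    value = os.environ.get("V2_EXPAND_OWNED_REPOS", "false")
    return value.strip().lower() in _TRUTHY


def gather_owned_repos(
    repo_urls: Iterable[str],
    fetch_repo: Callable[[str], dict],
) -> list[dict]:
    """Hypothetical stand-in for the per-repo loop in `gather_context`."""
    urls = list(repo_urls)
    # Every owned repo is always kept as a bare reference.
    owned = [{"@id": url} for url in urls]
    if should_expand_owned_repos():
        # Only the opt-in path pays for one sub-extraction per repo.
        for ref, url in zip(owned, urls):
            ref.update(fetch_repo(url))
    return owned
```

With the flag unset, `fetch_repo` is never invoked, which is the property the regression test in the commit asserts for `get_repository`.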

Verified: `github.com/sdsc-ordes` (88 repos) now returns job_id in 1.5s
and completes in 30ms with zero per-repo gimie calls.
…d-internal-fields

fix(v2): owned-repo gimie iteration + include_internal_fields on POST
`tools/image/Dockerfile` copied `src`, `pyproject.toml` and the gunicorn
conf but never `config/`, so `/app/config` did not exist in the built
image. Every RAG index provider then failed `load_config()` on the
missing `config/index/*.yaml` and silently disabled itself — disciplines,
Infoscience/OpenAlex/ROR/etc. enrichment all went dark.

The 14 index YAMLs are git-tracked code-config; deploy-specific values
(Qdrant URL, tokens, scope) remain env-overridable via INDEX_QDRANT_URL,
RCP_TOKEN, INDEX_*_SCOPE — so baking the configs in does not reduce
redeploy flexibility.
fix(image): copy config/ into the deployment image
`clear()` ran `DELETE FROM responses` with no WHERE clause, which
triggers SQLite's truncate optimization — under it `cursor.rowcount`
returns 0 regardless of how many rows were actually removed. So
`POST /v2/cache/clear` always reported "Cleared 0 entries" even when it
had wiped the whole cache, making it look like the clear was a no-op.

Count the rows before the delete so the returned number is truthful.
fix(v2 cache): report real deleted count from ProviderCache.clear()
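
A sketch of the count-before-delete approach, assuming a plain `sqlite3` connection and the `responses` table named in the commit. The standalone function `clear_responses` is hypothetical and stands in for `ProviderCache.clear()`:

```python
import sqlite3


def clear_responses(conn: sqlite3.Connection) -> int:
    """Delete every cached response and return how many rows were removed."""
    # Count first, so the result does not depend on cursor.rowcount.
    (count,) = conn.execute("SELECT COUNT(*) FROM responses").fetchone()
    with conn:  # commits on success, rolls back if the DELETE raises
        conn.execute("DELETE FROM responses")
    return count
```

The returned number is taken from an explicit `COUNT(*)`, so it stays correct whatever the driver reports for an unqualified `DELETE`.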
`persist_record()` wrote every community a record lists into the
`record_communities` link table, but only the scope bootstrap pass ever
populated the `communities` master table. Records routinely reference
communities outside the crawl scope, so those ids ended up in
`record_communities` with no matching master row — observed 152 orphans
against 3 master rows.

Adds `ZenodoStore.ensure_community()` (insert-if-absent) and calls it for
every referenced community in `persist_record`. `ON CONFLICT DO NOTHING`
means a stub never overwrites richer metadata written by the bootstrap's
`upsert_community`, regardless of ingest order.
fix(zenodo): stop orphaning communities referenced by record_communities
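
To illustrate why ingest order does not matter, here is a sketch assuming a SQLite-compatible store and a `communities(id, title)` table. The column names and the standalone functions are assumptions, not the actual `ZenodoStore` methods:

```python
import sqlite3

# Assumed minimal schema; the real master table likely has more columns.
SCHEMA = "CREATE TABLE IF NOT EXISTS communities (id TEXT PRIMARY KEY, title TEXT)"


def ensure_community(conn: sqlite3.Connection, community_id: str) -> None:
    """Insert a stub row only if the community is not present yet."""
    conn.execute(
        "INSERT INTO communities (id) VALUES (?) ON CONFLICT(id) DO NOTHING",
        (community_id,),
    )


def upsert_community(conn: sqlite3.Connection, community_id: str, title: str) -> None:
    """Bootstrap-style write: always stores the richer metadata."""
    conn.execute(
        "INSERT INTO communities (id, title) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET title = excluded.title",
        (community_id, title),
    )
```

If the stub arrives first, the later upsert fills in the title; if the upsert arrives first, the stub insert is a no-op.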
Companion to the orphan fix: `ensure_community` stops *new* orphans, but
existing deployments already have community ids in `record_communities`
with no master row. The `query` subcommand is read-only (SELECT/WITH
only — INSERT is a forbidden keyword), so a one-off SQL fix is not
possible through it.

`python -m src.index.zenodo backfill-communities` selects every
`record_communities.community_id` absent from the `communities` master
and inserts a stub row for each — instant, no network, idempotent.
feat(zenodo): add `backfill-communities` CLI subcommand
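
The backfill amounts to an anti-join followed by stub inserts. A sketch under the assumption of a SQLite-compatible store in which `communities.id` is the key column (the column name is assumed; the function is illustrative, not the CLI's actual code):

```python
import sqlite3


def backfill_communities(conn: sqlite3.Connection) -> int:
    """Insert a stub master row for every orphaned community id."""
    missing = conn.execute(
        "SELECT DISTINCT rc.community_id "
        "FROM record_communities AS rc "
        "WHERE rc.community_id IS NOT NULL "
        "AND NOT EXISTS ("
        "  SELECT 1 FROM communities AS c WHERE c.id = rc.community_id"
        ")"
    ).fetchall()
    with conn:  # commit all stubs together
        conn.executemany("INSERT INTO communities (id) VALUES (?)", missing)
    return len(missing)
```

A second run finds no orphans and inserts nothing, which is what makes the command safe to repeat.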
`include_internal_fields=true` previously kept the raw `_`-prefixed keys
(`_avatar_url`, `_bio`, …) verbatim. Those are undefined JSON-LD terms,
so any RDF conversion silently drops them — the payload was JSON-visible
but not loadable into a triplestore.

Now each `_x` key is renamed to `gme-internal:x` and the `gme-internal`
prefix (`https://openpulse.science/git-metadata-extractor#`) is
registered in `@context`, so the document expands to real IRI triples.
Applies to both `@graph` nodes and `excluded_entities`.

This is a separate auxiliary vocabulary — it does not touch the Open
Pulse ontology. Such output is valid RDF but intentionally not
conformant to the closed Open Pulse SHACL shapes (a triplestore load
needs no SHACL conformance). Adds `docs/gme-internal.ttl` documenting
the 60 terms.
feat(v2): emit internal fields under the gme-internal RDF namespace
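
The renaming step can be sketched as follows, assuming the document is a plain dict. The function names are hypothetical; the prefix and IRI are the ones quoted in the commit message:

```python
from typing import Any

GME_INTERNAL_PREFIX = "gme-internal"
GME_INTERNAL_IRI = "https://openpulse.science/git-metadata-extractor#"


def rename_internal_keys(node: Any) -> Any:
    """Recursively turn `_x` keys into `gme-internal:x` compact IRIs."""
    if isinstance(node, dict):
        renamed = {}
        for key, value in node.items():
            if key.startswith("_") and len(key) > 1:
                key = f"{GME_INTERNAL_PREFIX}:{key[1:]}"
            renamed[key] = rename_internal_keys(value)
        return renamed
    if isinstance(node, list):
        return [rename_internal_keys(item) for item in node]
    return node


def emit_internal_namespace(document: dict) -> dict:
    """Hypothetical helper: rename keys and register the prefix in @context."""
    out = dict(document)  # shallow copy; nested nodes are rebuilt, not mutated
    for section in ("@graph", "excluded_entities"):
        if section in out:
            out[section] = rename_internal_keys(out[section])
    context = out.get("@context", {})
    if isinstance(context, dict):
        out["@context"] = {**context, GME_INTERNAL_PREFIX: GME_INTERNAL_IRI}
    else:
        # @context may also be a string or a list; append a prefix mapping.
        entries = context if isinstance(context, list) else [context]
        out["@context"] = [*entries, {GME_INTERNAL_PREFIX: GME_INTERNAL_IRI}]
    return out
```

Because the namespace IRI ends in `#`, a JSON-LD processor can expand `gme-internal:avatar_url` to a full IRI instead of dropping the key as an undefined term.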
`infer_github_handle_parents` fuzzy-searched ROR and attached the top-5
token-overlap hits per github org, stamping the highest-scoring one as
the `org:unitOf` parent. In a real deployment this scattered 5 unrelated
ROR orgs per handle and picked the wrong parent in 9 of 10 cases
(`epfl-lasa` -> NCAR, `imaging-plaza` -> "Plaza Community Services").

- New `ror_parent` refiner: an LLM agent picks the single ROR org that
  is genuinely the parent — applying geography and coincidental-token
  rules — or declines. It can only return a candidate ROR id verbatim,
  never an invented one. Mirrors the repository-refiner pattern.
- `infer_github_handle_parents` is now async + agent-driven: it builds a
  recall-friendly candidate shortlist (also mining institution hints
  from the org's description / homepage), hands it to the selector, and
  inserts ONLY the chosen parent — never sibling matches. Declines leave
  the org standalone: a wrong parent is worse than none. rule_based
  runtime keeps a strict deterministic fallback (LLM-free).
- The org agent's `_select_ror_match` now requires a distinctive
  (non-generic) shared token, so coincidental collisions ("Center for
  Digital Trust" <-> "RISM Digital Center") return null and route to the
  selector instead of being finalised.
- Drop stale `pulse:discipline` text from the organization refiner —
  there is no such field on `org:Organization` in Open Pulse v2.1.2.

Live 20-extraction run: wrong ROR 9/10 -> 0/10.
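
The distinctive-token rule can be illustrated with a small sketch. The stoplist below and the function name are assumptions; the actual `_select_ror_match` may define "generic" differently:

```python
import re

# Assumption: an illustrative stoplist of tokens too common to count as evidence.
GENERIC_TOKENS = frozenset({
    "and", "center", "centre", "department", "digital", "for", "group",
    "institute", "lab", "laboratory", "of", "research", "science",
    "sciences", "the", "university",
})


def _tokens(name: str) -> set[str]:
    """Lowercased alphanumeric tokens of an organization name."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def has_distinctive_shared_token(a: str, b: str) -> bool:
    """True only if the two names share at least one non-generic token."""
    shared = _tokens(a) & _tokens(b)
    return bool(shared - GENERIC_TOKENS)
```

"Center for Digital Trust" and "RISM Digital Center" share only `center` and `digital`, both generic under this stoplist, so the pair is rejected rather than finalised.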
The rule-based person agent only looked up an ORCID when one was already
in the context or the GitHub profile — almost never true — so persons
came back with no ORCID at all (0/10 in a deployment test).

When no hint is available but the person is confidently anchored to an
Infoscience identity, search ORCID by the (now confirmed) name.
`_pick_best_orcid_search_hit` accepts only an UNAMBIGUOUS result: a full
multi-token name match resolving to exactly one ORCID iD, with
affiliation corroboration as the tie-breaker. Without an Infoscience
match we do not guess, to avoid attaching a stranger's ORCID — the same
precision-first stance as `_pick_best_infoscience_match`.

Live run: ORCID resolution 0/10 -> 5/10 (every Infoscience-matched
person), with no false positives on the github-only users.
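
A sketch of the precision-first selection rule, assuming each search hit is a dict with `name`, `orcid`, and optional `affiliation` keys. That hit shape and the function name are hypothetical; the real `_pick_best_orcid_search_hit` may differ:

```python
import re
from typing import Optional


def _name_tokens(name: str) -> frozenset[str]:
    """Case-folded alphabetic tokens, so token order does not matter."""
    return frozenset(re.findall(r"[^\W\d_]+", name.casefold()))


def pick_unambiguous_orcid(
    full_name: str,
    hits: list[dict],
    affiliation_hint: Optional[str] = None,
) -> Optional[str]:
    """Return an ORCID iD only when exactly one candidate survives."""
    wanted = _name_tokens(full_name)
    if len(wanted) < 2:
        return None  # single-token names are too ambiguous to search on
    exact = [h for h in hits if _name_tokens(h.get("name", "")) == wanted]
    ids = {h["orcid"] for h in exact}
    if len(ids) == 1:
        return next(iter(ids))
    if len(ids) > 1 and affiliation_hint:
        # Tie-breaker: keep only homonyms whose affiliation matches the hint.
        hint = affiliation_hint.casefold()
        corroborated = {
            h["orcid"] for h in exact
            if hint in h.get("affiliation", "").casefold()
        }
        if len(corroborated) == 1:
            return next(iter(corroborated))
    return None  # declining is preferred over guessing
```

Every ambiguous path returns `None`, matching the stance that a missing ORCID is better than a stranger's.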
A deployment report claimed Infoscience-anchored persons lose their
entire `pulse:owns` list (a "fusion bug"). Driving the real downstream
stage functions — in the order `/v2/extract` runs them for a USER root
in hybrid runtime — proves the claim wrong: `pulse:owns` is preserved
identically whether the person's `@id` is an Infoscience or a github
URL, and the rule-based person agent emits the same owns list either
way. A live 20-extraction run confirmed it.

These tests lock that in: if a future change lets the person's
canonical `@id` scheme leak into `pulse:owns` handling, they fail.
GitHub profile READMEs — `<user>/<user>`'s README and `<org>/.github`'s
`profile/README.md` — are the richest free-text "what is this account"
text GitHub exposes, and routinely name the parent institution. They
were never fetched: only repository extractions pulled a README, so an
org/user extraction had nothing but the short API `description`.

- New `GitHubProvider.get_profile_readme(owner, *, is_organization)` —
  base no-op default + a real implementation (user → `<owner>/<owner>`
  README; org → `GET /repos/{owner}/.github/readme/profile`). Cached,
  graceful on the common 404 (most accounts have no profile README).
- The rule-based org and person agents fetch it directly — alongside
  their existing `get_organization` / `get_user` calls — and stamp it as
  the internal `_profile_readme` field.
- The ROR parent selector now receives it: `_org_context_for_selector`
  carries `profile_readme`, and the selector prompt calls out the
  description / profile_readme as decisive evidence for the parent.
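
A rough sketch of the lookup, using only the standard library. The two REST paths follow the commit message; the function names, the raw media type request, and the omission of caching are simplifications for illustration and not the actual `GitHubProvider` code:

```python
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional


def profile_readme_endpoint(owner: str, *, is_organization: bool) -> str:
    """REST path of the profile README for a user or an organization."""
    safe = urllib.parse.quote(owner, safe="")
    if is_organization:
        # Org profiles live in the special `.github` repo under `profile/`.
        return f"/repos/{safe}/.github/readme/profile"
    # User profiles live in the repo named after the user.
    return f"/repos/{safe}/{safe}/readme"


def fetch_profile_readme(
    owner: str, *, is_organization: bool, token: Optional[str] = None
) -> Optional[str]:
    """Return the README text, or None when the account has none (404)."""
    url = "https://api.github.com" + profile_readme_endpoint(
        owner, is_organization=is_organization
    )
    headers = {"Accept": "application/vnd.github.raw+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        if exc.code == 404:
            return None  # the common case: no profile README
        raise
```

Treating 404 as `None` rather than an error keeps the common no-README case cheap for callers.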

Verified against a real org: `get_profile_readme` returns the 1047-char
Imaging-Plaza profile README, and `OrganizationAgentV2.run()` stamps it.
Full v2 suite: 761 passed.
feat(v2): resolve ROR parents with an LLM selector agent