Conversation
- Added new entries to .gitignore for development and internal files.
- Updated `pyproject.toml` to include `logfire` and modified dependency specifications.
- Enhanced `uv.lock` with new package versions and added `babel`, `backrefs`, and `ghp-import` packages.
Extracts and stores extra DSpace fields on OrganizationRecord so the v2 org_resolver can match labs by acronym instead of free-text search.
- acronym (from `oairecerif.acronym`), infoscience_code, unit_code, parent_acronym, director_name, org_type_dspace
- in-place schema migration adds the six columns
…ub org classifier
- org_resolver refiner pass: resolves unmatched orgs against a pre-fetched federated bundle (Infoscience DuckDB + GitHub + ROR + RAG) and reclassifies lab-flavoured Persons to Organizations via org:unitOf
- rule_based agents stamp rich provider metadata as `_`-prefixed internal fields (GitHub/ORCID/ROR/Infoscience profile data); stripped before SHACL/JSON-LD, surfaced via `?include_internal_fields=true`
- GitHub URL classifier probes the REST API to disambiguate user-vs-organization for single-segment URLs
- repo fanout caps for big orgs/users with a references-only fallback
- README fetched via REST (real content, not the 138-byte tagline)
- pulse:owns preserves well-formed external repo IRIs through reconciliation and dangling-ref pruning
- promote 6 API-verified institutional orgs into the seed: epfl-vita, EPFLiGHT, eth-sri, disco-eth, liri-uzh, irc-hslu
- add CERN-related discovery search terms to the switzerland scope
EXPERIMENTAL — proof-of-concept third v2 runtime alongside `llm` and `rule_based`. Delegates JSON-LD production to a terminal agent (pi.dev) operating on a per-repo temporary working directory. Includes per-entity agent prompts (.pi/agents/), the terminal + subagent runners, and the pi.dev extension. Not wired into the default pipeline — committed for review and further experimentation only.
Resolves the 10 v2-ci-gates failures introduced by this branch:
- agent schemas (organization/person/repository) now allow `_`-prefixed internal metadata fields via `patternProperties`; they are a deliberate feature, stripped before SHACL/JSON-LD
- fanout tests opt into eager owned-repo expansion with `V2_EXPAND_OWNED_REPOS=true` (expansion is now opt-in, i.e. off by default)
- `_DummyORCIDProvider` test double implements the `search_persons` abstract method
- `test_config` uses a genuinely invalid runtime value (`hybrid` is now a valid `AgentRuntime`)
- URL-classifier tests pin `_probe_account_type` so they are deterministic and offline; adds coverage for the org-probe path
- new docs/communities-index.md: architecture, config, CLI, schema
- zenodo-index.md: scope now resolves slugs from the communities index by parent_org; document community_ids / primary_community_id columns
- rag-indices.md: note the communities support index
- mkdocs nav entry
- .env.example: document V2_EXPAND_OWNED_REPOS
`v2-models-check` flagged drift — the agent schemas gained `patternProperties` for `^_` internal fields. Only the recorded schema fingerprints change; the generated model code is unaffected.
The v1 regression tests were moved into `tests/v1/` but the workflow still referenced the old top-level `tests/test_*.py` paths, so the step errored with "file or directory not found" (exit 4). This was masked until now by the earlier v2 pytest failures.
The v1 modules read `GITHUB_TOKEN` from the environment at import time; without it pytest collection errors with `KeyError: 'GITHUB_TOKEN'`. Inject the built-in Actions token. Surfaced once the path fix let the step actually collect.
`test_v1_root_response_shape_is_unchanged` pinned the literal version `v2.0.1`; the package is now `v3.0.0`. A shape test should assert the stable title prefix, not a version that goes stale every release.
…sweep CERN, languagenet (EPFL CS-552), usisoftware-org (USI Lugano). The wider discover-orgs sweep produced 64 regex-verified candidates, but on inspection 61 were fuzzy-match false-positives (foreign universities, unrelated companies, personal accounts) — only these 3 are genuine.
v2 hybrid pipeline: refiners, communities index, index enrichment
…o excluded_entities
- `include_internal_fields` added to `V2ExtractRequest` (POST /v2/extract) and wired into `_run_extract_job`; it was missing from the POST path, and only the GET endpoint supported it
- all five `V2ExtractRequest` fields now carry OpenAPI/Swagger descriptions
- `build_jsonld_output` now strips `_`-prefixed internal fields from `excluded_entities` too (previously only `@graph` honoured the flag), so `include_internal_fields=false` yields zero `_` fields anywhere and `true` keeps them everywhere: a true all-or-nothing toggle
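The all-or-nothing behaviour described above can be illustrated with a minimal sketch. This is not the project's `build_jsonld_output`; the function names, the recursive descent into nested values, and the document shape are assumptions made for illustration.

```python
from typing import Any


def strip_internal_fields(node: Any) -> Any:
    """Recursively drop dict keys that start with '_' (hypothetical helper)."""
    if isinstance(node, dict):
        return {
            key: strip_internal_fields(value)
            for key, value in node.items()
            if not key.startswith("_")
        }
    if isinstance(node, list):
        return [strip_internal_fields(item) for item in node]
    return node


def apply_internal_fields_toggle(document: dict, include_internal_fields: bool) -> dict:
    """Apply the flag to both `@graph` and `excluded_entities`, never only one."""
    if include_internal_fields:
        return document
    cleaned = dict(document)  # shallow copy so the caller's document is untouched
    for section in ("@graph", "excluded_entities"):
        if section in cleaned:
            cleaned[section] = strip_internal_fields(cleaned[section])
    return cleaned
```

Treating both sections in one loop keeps the two code paths from drifting apart again.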
… orchestrator
`V2_EXPAND_OWNED_REPOS=false` was only checked in the orchestrator's materialisation step. But `context_gather` already iterated every owned repo of a user/org and ran V1 gimie on each, so submitting a user/org URL still triggered 30-90 sequential gimie sub-extractions regardless of the flag, blocking the POST handler for minutes (no job_id returned).
- `should_expand_owned_repos()` is now the single source of truth, defined in `context_gather`; the orchestrator imports it (no more divergence)
- the per-repo gimie loops in the user and organization branches of `gather_context` are gated by it: when off (default), owned repos are kept only as `pulse:owns` references and no gimie runs
- regression test asserts `get_repository` is never called with the flag off; existing eager-path tests opt in with `V2_EXPAND_OWNED_REPOS=true`
Verified: `github.com/sdsc-ordes` (88 repos) now returns job_id in 1.5s and completes in 30ms with zero per-repo gimie calls.
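A rough sketch of the gating pattern, assuming the flag is read from the environment and parsed case-insensitively. The accepted truthy spellings, `gather_owned_repos`, and the `fetch_repository` callback are hypothetical stand-ins, not the project's actual code.

```python
import os

# Assumption: which spellings count as "on" is not stated in the commit message.
_TRUTHY = {"1", "true", "yes", "on"}


def should_expand_owned_repos() -> bool:
    """Single place that interprets V2_EXPAND_OWNED_REPOS (off by default)."""
    return os.environ.get("V2_EXPAND_OWNED_REPOS", "false").strip().lower() in _TRUTHY


def gather_owned_repos(repo_urls, fetch_repository):
    """Always keep references; run the per-repo fetch only when the flag is on."""
    references = [{"@id": url} for url in repo_urls]
    if not should_expand_owned_repos():
        return references, []  # no per-repo extraction at all
    return references, [fetch_repository(url) for url in repo_urls]
```

Both the gather step and any later materialisation step would call the same `should_expand_owned_repos()`, so the two cannot disagree.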
…d-internal-fields fix(v2): owned-repo gimie iteration + include_internal_fields on POST
`tools/image/Dockerfile` copied `src`, `pyproject.toml` and the gunicorn conf but never `config/`, so `/app/config` did not exist in the built image. Every RAG index provider then failed `load_config()` on the missing `config/index/*.yaml` and silently disabled itself — disciplines, Infoscience/OpenAlex/ROR/etc. enrichment all went dark. The 14 index YAMLs are git-tracked code-config; deploy-specific values (Qdrant URL, tokens, scope) remain env-overridable via INDEX_QDRANT_URL, RCP_TOKEN, INDEX_*_SCOPE — so baking the configs in does not reduce redeploy flexibility.
fix(image): copy config/ into the deployment image
`clear()` ran `DELETE FROM responses` with no WHERE clause, which triggers SQLite's truncate optimization — under it `cursor.rowcount` returns 0 regardless of how many rows were actually removed. So `POST /v2/cache/clear` always reported "Cleared 0 entries" even when it had wiped the whole cache, making it look like the clear was a no-op. Count the rows before the delete so the returned number is truthful.
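A minimal sketch of the count-then-delete approach using the standard `sqlite3` module. The function name and the `responses` schema used to exercise it are assumptions; whether `rowcount` is 0 after an unqualified `DELETE` may depend on the SQLite build, and counting first gives a truthful number either way.

```python
import sqlite3


def clear_responses(conn: sqlite3.Connection) -> int:
    """Delete every cached response and return how many rows were removed."""
    with conn:  # commits on success, rolls back if the delete raises
        (count,) = conn.execute("SELECT COUNT(*) FROM responses").fetchone()
        conn.execute("DELETE FROM responses")
    return count
```

The count comes from an explicit `SELECT COUNT(*)`, so the reported figure no longer relies on what the driver reports for the `DELETE` itself.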
fix(v2 cache): report real deleted count from ProviderCache.clear()
`persist_record()` wrote every community a record lists into the `record_communities` link table, but only the scope bootstrap pass ever populated the `communities` master table. Records routinely reference communities outside the crawl scope, so those ids ended up in `record_communities` with no matching master row — observed 152 orphans against 3 master rows. Adds `ZenodoStore.ensure_community()` (insert-if-absent) and calls it for every referenced community in `persist_record`. `ON CONFLICT DO NOTHING` means a stub never overwrites richer metadata written by the bootstrap's `upsert_community`, regardless of ingest order.
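The order-independence claim can be illustrated with a small sketch. It assumes a SQLite-style store and a hypothetical two-column `communities` table; the real `ZenodoStore` schema and its `upsert_community` are not shown in this conversation.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = "CREATE TABLE communities (id TEXT PRIMARY KEY, title TEXT)"


def ensure_community(conn: sqlite3.Connection, community_id: str) -> None:
    """Insert a stub row if absent; never touch an existing row."""
    conn.execute(
        "INSERT INTO communities (id) VALUES (?) ON CONFLICT(id) DO NOTHING",
        (community_id,),
    )


def upsert_community(conn: sqlite3.Connection, community_id: str, title: str) -> None:
    """Bootstrap-style write: create the row or overwrite its metadata."""
    conn.execute(
        "INSERT INTO communities (id, title) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET title = excluded.title",
        (community_id, title),
    )
```

Whether the stub or the rich row arrives first, the final row carries the richer metadata.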
fix(zenodo): stop orphaning communities referenced by record_communities
Companion to the orphan fix: `ensure_community` stops *new* orphans, but existing deployments already have community ids in `record_communities` with no master row. The `query` subcommand is read-only (SELECT/WITH only — INSERT is a forbidden keyword), so a one-off SQL fix is not possible through it. `python -m src.index.zenodo backfill-communities` selects every `record_communities.community_id` absent from the `communities` master and inserts a stub row for each — instant, no network, idempotent.
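A sketch of what such a backfill could look like, assuming SQLite-style tables with an `id` key on `communities`. The function name and column layout are illustrative, not the subcommand's actual implementation.

```python
import sqlite3


def backfill_communities(conn: sqlite3.Connection) -> int:
    """Insert a stub master row for every orphaned community id; return the count."""
    missing = conn.execute(
        "SELECT DISTINCT rc.community_id FROM record_communities AS rc "
        "WHERE NOT EXISTS (SELECT 1 FROM communities AS c WHERE c.id = rc.community_id)"
    ).fetchall()
    with conn:  # commit the stubs together
        conn.executemany("INSERT INTO communities (id) VALUES (?)", missing)
    return len(missing)
```

Running it a second time finds nothing left to insert, which is what makes the operation idempotent.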
feat(zenodo): add `backfill-communities` CLI subcommand
`include_internal_fields=true` previously kept the raw `_`-prefixed keys (`_avatar_url`, `_bio`, …) verbatim. Those are undefined JSON-LD terms, so any RDF conversion silently drops them — the payload was JSON-visible but not loadable into a triplestore. Now each `_x` key is renamed to `gme-internal:x` and the `gme-internal` prefix (`https://openpulse.science/git-metadata-extractor#`) is registered in `@context`, so the document expands to real IRI triples. Applies to both `@graph` nodes and `excluded_entities`. This is a separate auxiliary vocabulary — it does not touch the Open Pulse ontology. Such output is valid RDF but intentionally not conformant to the closed Open Pulse SHACL shapes (a triplestore load needs no SHACL conformance). Adds `docs/gme-internal.ttl` documenting the 60 terms.
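The renaming step might look roughly like the following. The prefix and IRI come from the commit message; the function names, the recursive traversal, and the handling of non-dict `@context` values are assumptions.

```python
from typing import Any

GME_INTERNAL_PREFIX = "gme-internal"
GME_INTERNAL_IRI = "https://openpulse.science/git-metadata-extractor#"


def namespace_internal_fields(node: Any) -> Any:
    """Rename every `_x` key to `gme-internal:x`, descending into nested values."""
    if isinstance(node, dict):
        renamed = {}
        for key, value in node.items():
            if key.startswith("_") and len(key) > 1:
                key = f"{GME_INTERNAL_PREFIX}:{key[1:]}"
            renamed[key] = namespace_internal_fields(value)
        return renamed
    if isinstance(node, list):
        return [namespace_internal_fields(item) for item in node]
    return node


def emit_with_internal_namespace(document: dict) -> dict:
    """Register the prefix in `@context` and rename keys in both sections."""
    out = dict(document)
    prefix_entry = {GME_INTERNAL_PREFIX: GME_INTERNAL_IRI}
    context = out.get("@context", {})
    if isinstance(context, dict):
        out["@context"] = {**context, **prefix_entry}
    elif isinstance(context, list):
        out["@context"] = [*context, prefix_entry]
    else:  # a bare IRI string: JSON-LD allows an array of contexts
        out["@context"] = [context, prefix_entry]
    for section in ("@graph", "excluded_entities"):
        if section in out:
            out[section] = namespace_internal_fields(out[section])
    return out
```

With the prefix registered, a key such as `gme-internal:bio` expands to a full IRI under the auxiliary namespace instead of being dropped as an undefined term.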
feat(v2): emit internal fields under the gme-internal RDF namespace
`infer_github_handle_parents` fuzzy-searched ROR and attached the top-5
token-overlap hits per github org, stamping the highest-scoring one as
the `org:unitOf` parent. A real deployment showed this scatter 5
unrelated ROR orgs per handle and pick the wrong parent 9/10 times
(`epfl-lasa` -> NCAR, `imaging-plaza` -> "Plaza Community Services").
- New `ror_parent` refiner: an LLM agent picks the single ROR org that
is genuinely the parent — applying geography and coincidental-token
rules — or declines. It can only return a candidate ROR id verbatim,
never an invented one. Mirrors the repository-refiner pattern.
- `infer_github_handle_parents` is now async + agent-driven: it builds a
recall-friendly candidate shortlist (also mining institution hints
from the org's description / homepage), hands it to the selector, and
inserts ONLY the chosen parent — never sibling matches. Declines leave
the org standalone: a wrong parent is worse than none. rule_based
runtime keeps a strict deterministic fallback (LLM-free).
- The org agent's `_select_ror_match` now requires a distinctive
(non-generic) shared token, so coincidental collisions ("Center for Digital
Trust" <-> "RISM Digital Center") return null and route to the
selector instead of being finalised.
- Drop stale `pulse:discipline` text from the organization refiner —
there is no such field on `org:Organization` in Open Pulse v2.1.2.
Live 20-extraction run: wrong ROR 9/10 -> 0/10.
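The distinctive-token requirement can be sketched as a set intersection with generic words removed. The `GENERIC_TOKENS` list and the function names here are hypothetical; the real `_select_ror_match` may tokenise and weight names differently.

```python
import re

# Assumption: an illustrative stop-list, not the project's actual one.
GENERIC_TOKENS = {
    "center", "centre", "digital", "institute", "lab", "laboratory",
    "university", "research", "for", "of", "the", "and",
}


def name_tokens(name: str) -> set:
    """Lower-cased alphanumeric tokens of an organization name."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def has_distinctive_overlap(org_name: str, ror_name: str) -> bool:
    """True only if the names share at least one token that is not generic."""
    shared = name_tokens(org_name) & name_tokens(ror_name)
    return bool(shared - GENERIC_TOKENS)
```

For the pair quoted above, the only shared tokens are `center` and `digital`, both generic in this sketch, so the match would be rejected and passed on to the selector.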
The rule-based person agent only looked up an ORCID when one was already in the context or the GitHub profile — almost never true — so persons came back with no ORCID at all (0/10 in a deployment test). When no hint is available but the person is confidently anchored to an Infoscience identity, search ORCID by the (now confirmed) name. `_pick_best_orcid_search_hit` accepts only an UNAMBIGUOUS result: a full multi-token name match resolving to exactly one ORCID iD, with affiliation corroboration as the tie-breaker. Without an Infoscience match we do not guess, to avoid attaching a stranger's ORCID — the same precision-first stance as `_pick_best_infoscience_match`. Live run: ORCID resolution 0/10 -> 5/10 (every Infoscience-matched person), with no false positives on the github-only users.
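One way to read the "unambiguous result" rule is sketched below. The hit shape (`name`, `orcid`, `affiliations`), the function name, and the substring-based affiliation check are assumptions; the iDs in any usage are made-up placeholders.

```python
import re
from typing import Optional


def _tokens(name: str) -> set:
    return set(re.findall(r"\w+", name.lower()))


def pick_unambiguous_orcid(
    confirmed_name: str,
    hits: list,
    affiliation_hint: Optional[str] = None,
) -> Optional[str]:
    """Return an ORCID iD only when exactly one candidate survives; else None."""
    wanted = _tokens(confirmed_name)
    if len(wanted) < 2:
        return None  # single-token names are too ambiguous to search on
    full_matches = [hit for hit in hits if _tokens(hit["name"]) == wanted]
    orcids = {hit["orcid"] for hit in full_matches}
    if len(orcids) == 1:
        return orcids.pop()
    if affiliation_hint:
        hint = affiliation_hint.lower()
        corroborated = {
            hit["orcid"]
            for hit in full_matches
            if any(hint in aff.lower() for aff in hit.get("affiliations", []))
        }
        if len(corroborated) == 1:
            return corroborated.pop()
    return None  # zero or several candidates: do not guess
```

Returning `None` whenever more than one iD remains reflects the precision-first stance: a missing ORCID is preferred over a stranger's.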
A deployment report claimed Infoscience-anchored persons lose their entire `pulse:owns` list (a "fusion bug"). Driving the real downstream stage functions — in the order `/v2/extract` runs them for a USER root in hybrid runtime — proves the claim wrong: `pulse:owns` is preserved identically whether the person's `@id` is an Infoscience or a github URL, and the rule-based person agent emits the same owns list either way. A live 20-extraction run confirmed it. These tests lock that in: if a future change lets the person's canonical `@id` scheme leak into `pulse:owns` handling, they fail.
GitHub profile READMEs — `<user>/<user>`'s README and `<org>/.github`'s
`profile/README.md` — are the richest free-text "what is this account"
text GitHub exposes, and routinely name the parent institution. They
were never fetched: only repository extractions pulled a README, so an
org/user extraction had nothing but the short API `description`.
- New `GitHubProvider.get_profile_readme(owner, *, is_organization)` —
base no-op default + a real implementation (user → `<owner>/<owner>`
README; org → `GET /repos/{owner}/.github/readme/profile`). Cached,
graceful on the common 404 (most accounts have no profile README).
- The rule-based org and person agents fetch it directly — alongside
their existing `get_organization` / `get_user` calls — and stamp it as
the internal `_profile_readme` field.
- The ROR parent selector now receives it: `_org_context_for_selector`
carries `profile_readme`, and the selector prompt calls out the
description / profile_readme as decisive evidence for the parent.
Verified against a real org: `get_profile_readme` returns the 1047-char
Imaging-Plaza profile README, and `OrganizationAgentV2.run()` stamps it.
Full v2 suite: 761 passed.
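The two lookup paths and the graceful 404 can be sketched without any network access. The helper names are hypothetical; the endpoint paths follow the commit message, and the base64 `content` field reflects how GitHub's README endpoints return file bodies in their JSON responses.

```python
import base64
from typing import Optional


def profile_readme_endpoint(owner: str, *, is_organization: bool) -> str:
    """REST path for an account's profile README."""
    if is_organization:
        # Organization profile: profile/README.md inside the `.github` repository.
        return f"/repos/{owner}/.github/readme/profile"
    # User profile: the README of the `<owner>/<owner>` repository.
    return f"/repos/{owner}/{owner}/readme"


def decode_profile_readme(status_code: int, payload: dict) -> Optional[str]:
    """Return README text, or None when the account has no profile README."""
    if status_code == 404:
        return None  # the common case: most accounts have no profile README
    if payload.get("encoding") == "base64":
        return base64.b64decode(payload["content"]).decode("utf-8")
    return payload.get("content")
```

A caller would pass the path to whatever HTTP client the provider already uses and cache the decoded result, including the `None` case.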
feat(v2): resolve ROR parents with an LLM selector agent
No description provided.