feat: optional term_filter for run-inference and latest-inference-cohort endpoint #201

Draft

chapmanhk wants to merge 9 commits into develop from feat/inference-cohort-selection

Conversation

@chapmanhk (Contributor) commented Feb 2, 2026

changes

To review only the latest commit: see Checkpoint term support and cohort_labels below.

Run-inference: optional term_filter list (API → Databricks)

  • API (routers/models.py)

    • InferenceRunRequest: added optional term_filter: list[str] | None = None.
    • When term_filter is present: require at least one non-empty label; reject empty list (400) and empty or whitespace-only labels (400), with logging before each raise.
    • Build DatabricksInferenceRunRequest with term_filter=req.term_filter and pass it to run_pdp_inference.
    • After a successful run, log once: user, term_filter (cohorts), and job_run_id.
    • Docstring updated to describe the optional term_filter field.
    • Batch-not-found error message fixed to use len(batch_result) instead of len(inst_result).
    • Renamed res to inference_run_response in trigger_inference_run for clarity.
  • Databricks (databricks.py)

    • DatabricksInferenceRunRequest: added optional term_filter: list[str] | None = None.
    • In run_pdp_inference(): build job_params dict; when req.term_filter is not None, set job_params["term_filter"] = json.dumps(req.term_filter). Omit the key when term_filter is None.
    • Improved logging: use %-style and include exception in log messages for job lookup and job run failures.
  • Tests (routers/models_test.py)

    • No term_filter in request → 200 and mock called with term_filter=None.
    • Single and multiple cohorts in term_filter → 200 and mock receives the correct list.
    • Empty list, empty string in list, and whitespace-only label in term_filter → 400 with the expected detail messages.

Latest-inference-cohort endpoint and config from Databricks

  • API (routers/data.py)

    • New GET /{inst_id}/latest-inference-cohort returning LatestInferenceCohortResponse (cohort_labels, valid_student_count, status, batch_name, reason). Always returns 200; uses status='invalid' with reason when no batch, missing config, or no cohort has course data.
    • Optional query params:
      • model_name: if omitted, the institution must have exactly one registered model (valid, non-deleted); otherwise the response is invalid. When provided, the model must exist for the institution, or the endpoint returns invalid.
      • batch_name: when provided (URL-decoded), cohort selection uses that batch; when omitted, the most recent batch that has both STUDENT and COURSE files is used.
    • Config is read from the model's training run in the silver volume via DatabricksControl.read_volume_training_config(inst.name, model_run_id). model_run_id is the latest model version's run ID for the chosen model.
    • New/refactored helpers: _resolve_model_name_and_run_id_for_inference (resolve model name, validate existence, fetch run ID); _load_batch_and_dataframes_for_inference (resolve batch, load config, read batch files, validate STUDENT/COURSE); get_batch_for_inference; get_batch_by_name_with_student_and_course; _batch_has_student_and_course. Existing cohort logic is split into _latest_inference_cohort_validation_response, _first_valid_from_ordered_terms, and _resolve_latest_inference_cohort_from_dataframes (≤50 lines per function). Mypy type-narrowing assertions were added where optional values are guaranteed after error checks.
  • Config (config.py)

    • Added ENV_TO_VOLUME_SCHEMA = {"DEV": "dev_sst_02", "STAGING": "staging_sst_01"} for Databricks volume paths. Used by databricks.read_volume_training_config and front_end_tables.get_model_cards (replacing a local dict there).
  • Databricks (databricks.py)

    • New read_volume_training_config(inst_name, model_run_id): reads from /Volumes/{env_schema}/{slug}_silver/silver_volume/{model_run_id}/, finds any .toml under that path that contains [preprocessing.selection], and returns that section only (or None if missing/unreadable). Uses ENV_TO_VOLUME_SCHEMA; returns None for unsupported ENV (e.g. LOCAL). New helpers: _parse_config_toml_to_selection(raw: bytes) for TOML parsing and section extraction, and _find_selection_in_toml_under(w, directory_path, inst_name) to list directory and scan .toml files. Guard for response.contents is None before calling .read() (mypy-safe).
  • Front-end (routers/front_end_tables.py)

    • get_model_cards now uses ENV_TO_VOLUME_SCHEMA from config instead of a local schema dict.
  • Tests

    • routers/data_test.py: Tests for get_latest_inference_cohort (unauthorized; no model_name and no models / models exist but none registered; missing cohort column/term/config; valid; valid with batch_name; model_name not found; batch_name not found; loop-back when no course data for most recent; no cohort has course data; missing student id column; zero students meeting criteria; student_criteria references missing column); for get_latest_batch_with_student_and_course (returns batch when it has STUDENT and COURSE, returns None when no batch or no batch has both); and for apply_student_criteria_count, filter_to_most_recent_checkpoint_term, get_ordered_checkpoint_terms.
    • databricks_test.py: Tests for _parse_config_toml_to_selection (valid TOML with section, invalid/missing section); for read_volume_training_config (empty inst_name, empty model_run_id, ENV not DEV/STAGING, databricksify raises, WorkspaceClient raises, directory list raises, download raises, TOML missing [preprocessing.selection], success when .toml found under run dir). Added return type -> list[DirectoryEntry] to _one_toml_entry for mypy.

Checkpoint term support and cohort_labels (latest commit)

  • API (routers/data.py)

    • Config normalization: Added _normalize_selection_config_for_inference() so pipeline config can use checkpoint_term in student_criteria; at inference entry it is normalized to cohort_term once. Downstream code and DB column stay cohort_term. _allowed_checkpoint_terms_from_config now reads only cohort_term (config is always normalized first).
    • Naming: Renamed inference-path naming to "checkpoint term": e.g. get_ordered_checkpoint_terms, filter_to_most_recent_checkpoint_term, _try_checkpoint_term_candidate, _first_valid_from_ordered_terms, CHECKPOINT_TERM_REQUIRED_MSG, _TERM_ORDER, _UNKNOWN_TERM_SORT_RANK. Column and schema remain cohort_term.
    • Response model: LatestInferenceCohortResponse renames cohort_label to cohort_labels: list[str]; all construction sites updated. The user-facing message for the missing column says "cohort term" (matching the column); other messages keep "checkpoint term."
    • Logic: Defensive fallback message in filter_to_most_recent_checkpoint_term; explicit fields on early "Institution not found" response. Extracted _try_checkpoint_term_candidate, _ordered_checkpoint_terms_from_cohort_column, and _ordered_checkpoint_terms_from_entry_year for single-purpose helpers and 50-line limit.
  • Databricks (databricks.py)

    • _parse_config_toml_to_selection: accepts checkpoint_term in config (normalized to cohort_term by API); behavior unchanged for cohort_term.
  • Tests

    • routers/data_test.py: Normalizer (valid, non-dict student_criteria); checkpoint_term in config; cohort_term as string in config; cohort_term empty list (valid, count 0); response contract (cohort_labels list).
    • databricks_test.py: test_parse_config_toml_to_selection_accepts_checkpoint_term; return type -> None on the other parse test.
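A minimal sketch of the checkpoint_term -> cohort_term normalization described above. The function name mirrors the PR's description; the implementation details (copy semantics, non-dict handling) are assumptions.

```python
from typing import Any


def normalize_selection_config_for_inference(
    selection_config: dict[str, Any],
) -> dict[str, Any]:
    """Rewrite student_criteria so checkpoint_term becomes cohort_term.

    Per the PR: if both keys exist, checkpoint_term wins, and downstream
    helpers only ever see cohort_term. A non-dict student_criteria is
    returned unchanged; the input config is not mutated.
    """
    out = dict(selection_config)
    criteria = out.get("student_criteria")
    if not isinstance(criteria, dict):
        return out
    criteria = dict(criteria)  # shallow copy so the caller's config is untouched
    if "checkpoint_term" in criteria:
        criteria["cohort_term"] = criteria.pop("checkpoint_term")
    out["student_criteria"] = criteria
    return out
```

Normalizing once at the inference entry point means the DB column and every downstream helper stay aligned on cohort_term.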

context

  • Run-inference term_filter: Callers need to be able to choose which cohorts (e.g. "fall 2024-25") to run inference on. This PR adds an optional term_filter list to the run-inference request and passes it through to the Databricks job so the pipeline can filter by those terms/cohorts. When term_filter is omitted, no term_filter key is sent to Databricks and behavior is unchanged (pipeline uses its config default).

  • Latest-inference-cohort: The UI needs a single "latest inference-ready cohort" (label, valid student count, status) for an institution. This endpoint supports optional model_name and batch_name so the UI can show a specific model/batch or rely on defaults (single registered model, latest batch with student+course files). It reads the chosen model's training/preprocessing selection config from Databricks (silver volume, per model run) and returns the first cohort term that has course data and meets the config criteria. Refactors and tests align with project standards (e.g. function length, naming, error-path coverage).
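Returning the "first cohort term that has course data" presumes an ordering over term labels. One plausible sketch of such an ordering follows; the PR's actual _TERM_ORDER values, unknown-term rank, and label format are assumptions here.

```python
# Hypothetical ordering for cohort-term labels like "FALL 2024-25".
# Seasons are ranked chronologically within an academic year
# (FALL 2024-25 falls in autumn 2024; SPRING 2024-25 in spring 2025).
_TERM_ORDER = {"FALL": 0, "WINTER": 1, "SPRING": 2, "SUMMER": 3}
_UNKNOWN_TERM_SORT_RANK = -1  # unknown seasons count as least recent in their year


def term_sort_key(label: str) -> tuple[str, int]:
    """Sort key: academic year first, then season rank within that year."""
    season, _, year = label.partition(" ")
    return (year, _TERM_ORDER.get(season.upper(), _UNKNOWN_TERM_SORT_RANK))


def ordered_checkpoint_terms(labels: list[str]) -> list[str]:
    """Return labels most-recent-first, assuming consistently formatted
    academic years (so the lexicographic year comparison is valid)."""
    return sorted(labels, key=term_sort_key, reverse=True)
```

The endpoint would then walk this ordered list and return the first term whose students actually have course data, which is the "loop-back" behavior the tests exercise.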

  • Checkpoint term and cohort_labels: Pipelines may specify the inference cohort via checkpoint_term in config; the API normalizes this to cohort_term at entry so the rest of the stack stays column-aligned. The response now exposes cohort_labels (list) for consistency and future multi-cohort use. Naming ("checkpoint term") is aligned with product language while keeping the DB column name cohort_term.

questions

  • None.

chapmanhk and others added 4 commits January 30, 2026 10:00
- Add GET /institutions/{inst_id}/latest-inference-cohort endpoint returning
  LatestInferenceCohortResponse (status, cohort_label, valid_student_count,
  batch_name, reason). Always 200; status=invalid with reason when no batch,
  missing config, missing cohort_term, or no cohort has course data.
- Define "most recent cohort" as most recent cohort term (e.g. Fall 2024,
  Spring 2025). Require cohort_term column; support cohort+cohort_term or
  entry_year+cohort_term. Use config preprocessing.selection.cohort_term to
  filter allowed terms. Loop back to next cohort term when current has no
  course data for students.
- Add DatabricksControl.read_volume_training_config to read config.toml from
  institution bronze volume and return [preprocessing.selection] for cohort
  term filtering and student_criteria.
- Add get_ordered_cohort_terms, filter_to_most_recent_cohort,
  _resolve_latest_inference_cohort_from_dataframes, and helpers; integrate
  apply_student_criteria_count with course-data restriction.
- Add unit tests for get_ordered_cohort_terms and filter_to_most_recent_cohort;
  API tests for latest-inference-cohort (unauthorized, missing cohort/term/config,
  valid, loop-back, missing student id, valid_count=0, missing criteria column);
  apply_student_criteria_count tests; tighten response shape assertions.
- Remove unused imports (joinedload, LatestInferenceCohortResponse); apply black.
- Add optional cohort: list[str] to InferenceRunRequest and DatabricksInferenceRunRequest
- Validate cohort when provided: reject empty list and empty/whitespace-only labels
- Serialize cohort as JSON in job_parameters when set; omit key when None (backward compatible)
- Log user, cohorts, and job_run_id once in API after successful run
- Fix batch error message to use len(batch_result) instead of len(inst_result)
- Add tests: backward compat (no cohort), single/multiple cohorts, empty list, empty string, whitespace-only
- Centralize ENV_TO_VOLUME_SCHEMA in config; use in front_end_tables and databricks
- Add read_volume_training_config and _parse_config_toml_to_selection for volume TOML
- Refactor _resolve_latest_inference_cohort_from_dataframes into helpers (≤50 lines)
- Improve run_pdp_inference logging (%-style, exception context); guard None response.contents
- Rename res to inference_run_response in trigger_inference_run
- Add tests: get_latest_batch when no batch has both STUDENT and COURSE; read_volume_training_config error paths

Co-authored-by: Cursor <cursoragent@cursor.com>
@chapmanhk chapmanhk requested a review from Mesh-ach February 3, 2026 17:43
chapmanhk and others added 2 commits February 3, 2026 13:52
…ection

- Add optional batch_name query param; resolve batch by name or latest with student+course
- Resolve model from single registered (valid, non-deleted) when model_name omitted
- Validate model exists for institution when model_name is provided
- Read training config from Databricks volumes for cohort selection
- Refactor endpoint into focused helpers; add mypy type-narrowing assertions
- Add tests for batch_name, model/batch validation, and registered-model resolution

Co-authored-by: Cursor <cursoragent@cursor.com>
@chapmanhk chapmanhk changed the title from "feat: Pass optional cohort list to run-inference API and forward to Databricks." to "feat: optional cohort for run-inference and latest-inference-cohort endpoint" Feb 3, 2026
- Add term_filter to InferenceRunRequest and DatabricksInferenceRunRequest
- Pass term_filter in job_params to Databricks job
- Update models router validation and tests
- Extend databricks tests and remove unused import

Co-authored-by: Cursor <cursoragent@cursor.com>
@chapmanhk chapmanhk changed the title from "feat: optional cohort for run-inference and latest-inference-cohort endpoint" to "feat: optional term_filter for run-inference and latest-inference-cohort endpoint" Feb 3, 2026
chapmanhk and others added 2 commits February 17, 2026 16:22
- Normalize selection config at inference entry (checkpoint_term -> cohort_term)
- Rename inference path to checkpoint term (constants, helpers, docstrings)
- Change LatestInferenceCohortResponse.cohort_label to cohort_labels (list)
- Extract _try_checkpoint_term_candidate, _ordered_checkpoint_terms_from_*
  for single-purpose helpers and 50-line limit; add _UNKNOWN_TERM_SORT_RANK
- Add tests: normalizer edge cases, cohort_term string/empty list, response
  contract; databricks checkpoint_term parse; type hint on test

Co-authored-by: Cursor <cursoragent@cursor.com>
@chapmanhk chapmanhk requested review from kaylawilding and removed request for Mesh-ach and nm3224 February 18, 2026 00:26
) -> Dict[str, Any]:
"""Normalize selection_config so student_criteria uses column name cohort_term for inference.

If checkpoint_term is present in student_criteria, set cohort_term to that value and
@kaylawilding (Contributor) commented Feb 18, 2026:

Sorry, what is checkpoint_term here? Is it meant to be the term config field we added listing the terms for inference?
If so, a couple of thoughts: it shouldn't be a term the way the cohort_term field is (simply FALL or SPRING); instead it also includes the year, so it will be like FALL 2023-24, e.g.

return out


def _latest_inference_cohort_validation_response(

We should talk about how to square this with the fact that we're using terms now; do we want both?



def _first_valid_from_ordered_terms(
ordered_checkpoint_terms: List[Dict[str, Any]],

Ah, I see you are doing both; I will need to change the config to terms, I believe.

remove checkpoint_term so downstream (get_ordered_*, apply_student_criteria_count)
only see cohort_term. If both keys exist, prefer checkpoint_term.
"""
criteria = selection_config.get("student_criteria")

Let's make sure this is not the student criteria field.

@chapmanhk chapmanhk marked this pull request as draft February 18, 2026 19:15
@kaylawilding (Contributor) left a comment:

adding comments but this is on hold
