Add support for load-time data filtering and hierarchical cohorts #160

MahmoodEtedadi · 2025-05-29T15:18:22Z

Overview

Closes #157

Description of changes

Testing Instructions

You can use the following code to generate a parquet file with five columns (h1, ..., h5) that can be used to define hierarchies, plus a loc column that can be used to test high-level filtering.

import pandas as pd
import numpy as np
import seismometer as sm
import itertools
import random

# --- Step 1: Load and replicate predictions base file ---
base_df = pd.read_parquet("data/predictions.parquet")
N_TARGET = 3_000_000
base_len = len(base_df)
factor = (N_TARGET + base_len - 1) // base_len  # ceiling division

# Replicate and reset index
replicated = pd.concat([base_df] * factor, ignore_index=True).iloc[:N_TARGET]

# --- Step 2: Modify ScoringTime to spread timestamps uniformly ---
if "ScoringTime" not in replicated.columns:
    raise ValueError("Expected column 'ScoringTime' not found in predictions.parquet")

offsets = np.linspace(-30000, 30000, N_TARGET, dtype=int)  # ±30 seconds
replicated["ScoringTime"] += pd.to_timedelta(offsets, unit="ms")

# --- Step 3: Generate synthetic hierarchy columns h1–h5 ---
TEST_WIDGET_BEHAVIOR = True  # Set to False to generate full 25⁵ combinations
def make_labels(prefix, n):
    return [f"{prefix}{i+1}" for i in range(n)]
loc_values = make_labels("Loc_", 50)
replicated["loc"] = np.random.choice(loc_values, N_TARGET)
if TEST_WIDGET_BEHAVIOR:
    # Simulate a realistic hierarchy with fewer values per level.
    # A child is valid for a parent when the child's index is divisible by the parent's index.
    def extract_index(label):
        return int(label.split("_")[1])
    
    def build_child_map(child_values, parent_values):
        child_idx = {v: extract_index(v) for v in child_values}
        parent_idx = {v: extract_index(v) for v in parent_values}
        mapping = {
            parent: [child for child in child_values if child_idx[child] % parent_idx[parent] == 0]
            for parent in parent_values
        }
        return mapping
    
    # Precompute valid child label mappings for each parent level
    h1_values = make_labels("H1_", 5)
    h2_values = make_labels("H2_", 10)
    h3_values = make_labels("H3_", 15)
    h4_values = make_labels("H4_", 20)
    h5_values = make_labels("H5_", 25)
    
    h2_map = build_child_map(h2_values, h1_values)
    h3_map = build_child_map(h3_values, h2_values)
    h4_map = build_child_map(h4_values, h3_values)
    h5_map = build_child_map(h5_values, h4_values)
    
    # Stepwise assignment: sample each child from its parent's valid children (Python loops, not vectorized)
    h1 = np.random.choice(h1_values, N_TARGET)
    h2 = np.array([random.choice(h2_map[parent]) for parent in h1])
    h3 = np.array([random.choice(h3_map[parent]) for parent in h2])
    h4 = np.array([random.choice(h4_map[parent]) for parent in h3])
    h5 = np.array([random.choice(h5_map[parent]) for parent in h4])
    
    combo_df = pd.DataFrame({
        "h1": h1,
        "h2": h2,
        "h3": h3,
        "h4": h4,
        "h5": h5,
    })
else:
    values_dict = {
        "h1": make_labels("H1_", 25),
        "h2": make_labels("H2_", 25),
        "h3": make_labels("H3_", 25),
        "h4": make_labels("H4_", 25),
        "h5": make_labels("H5_", 25),
    }

    # Create all combinations (25^5 = 9,765,625)
    all_combos = list(itertools.product(
        values_dict["h1"],
        values_dict["h2"],
        values_dict["h3"],
        values_dict["h4"],
        values_dict["h5"]
    ))

    # Sample N_TARGET rows from the full set of combinations (without replacement)
    combo_df = pd.DataFrame(all_combos, columns=["h1", "h2", "h3", "h4", "h5"])
    combo_df = combo_df.sample(n=N_TARGET, random_state=42).reset_index(drop=True)

# Assign to replicated DataFrame
for col in combo_df.columns:
    replicated[col] = combo_df[col]

# --- Step 4: Save to testfilteringpredictions.parquet ---
replicated.to_parquet("data/testfilteringpredictions.parquet", index=False)

In config.yml, update prediction_path to:

  prediction_path: "testfilteringpredictions.parquet"

Also, update usage_config to add:

  cohorts:
    - source: age
      display_name: Age
    - source: race
      display_name: Race
    - source: gender
      display_name: Gender
    - source: A1Cresult  
      display_name: A1C
    - source: insulin  
      display_name: Taking Insulin
    - source: metformin   
      display_name: Taking Metformin
    - source: loc   
      display_name: Location
    - source: h1   
      display_name: H1
    - source: h2   
      display_name: H2
    - source: h3  
      display_name: H3
    - source: h4   
      display_name: H4
    - source: h5  
      display_name: H5
  # filter data 
  load_time_filters:
    - source: age
      values: ["[10-20)", "70+"]
      action: include
    - source: race
      values: ['Unknown']
      action: exclude
    - source: num_medications
      # values: [10] # test with either/both values and range provided
      range:
        min: 10
        max: 30
      action: include
    - source: loc 
      action: keep_top
  cohort_hierarchies:
    - name: hierarchy
      column_order: ['h1','h2','h3','h4','h5']
    # - name: hierarchy
    #   column_order: ['h1', 'h2']

Functional Testing

Run the notebook. In the first code cell, comment out sm.download_example_dataset('diabetes-v2') so that it does not overwrite your data.

Verify that the load-time filters were applied:
- sg.dataframe["age"] contains only "[10-20)" and "70+" (include filter)
- sg.dataframe["race"] does not contain "Unknown"
- sg.dataframe["num_medications"] contains only values between 10 and 30, inclusive (or whatever result matches your config). Note that values, if provided, takes precedence over range.
  - Also verify that 10 appears in the column to confirm that the lower bound is inclusive
- If keep_top was applied after other filters:
  - Confirm that sg.dataframe["loc"].nunique() <= 25
Confirm that:
```
sg.available_cohort_groups
```
reflects only the values remaining after filtering.
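The checks above can be collected into one helper so they are easy to re-run after each config change. This is a sketch under the sample config shown earlier: check_load_time_filters is a hypothetical name, the expected values are hard-coded from that config, and the 25-value limit for loc is taken from the keep_top check above.

```python
import pandas as pd


def check_load_time_filters(df: pd.DataFrame, max_loc_values: int = 25) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if not set(df["age"].unique()) <= {"[10-20)", "70+"}:
        problems.append("age contains values outside the include list")
    if (df["race"] == "Unknown").any():
        problems.append("race still contains 'Unknown'")
    # Series.between is inclusive on both ends by default.
    if not df["num_medications"].between(10, 30).all():
        problems.append("num_medications has values outside [10, 30]")
    if df["loc"].nunique() > max_loc_values:
        problems.append("loc has more distinct values than keep_top allows")
    return problems


good = pd.DataFrame({
    "age": ["[10-20)", "70+"],
    "race": ["Asian", "Caucasian"],
    "num_medications": [10, 30],
    "loc": ["Loc_1", "Loc_2"],
})
bad = good.assign(race=["Unknown", "Asian"], num_medications=[9, 30])
```

In the notebook, check_load_time_filters(sg.dataframe) should return an empty list when the sample config is in effect.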

UI Behavior

All configured cohort dropdowns should appear in the notebook
Hierarchical fields (h1 → h2 → h3 → h4 → h5) are displayed in a single row with arrows
Confirm filtering behavior:
- Selecting a value in h1 correctly updates options shown in h2
- Selecting a value in h2 updates options shown in h3, and so on through h5
Cohorts not part of a hierarchy (e.g., age, gender) appear in the normal layout below
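To know what each child dropdown should show, you can compute the expected options directly from the data. The sketch below is a hypothetical helper (child_options is not part of seismometer) that maps every parent value to the child values that co-occur with it; compare its output with what the widget offers after a selection.

```python
import pandas as pd


def child_options(df: pd.DataFrame, parent_col: str, child_col: str) -> dict:
    """Map each parent value to the sorted child values that co-occur with it."""
    return (
        df.groupby(parent_col)[child_col]
        .unique()        # one array of distinct children per parent
        .apply(sorted)   # lexicographic order, so "H2_10" sorts before "H2_2"
        .to_dict()
    )


demo = pd.DataFrame({
    "h1": ["H1_1", "H1_1", "H1_2"],
    "h2": ["H2_2", "H2_1", "H2_2"],
})
expected = child_options(demo, "h1", "h2")
```

For the generated test file, child_options(sg.dataframe, "h1", "h2")["H1_2"] should list only H2 labels with even indices, since a child is valid only when its index is divisible by the parent's.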

Tool Integration

Confirm that the following tools function correctly with the filtered and structured data:
- ExploreModelEvaluation
- show_cohort_summaries
- Any other downstream analysis or visualizations
No errors should occur during rendering or filtering.

Update config files (cohort hierarchies and load time filters) to test various scenarios.

Performance Testing

Set:
```
TEST_WIDGET_BEHAVIOR = False
```
This samples the hierarchy columns from all 25⁵ (9,765,625) combinations instead of the smaller, realistic hierarchy.
Re-run the notebook and confirm:
- UI responsiveness remains acceptable
- Filtering behavior still works
- No visual components fail due to excessive cardinality
- sm.start_up does not take too long.
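To put a number on startup time rather than judging it by feel, a small timing wrapper is enough. This is a generic sketch: timed is a hypothetical helper, and in the notebook you would pass it whatever startup call you normally run.

```python
import time
from typing import Any, Callable


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; replace with the notebook's startup call.
total, seconds = timed(sum, range(1000))
```

Recording the elapsed time with TEST_WIDGET_BEHAVIOR set to True and then False gives a direct before/after comparison for the high-cardinality case.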

Author Checklist

Linting passes; run early with pre-commit hook.
Tests added for new code and issue being fixed.
Added type annotations and full numpy-style docstrings for new methods.
Draft your news fragment in new changelog/ISSUE.TYPE.rst files; see changelog/README.md.

…ListWidget

… parameter

…gracefully if min/max and values are not comparable (<)