
@MahmoodEtedadi MahmoodEtedadi commented May 29, 2025

Overview

Closes #157

Description of changes

Testing Instructions

You can use the following code to generate a parquet file with five hierarchy columns (h1, ..., h5) for defining cohort hierarchies, plus a loc column for testing high-level filtering.

import pandas as pd
import numpy as np
import seismometer as sm
import itertools
import random

# --- Step 1: Load and replicate predictions base file ---
base_df = pd.read_parquet("data/predictions.parquet")
N_TARGET = 3_000_000
base_len = len(base_df)
factor = (N_TARGET + base_len - 1) // base_len  # ceiling division

# Replicate and reset index
replicated = pd.concat([base_df] * factor, ignore_index=True).iloc[:N_TARGET]

# --- Step 2: Modify ScoringTime to spread timestamps uniformly ---
if "ScoringTime" not in replicated.columns:
    raise ValueError("Expected column 'ScoringTime' not found in predictions.parquet")

offsets = np.linspace(-30000, 30000, N_TARGET, dtype=int)  # ±30 seconds
replicated["ScoringTime"] += pd.to_timedelta(offsets, unit="ms")

# --- Step 3: Generate a synthetic loc column and hierarchy columns h1–h5 ---
TEST_WIDGET_BEHAVIOR = True  # Set to False to sample from all 25⁵ combinations


def make_labels(prefix, n):
    """Return n labels: prefix1, prefix2, ..., prefixN."""
    return [f"{prefix}{i+1}" for i in range(n)]


# Flat location column used to test high-level (keep_top) filtering
loc_values = make_labels("Loc_", 50)
replicated["loc"] = np.random.choice(loc_values, N_TARGET)

if TEST_WIDGET_BEHAVIOR:
    # Simulate a realistic hierarchy with fewer values per level: a child is valid
    # under a parent when the child's index is divisible by the parent's index.
    def extract_index(label):
        return int(label.split("_")[1])
    
    def build_child_map(child_values, parent_values):
        child_idx = {v: extract_index(v) for v in child_values}
        parent_idx = {v: extract_index(v) for v in parent_values}
        mapping = {
            parent: [child for child in child_values if child_idx[child] % parent_idx[parent] == 0]
            for parent in parent_values
        }
        return mapping
    
    # Precompute valid child label mappings for each parent level
    h1_values = make_labels("H1_", 5)
    h2_values = make_labels("H2_", 10)
    h3_values = make_labels("H3_", 15)
    h4_values = make_labels("H4_", 20)
    h5_values = make_labels("H5_", 25)
    
    h2_map = build_child_map(h2_values, h1_values)
    h3_map = build_child_map(h3_values, h2_values)
    h4_map = build_child_map(h4_values, h3_values)
    h5_map = build_child_map(h5_values, h4_values)
    
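    # Stepwise assignment: draw h1 with NumPy, then pick each child per row from
    # its parent's valid list (Python-level loops, so this is slow for 3M rows)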
    h1 = np.random.choice(h1_values, N_TARGET)
    h2 = np.array([random.choice(h2_map[parent]) for parent in h1])
    h3 = np.array([random.choice(h3_map[parent]) for parent in h2])
    h4 = np.array([random.choice(h4_map[parent]) for parent in h3])
    h5 = np.array([random.choice(h5_map[parent]) for parent in h4])
    
    combo_df = pd.DataFrame({
        "h1": h1,
        "h2": h2,
        "h3": h3,
        "h4": h4,
        "h5": h5,
    })
else:
    values_dict = {
        "h1": make_labels("H1_", 25),
        "h2": make_labels("H2_", 25),
        "h3": make_labels("H3_", 25),
        "h4": make_labels("H4_", 25),
        "h5": make_labels("H5_", 25),
    }

    # Create all combinations (25^5 = 9,765,625); this list is memory-heavy
    all_combos = list(itertools.product(
        values_dict["h1"],
        values_dict["h2"],
        values_dict["h3"],
        values_dict["h4"],
        values_dict["h5"]
    ))

    # Sample N_TARGET distinct combinations (sampling is without replacement)
    combo_df = pd.DataFrame(all_combos, columns=["h1", "h2", "h3", "h4", "h5"])
    combo_df = combo_df.sample(n=N_TARGET, random_state=42).reset_index(drop=True)

# Assign to replicated DataFrame
for col in combo_df.columns:
    replicated[col] = combo_df[col]

# --- Step 4: Save to testfilteringpredictions.parquet ---
replicated.to_parquet("data/testfilteringpredictions.parquet", index=False)
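
Before loading the file into the notebook, it may help to confirm that the generated hierarchy follows the divisibility rule described in the script. The sketch below is one way to check that; check_hierarchy is a hypothetical helper written for this comment, not part of seismometer, and the small frame stands in for combo_df or replicated.

```python
import pandas as pd


def extract_index(label: str) -> int:
    """Return the numeric suffix of a label such as 'H2_4'."""
    return int(label.split("_")[1])


def check_hierarchy(df: pd.DataFrame, columns: list[str]) -> bool:
    """Return True when, for each adjacent (parent, child) column pair,
    every child's index is divisible by its parent's index."""
    for parent_col, child_col in zip(columns, columns[1:]):
        parent_idx = df[parent_col].map(extract_index)
        child_idx = df[child_col].map(extract_index)
        if not (child_idx % parent_idx == 0).all():
            return False
    return True


# Tiny hand-built example; with the real data, pass combo_df or replicated
# and the full column list ["h1", "h2", "h3", "h4", "h5"].
sample = pd.DataFrame({
    "h1": ["H1_1", "H1_2", "H1_5"],
    "h2": ["H2_3", "H2_4", "H2_10"],
    "h3": ["H3_9", "H3_12", "H3_10"],
})
hierarchy_ok = check_hierarchy(sample, ["h1", "h2", "h3"])
print(hierarchy_ok)
```

This check only applies when TEST_WIDGET_BEHAVIOR is True; the full-combination branch does not follow the divisibility rule.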

In config.yml, update prediction_path to:

  prediction_path: "testfilteringpredictions.parquet"

Also, update usage_config to add:

  cohorts:
    - source: age
      display_name: Age
    - source: race
      display_name: Race
    - source: gender
      display_name: Gender
    - source: A1Cresult
      display_name: A1C
    - source: insulin
      display_name: Taking Insulin
    - source: metformin
      display_name: Taking Metformin
    - source: loc
      display_name: Location
    - source: h1
      display_name: H1
    - source: h2
      display_name: H2
    - source: h3
      display_name: H3
    - source: h4
      display_name: H4
    - source: h5
      display_name: H5
  # Filter data at load time
  load_time_filters:
    - source: age
      values: ["[10-20)", "70+"]
      action: include
    - source: race
      values: ['Unknown']
      action: exclude
    - source: num_medications
      # values: [10] # test with either/both values and range provided
      range:
        min: 10
        max: 30
      action: include
    - source: loc 
      action: keep_top
  cohort_hierarchies:
    - name: hierarchy
      column_order: ['h1','h2','h3','h4','h5']
    # - name: hierarchy
    #   column_order: ['h1', 'h2']
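
To make the expected effect of load_time_filters concrete, here is a rough pandas sketch of the semantics as the testing notes describe them. It is not seismometer's implementation: apply_filter is a hypothetical helper, the keep_top count of 25 is inferred from the nunique() check in the functional tests, and range bounds are assumed inclusive.

```python
import pandas as pd


def apply_filter(df, source, action, values=None, range_=None, top_n=25):
    """Apply one load-time-style filter with plain pandas (illustrative only)."""
    col = df[source]
    if action == "keep_top":
        # Assumption: keep rows whose value is among the top_n most frequent.
        keep = col.value_counts().nlargest(top_n).index
        return df[col.isin(keep)]
    if values is not None:
        # Per the PR notes, explicit values take precedence over a range.
        mask = col.isin(values)
    elif range_ is not None:
        # Bounds are treated as inclusive, matching the testing notes.
        mask = col.between(range_["min"], range_["max"])
    else:
        raise ValueError(f"filter on {source!r} needs values or range")
    if action == "include":
        return df[mask]
    if action == "exclude":
        return df[~mask]
    raise ValueError(f"unknown action {action!r}")


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "age": ["[10-20)", "70+", "[30-40)", "70+"],
    "race": ["Unknown", "Caucasian", "Asian", "Asian"],
    "num_medications": [10, 30, 15, 31],
    "loc": ["Loc_1", "Loc_1", "Loc_2", "Loc_3"],
})
filtered = apply_filter(demo, "age", "include", values=["[10-20)", "70+"])
filtered = apply_filter(filtered, "race", "exclude", values=["Unknown"])
filtered = apply_filter(filtered, "num_medications", "include",
                        range_={"min": 10, "max": 30})
filtered = apply_filter(filtered, "loc", "keep_top", top_n=1)
print(filtered)
```

Applying the filters in sequence, as here, means keep_top ranks values by frequency among the rows that survived the earlier filters.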

Functional Testing

Run the notebook. Before running, comment out sm.download_example_dataset('diabetes-v2') in the first code cell (so it reads # sm.download_example_dataset('diabetes-v2')) to prevent overwriting your data.

  • Verify that the load-time filters were applied:

    • sg.dataframe["age"] contains only "[10-20)" and "70+" if the include filter worked
    • sg.dataframe["race"] does not contain "Unknown"
    • sg.dataframe["num_medications"] contains only values between 10 and 30 (inclusive), or whatever result matches your config. Note that values, if provided, takes precedence over range.
      • Also verify that 10 appears in the column to confirm that the lower bound is inclusive
    • If keep_top was applied after the other filters:
      • Confirm that sg.dataframe["loc"].nunique() <= 25
  • Confirm that:

    sg.available_cohort_groups

    reflects only the values remaining after filtering.
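
The checks above can be run in one notebook cell. This is a minimal sketch, assuming sg is the seismogram object referenced in the list and that the config matches the example (the 25-location limit comes from the keep_top check); check_load_time_filters is a hypothetical helper, and the small frame is a stand-in for sg.dataframe.

```python
import pandas as pd


def check_load_time_filters(df: pd.DataFrame, max_locs: int = 25) -> dict:
    """Return a name -> bool map, one entry per functional check."""
    meds = df["num_medications"]
    return {
        "age_included": set(df["age"].unique()) <= {"[10-20)", "70+"},
        "race_excluded": "Unknown" not in set(df["race"].unique()),
        "meds_in_range": bool(meds.between(10, 30).all()),
        "meds_lower_bound_present": bool((meds == 10).any()),
        "loc_keep_top": bool(df["loc"].nunique() <= max_locs),
    }


# Stand-in frame; in the notebook, call check_load_time_filters(sg.dataframe).
demo = pd.DataFrame({
    "age": ["[10-20)", "70+", "70+"],
    "race": ["Caucasian", "Asian", "Asian"],
    "num_medications": [10, 22, 30],
    "loc": ["Loc_1", "Loc_2", "Loc_1"],
})
results = check_load_time_filters(demo)
for name, passed in results.items():
    print(f"{name}: {'ok' if passed else 'FAILED'}")
```

If the config uses values instead of range for num_medications, the two range-based checks need to be adjusted accordingly.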

UI Behavior

  • All configured cohort dropdowns should appear in the notebook
  • Hierarchical fields (h1 → h2 → h3 → h4 → h5) are displayed in a single row with arrows
  • Confirm filtering behavior:
    • Selecting a value in h1 correctly updates options shown in h2
    • Selecting a value in h2 updates options shown in h3, and so on through h5
  • Cohorts not part of a hierarchy (e.g., age, gender) appear in the normal layout below
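
To know which options each dropdown should show after a parent selection, the expected values can be computed directly from the data. The sketch below is an assumption about the intended cascading behavior, not the widget's code; child_options is a hypothetical helper, and the small frame stands in for sg.dataframe.

```python
import pandas as pd


def child_options(df, selections, child_col):
    """Return sorted child values that co-occur with the selected parents.

    selections maps a parent column to the list of chosen values;
    an empty list means no restriction at that level.
    """
    mask = pd.Series(True, index=df.index)
    for col, chosen in selections.items():
        if chosen:
            mask &= df[col].isin(chosen)
    return sorted(df.loc[mask, child_col].unique())


# Made-up rows that follow the divisibility rule from the generator script.
demo = pd.DataFrame({
    "h1": ["H1_1", "H1_1", "H1_2", "H1_2"],
    "h2": ["H2_1", "H2_3", "H2_2", "H2_4"],
    "h3": ["H3_1", "H3_3", "H3_2", "H3_8"],
})
h2_opts = child_options(demo, {"h1": ["H1_2"]}, "h2")
h3_opts = child_options(demo, {"h1": ["H1_2"], "h2": ["H2_4"]}, "h3")
print(h2_opts, h3_opts)
```

Comparing these lists with what the h2 and h3 dropdowns display gives a concrete pass/fail criterion for the cascading behavior.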

Tool Integration

  • Confirm that the following tools function correctly with the filtered and structured data:

    • ExploreModelEvaluation
    • show_cohort_summaries
    • Any other downstream analysis or visualizations

    No errors should occur during rendering or filtering.

Update config files (cohort hierarchies and load time filters) to test various scenarios.

Performance Testing

  • Set:
    TEST_WIDGET_BEHAVIOR = False
  • This generates all 25⁵ (9,765,625) combinations and samples N_TARGET rows from them; re-run the script to regenerate the parquet file
  • Re-run the notebook and confirm:
    • UI responsiveness remains acceptable
    • Filtering behavior still works
    • No visual components fail due to excessive cardinality
    • sm.start_up does not take too long.
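
To put numbers on "too long", startup and other slow cells can be wrapped in a small timer. This is a generic sketch: timed is a hypothetical helper, and the placeholder workload should be replaced with the notebook's startup call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record wall-clock seconds for the enclosed block under results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("startup", timings):
    # Placeholder workload; in the notebook, put the startup call here.
    sum(range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Recording the same timing with TEST_WIDGET_BEHAVIOR set to True and to False gives a baseline for judging whether the high-cardinality case is acceptable.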

Author Checklist

  • Linting passes; run early with pre-commit hook.
  • Tests added for new code and issue being fixed.
  • Added type annotations and full numpy-style docstrings for new methods.
  • Draft your news fragment in new changelog/ISSUE.TYPE.rst files; see changelog/README.md.

@MahmoodEtedadi MahmoodEtedadi marked this pull request as draft May 29, 2025 15:18
@MahmoodEtedadi MahmoodEtedadi marked this pull request as ready for review June 2, 2025 17:29
@MahmoodEtedadi MahmoodEtedadi requested review from diehlbw and gbowlin June 3, 2025 20:29
@MahmoodEtedadi MahmoodEtedadi requested a review from gbowlin June 17, 2025 18:50
@MahmoodEtedadi MahmoodEtedadi requested a review from diehlbw June 19, 2025 18:47
@MahmoodEtedadi MahmoodEtedadi requested review from diehlbw and gbowlin July 7, 2025 20:07
@MahmoodEtedadi MahmoodEtedadi requested a review from gbowlin July 9, 2025 17:23
gbowlin previously approved these changes Jul 9, 2025
@MahmoodEtedadi MahmoodEtedadi requested a review from diehlbw August 1, 2025 15:24
@MahmoodEtedadi MahmoodEtedadi requested a review from diehlbw August 1, 2025 19:30
Development

Successfully merging this pull request may close these issues.

High level (e.g., organization) data filtering