
Feat/imputation improvements #74

Merged
jorgeMFS merged 26 commits into main from feat/imputation-improvements
Aug 10, 2025
Conversation


@jorgeMFS jorgeMFS commented Aug 9, 2025

Summary by Sourcery

Add comprehensive imputation‐bias diagnostics throughout the pipeline, including a new bias report function, CLI/GUI configuration options, report visualization enhancements, imputation mask tracking, mapping improvements, and accompanying example scripts and tests.

New Features:

  • Add imputation-bias diagnostic metric to QC reports, CLI, GUI, and batch pipeline
  • Introduce CLI flags and GUI controls to enable bias diagnostics and configure SMD, variance ratio, and KS thresholds
  • Implement imputation_bias_report function computing standardized mean differences, variance ratios, KS‐test triggers, and warning flags
  • Integrate traffic‐light status bar in PDF/Markdown reports for quick visualization of bias warnings
  • Extend central quality_metrics registry to include imputation_bias
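To make the new metric concrete, here is a minimal, numpy/pandas-only sketch of the idea behind `imputation_bias_report` for a single column: compare observed values against imputed values via standardized mean difference, variance ratio, and a KS statistic. The function names, thresholds, and the use of a raw KS statistic (rather than the KS-test p-value the PR describes) are illustrative assumptions, not the PR's actual API:

```python
import numpy as np
import pandas as pd

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic (max CDF gap), numpy only."""
    a, b = np.sort(np.asarray(a, float)), np.sort(np.asarray(b, float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side='right') / len(a)
    cdf_b = np.searchsorted(b, grid, side='right') / len(b)
    return float(np.abs(cdf_a - cdf_b).max())

def imputation_bias_sketch(original, imputed, smd_threshold=0.10,
                           var_ratio_low=0.5, var_ratio_high=2.0,
                           ks_threshold=0.3):
    """Compare observed vs. imputed values for one column (illustrative)."""
    observed = original.dropna()
    imputed_vals = imputed[original.isna()]
    if len(observed) < 2 or len(imputed_vals) < 2:
        return None  # nothing meaningful to compare
    sd = observed.std(ddof=1)
    if sd == 0:
        return None
    smd = (imputed_vals.mean() - observed.mean()) / sd
    var_ratio = imputed_vals.var(ddof=1) / observed.var(ddof=1)
    ks = ks_statistic(observed, imputed_vals)
    warn = (abs(smd) > smd_threshold
            or not (var_ratio_low <= var_ratio <= var_ratio_high)
            or ks > ks_threshold)
    return {'smd': float(smd), 'var_ratio': float(var_ratio),
            'ks_stat': ks, 'warn': bool(warn)}
```

With mean imputation, the imputed values are constant, so the variance ratio collapses toward zero and the warning fires even though the SMD stays near zero.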

Enhancements:

  • Track per-cell imputation mask in missing_data engine to support bias diagnostics
  • Allow direct ontology identifier pass-through in mapping (e.g., HP:0001250)
  • Refactor example and test scripts with consistent naming and paths for end-to-end scenarios

Documentation:

  • Update README with new end-to-end and quality metrics example scripts

Tests:

  • Add tests for imputation_bias_report covering mean, KNN, MICE, SVD imputers and warning thresholds

jorgeMFS added 17 commits August 9, 2025 20:05
…: track imputation mask; feat(batch): include bias results in *_qc_summary.json
…: controls for bias diag; test: add basic imputation bias tests
…taset, dedicated schema/config, and QC JSON validation
…_cli_test; use output/end_to_end and e2e_* filenames; improve naming across script
…e2e_medium_cli_test (medium), e2e_small_quality_metrics_cli_test (small); standardize output folder names; update imputation params script outputs
….py wrapper to e2e_medium; keep standardized naming while preserving legacy entry points
…gnostics in PDF/MD and GUI; pass thresholds to report; fix PDF indentation issues
…as thresholds at top-level in session_state; avoid list-index TypeError when enabling bias
… Quality when enabled; pass enabled metrics from pipeline
…or quality_metrics and thresholds; fix report section gating; improve ontology mapping (ID indexing + direct ID pass-through)
…d config/schema under scripts/; scenarios now run end-to-end

sourcery-ai bot commented Aug 9, 2025

Reviewer's Guide

This PR implements a comprehensive imputation-bias diagnostic workflow by introducing a dedicated reporting function, tracking per-cell imputation masks, enhancing mapping logic for ontology IDs, extending the QC report generation with a bias diagnostics section (including traffic-light status bars), and exposing new CLI/GUI controls and example scripts to configure and demonstrate end-to-end bias checks.

Sequence diagram for imputation-bias diagnostics in batch QC pipeline

```mermaid
sequenceDiagram
    participant CLI
    participant BatchProcessor
    participant MissingDataEngine
    participant QualityMetrics
    participant ReportGenerator
    CLI->>BatchProcessor: Start batch_process (with bias thresholds)
    BatchProcessor->>MissingDataEngine: Impute missing data
    MissingDataEngine->>BatchProcessor: Return imputed_df, imputation_mask
    BatchProcessor->>QualityMetrics: imputation_bias_report(original_df, imputed_df, imputation_mask, thresholds)
    QualityMetrics-->>BatchProcessor: Return bias diagnostics DataFrame
    BatchProcessor->>ReportGenerator: generate_qc_report(..., bias_diagnostics, bias_thresholds)
    ReportGenerator-->>BatchProcessor: QC report with bias diagnostics
```

Class diagram for imputation bias diagnostics and mask tracking

```mermaid
classDiagram
    class MissingDataEngine {
        +imputation_mask: Dict[str, pd.Series]
        +fit_transform(df: pd.DataFrame) pd.DataFrame
        +_update_imputation_mask(before: pd.DataFrame, after: pd.DataFrame, cols: List[str])
        _apply_mean(...)
        _apply_median(...)
        _apply_mode(...)
        _apply_knn(...)
        _apply_mice(...)
        _apply_svd(...)
    }
    class BiasRow {
        +column: str
        +n_obs: int
        +n_imp: int
        +mean_diff: float
        +smd: float
        +var_ratio: float
        +ks_stat: Optional[float]
        +ks_p: Optional[float]
        +warn: bool
    }
    class QualityMetrics {
        +imputation_bias_report(...): pd.DataFrame
    }
    MissingDataEngine --> "imputation_mask" BiasRow
    QualityMetrics --> "imputation_bias_report" BiasRow
```
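The BiasRow record in the diagram above translates directly to a Python dataclass. A sketch with field names taken from the diagram (the defaults and the example values are illustrative, not the PR's code):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class BiasRow:
    """One row of the imputation-bias report: per-column diagnostics."""
    column: str          # column name
    n_obs: int           # count of originally observed values
    n_imp: int           # count of imputed values
    mean_diff: float     # raw mean difference (imputed - observed)
    smd: float           # standardized mean difference
    var_ratio: float     # imputed variance / observed variance
    ks_stat: Optional[float] = None  # KS statistic, if computed
    ks_p: Optional[float] = None     # KS p-value, if computed
    warn: bool = False   # any threshold triggered

# A report is then just a list of these, convertible to a DataFrame:
row = BiasRow(column="age", n_obs=140, n_imp=60,
              mean_diff=0.02, smd=0.12, var_ratio=0.4, warn=True)
```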

Class diagram for enhanced OntologyMapper mapping logic

```mermaid
classDiagram
    class OntologyMapper {
        +PREFIX_ALIASES: dict
        +DEFAULT_ID_REGEX: dict
        +_id_regex: dict
        +_alt_to_primary: Dict[str, Dict[str, str]]
        +map_term(term, target_ontologies, custom_mappings) Dict[str, Optional[str]]
        +_extract_direct_id(term_norm)
        +_normalize_text(text: str)
        +_build_id_regex_from_config(cfg: dict)
        +_scan_alt_map_obo(ontology_file_path: str)
        +_augment_alt_ids_from_obo(ontology_file_path: str, mapping: Dict[str, str])
    }
    OntologyMapper --> "_alt_to_primary" Ontology
```

File-Level Changes

Add imputation-bias reporting function and pipeline integration
  • Define BiasRow dataclass and imputation_bias_report in quality_metrics.py
  • Invoke imputation_bias_report in batch_processing.py and attach bias_diagnostics & thresholds to QC report payload
  • Populate QC summary JSON with bias_rows under quality_metrics.imputation_bias
src/phenoqc/quality_metrics.py
src/phenoqc/batch_processing.py
Track per-cell imputation mask in the missing_data engine
  • Capture before/after DataFrame slices for each imputation strategy
  • Implement _update_imputation_mask to flag newly imputed cells
  • Store imputation_mask on the engine for downstream bias analysis
src/phenoqc/missing_data.py
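The mask-tracking change amounts to diffing NaN positions before and after each imputation pass. A minimal sketch of the `_update_imputation_mask` idea (the helper name and dict-of-Series shape mirror the description above; details of the engine's real implementation are an assumption):

```python
import numpy as np
import pandas as pd

def update_imputation_mask(mask, before, after, cols):
    """Flag cells that were NaN before imputation and filled afterwards.

    mask: dict mapping column name -> boolean Series (True = imputed cell).
    Accumulates across multiple imputation passes via logical OR.
    """
    for col in cols:
        newly_imputed = before[col].isna() & after[col].notna()
        if col in mask:
            mask[col] = mask[col] | newly_imputed
        else:
            mask[col] = newly_imputed
    return mask
```

Downstream, the bias report can then select exactly the imputed cells per column instead of re-deriving them from the original frame.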
Enhance OntologyMapper to support direct IDs, normalization, and alt_id indexing
  • Add PREFIX_ALIASES and DEFAULT_ID_REGEX with _build_id_regex_from_config
  • Introduce _normalize_text for term normalization
  • Extend parse_ontology to index alt_id/xref and add _augment_alt_ids_from_obo and _scan_alt_map_obo
  • Revamp map_term to extract direct IDs, apply prefix mapping, and fall back to fuzzy matching
src/phenoqc/mapping.py
tests/test_mapping.py
Extend QC report generation with bias diagnostics and status visuals
  • Add bias_diagnostics and bias_thresholds parameters to generate_qc_report
  • Insert PDF section showing triggers and traffic-light table for top variables
  • Insert Markdown section rendering triggers and a markdown table for bias diagnostics
src/phenoqc/reporting.py
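The "triggers and traffic-light table" can be pictured as a small pure function mapping one variable's diagnostics to a status label. This is purely illustrative of the idea (the PR's actual rules, labels, and thresholds are not shown in this thread), and it doubles as an example of the shared trigger-reason helper the reviewer suggests extracting:

```python
def traffic_light(smd, var_ratio, ks_p, *, smd_threshold=0.10,
                  var_ratio_low=0.5, var_ratio_high=2.0, ks_alpha=0.05):
    """Return (label, triggers) for one variable's bias diagnostics.

    green  = no trigger fired
    amber  = exactly one trigger fired
    red    = two or more triggers fired
    """
    triggers = []
    if abs(smd) > smd_threshold:
        triggers.append("SMD")
    if not (var_ratio_low <= var_ratio <= var_ratio_high):
        triggers.append("variance ratio")
    if ks_p is not None and ks_p < ks_alpha:
        triggers.append("KS p-value")
    if not triggers:
        return "green", triggers
    return ("red" if len(triggers) >= 2 else "amber"), triggers
```

Keeping this logic in one place lets the PDF builder, Markdown exporter, and GUI all render the same reasons from the same function.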
Expose CLI/GUI configuration flags and update examples/tests
  • Add --bias-* flags to CLI and pass thresholds into batch processing
  • Add GUI controls to enable bias diagnostics and sync thresholds to session_state
  • Rename and add end-to-end and quality-metrics scripts under scripts/, update unified_scenarios_test
  • Add test_imputation_bias.py and update README with new example scripts
src/phenoqc/cli.py
src/phenoqc/gui/gui.py
scripts/
tests/test_imputation_bias.py
README.md



codecov-commenter commented Aug 9, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 38.98305% with 252 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/phenoqc/reporting.py | 2.45% | 113 Missing and 6 partials ⚠️ |
| src/phenoqc/gui/gui.py | 0.00% | 60 Missing ⚠️ |
| src/phenoqc/mapping.py | 71.12% | 31 Missing and 10 partials ⚠️ |
| src/phenoqc/batch_processing.py | 6.25% | 15 Missing ⚠️ |
| src/phenoqc/missing_data.py | 68.00% | 4 Missing and 4 partials ⚠️ |
| src/phenoqc/quality_metrics.py | 88.63% | 4 Missing and 1 partial ⚠️ |
| src/phenoqc/cli.py | 0.00% | 4 Missing ⚠️ |
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
| Files with missing lines | Coverage Δ |
|---|---|
| src/phenoqc/cli.py | 0.00% <0.00%> (ø) |
| src/phenoqc/quality_metrics.py | 85.29% <88.63%> (+1.16%) ⬆️ |
| src/phenoqc/missing_data.py | 65.54% <68.00%> (+9.45%) ⬆️ |
| src/phenoqc/batch_processing.py | 16.39% <6.25%> (-0.59%) ⬇️ |
| src/phenoqc/mapping.py | 68.37% <71.12%> (+4.41%) ⬆️ |
| src/phenoqc/gui/gui.py | 0.00% <0.00%> (ø) |
| src/phenoqc/reporting.py | 37.97% <2.45%> (-11.26%) ⬇️ |
sourcery-ai[bot]
sourcery-ai bot previously requested changes Aug 9, 2025

@sourcery-ai sourcery-ai bot left a comment


Hey @jorgeMFS - I've reviewed your changes - here's some feedback:

Blocking issues:

  • Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)

General comments:

  • I see the trigger‐reason logic (SMD, variance ratio, KS p‐value) is copied in the PDF report builder, Markdown exporter, and GUI—consider extracting it to a shared utility so any future tweaks stay in sync.
  • The code for parsing whether bias diagnostics are enabled (list vs dict, top‐level vs nested under quality_metrics) is scattered across CLI, GUI, and batch processing—centralizing that parsing into one helper would reduce subtle inconsistencies.
  • In imputation_bias_report you use sqrt(obs_var) as the pooled standard deviation—double-check this against the conventional two‐sample SMD formula or document this choice to avoid confusion about how SMD is calculated.
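For reference, the conventional two-sample SMD (the Cohen's-d denominator the reviewer alludes to) pools both groups' variances, whereas dividing by the observed-group SD alone gives a different number whenever the imputed values have low variance. A toy numeric illustration, not the PR's code:

```python
import numpy as np

def pooled_sd(a, b):
    """Conventional two-sample pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    return float(np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                         / (na + nb - 2)))

obs = np.array([1.0, 2.0, 3.0, 4.0])   # observed values
imp = np.array([4.0, 4.0, 4.0])        # constant imputed values (zero variance)

# Observed-SD-only denominator vs. pooled denominator:
smd_obs_only = (imp.mean() - obs.mean()) / obs.std(ddof=1)
smd_pooled = (imp.mean() - obs.mean()) / pooled_sd(obs, imp)
```

Because the zero-variance imputed group drags the pooled SD down, the pooled SMD comes out larger here; whichever convention the function uses, documenting it avoids the confusion the reviewer describes.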
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- I see the trigger‐reason logic (SMD, variance ratio, KS p‐value) is copied in the PDF report builder, Markdown exporter, and GUI—consider extracting it to a shared utility so any future tweaks stay in sync.
- The code for parsing whether bias diagnostics are enabled (list vs dict, top‐level vs nested under quality_metrics) is scattered across CLI, GUI, and batch processing—centralizing that parsing into one helper would reduce subtle inconsistencies.
- In `imputation_bias_report` you use sqrt(obs_var) as the pooled standard deviation—double-check this against the conventional two‐sample SMD formula or document this choice to avoid confusion about how SMD is calculated.

## Individual Comments

### Comment 1
<location> `src/phenoqc/mapping.py:185` </location>
<code_context>
                 mappings[ontology_id] = custom_id
             return mappings

+        # If the input already looks like an ontology ID, accept it directly for the matching ontology
+        # e.g., 'hp:0001250', 'doid:9352', 'mp:0000001'
+        if term_lower.startswith('hp:'):
+            direct = {'HPO': term.replace('hp:', 'HP:')}
+            # fill non-targets with None below
+        elif term_lower.startswith('doid:'):
+            direct = {'DO': term.replace('doid:', 'DOID:')}
+        elif term_lower.startswith('mp:'):
+            direct = {'MPO': term.replace('mp:', 'MP:')}
+        else:
+            direct = {}
+
</code_context>

<issue_to_address>
Direct ID pass-through logic may not handle case variations or alternative prefixes.

The current logic only matches lowercase prefixes. To improve robustness, normalize input casing or allow for additional prefix formats.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        # If the input already looks like an ontology ID, accept it directly for the matching ontology
        # e.g., 'hp:0001250', 'doid:9352', 'mp:0000001'
        if term_lower.startswith('hp:'):
            direct = {'HPO': term.replace('hp:', 'HP:')}
            # fill non-targets with None below
        elif term_lower.startswith('doid:'):
            direct = {'DO': term.replace('doid:', 'DOID:')}
        elif term_lower.startswith('mp:'):
            direct = {'MPO': term.replace('mp:', 'MP:')}
        else:
            direct = {}
=======
        # If the input already looks like an ontology ID, accept it directly for the matching ontology
        # e.g., 'hp:0001250', 'doid:9352', 'mp:0000001', 'HP:0001250', 'DOID:9352', 'MP:0000001'
        import re

        direct = {}
        hp_match = re.match(r'^(hp:|HP:)(\d+)$', term.strip())
        doid_match = re.match(r'^(doid:|DOID:)(\d+)$', term.strip())
        mp_match = re.match(r'^(mp:|MP:)(\d+)$', term.strip())

        if hp_match:
            direct = {'HPO': f'HP:{hp_match.group(2)}'}
        elif doid_match:
            direct = {'DO': f'DOID:{doid_match.group(2)}'}
        elif mp_match:
            direct = {'MPO': f'MP:{mp_match.group(2)}'}
        else:
            direct = {}
>>>>>>> REPLACE

</suggested_fix>

### Comment 2
<location> `src/phenoqc/batch_processing.py:788` </location>
<code_context>
+                        original_df = pd.concat(all_chunks) if all_chunks else sample_df
+                    except Exception:
+                        original_df = sample_df
+                    bias_df = imputation_bias_report(
+                        original_df=original_df if not original_df.empty else chunk,
+                        imputed_df=chunk,
+                        imputation_mask=getattr(engine, 'imputation_mask', {}),
+                        smd_threshold=float(bias_thresholds.get('smd_threshold', 0.10)),
+                        var_ratio_low=float(bias_thresholds.get('var_ratio_low', 0.5)),
+                        var_ratio_high=float(bias_thresholds.get('var_ratio_high', 2.0)),
+                        ks_alpha=float(bias_thresholds.get('ks_alpha', 0.05)),
+                    )
+            except Exception as _bias_ex:
+                log_activity(f"{file_path}: imputation-bias diagnostic failed: {_bias_ex}", level="warning")
</code_context>

<issue_to_address>
original_df fallback logic may not always reflect the true original data.

Please clarify or document the fallback to sample_df, or ensure sample_df is always a valid stand-in for the original data when all_chunks is empty.
</issue_to_address>

### Comment 3
<location> `tests/test_imputation_bias.py:8` </location>
<code_context>
+from phenoqc.quality_metrics import imputation_bias_report
+
+
+def test_imputation_bias_mean_imputer_warns_on_variance():
+    rng = np.random.RandomState(0)
+    n = 200
+    x = rng.normal(loc=0.0, scale=1.0, size=n)
+    # Inject 30% missing completely at random
+    mask = rng.rand(n) < 0.3
+    df = pd.DataFrame({'x': x})
+    df_missing = df.copy()
+    df_missing.loc[mask, 'x'] = np.nan
+
+    engine = ImputationEngine({'strategy': 'mean'})
+    df_imputed = engine.fit_transform(df_missing)
+    imputation_mask = getattr(engine, 'imputation_mask', {})
+
+    bias_df = imputation_bias_report(
+        original_df=df_missing, imputed_df=df_imputed, imputation_mask=imputation_mask,
+        columns=['x'], smd_threshold=0.05
+    )
+    assert isinstance(bias_df, pd.DataFrame)
+    assert not bias_df.empty
+    # With mean imputation, variance typically shrinks and warn flags can appear
+    assert bool(bias_df.iloc[0]['warn']) in (True, False)
+
+
</code_context>

<issue_to_address>
Good coverage of mean imputation bias, but missing explicit checks for edge cases like all missing or all observed.

Please add tests for cases where the input column contains only missing values or only observed values to verify that the bias report returns an empty DataFrame or handles these scenarios without errors.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_imputation_bias_mean_imputer_warns_on_variance():
    rng = np.random.RandomState(0)
    n = 200
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    # Inject 30% missing completely at random
    mask = rng.rand(n) < 0.3
    df = pd.DataFrame({'x': x})
    df_missing = df.copy()
    df_missing.loc[mask, 'x'] = np.nan

    engine = ImputationEngine({'strategy': 'mean'})
    df_imputed = engine.fit_transform(df_missing)
    imputation_mask = getattr(engine, 'imputation_mask', {})

    bias_df = imputation_bias_report(
        original_df=df_missing, imputed_df=df_imputed, imputation_mask=imputation_mask,
        columns=['x'], smd_threshold=0.05
    )
    assert isinstance(bias_df, pd.DataFrame)
    assert not bias_df.empty
    # With mean imputation, variance typically shrinks and warn flags can appear
    assert bool(bias_df.iloc[0]['warn']) in (True, False)
=======
def test_imputation_bias_mean_imputer_warns_on_variance():
    rng = np.random.RandomState(0)
    n = 200
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    # Inject 30% missing completely at random
    mask = rng.rand(n) < 0.3
    df = pd.DataFrame({'x': x})
    df_missing = df.copy()
    df_missing.loc[mask, 'x'] = np.nan

    engine = ImputationEngine({'strategy': 'mean'})
    df_imputed = engine.fit_transform(df_missing)
    imputation_mask = getattr(engine, 'imputation_mask', {})

    bias_df = imputation_bias_report(
        original_df=df_missing, imputed_df=df_imputed, imputation_mask=imputation_mask,
        columns=['x'], smd_threshold=0.05
    )
    assert isinstance(bias_df, pd.DataFrame)
    assert not bias_df.empty
    # With mean imputation, variance typically shrinks and warn flags can appear
    assert bool(bias_df.iloc[0]['warn']) in (True, False)


def test_imputation_bias_all_missing_column():
    # All values missing
    df_missing = pd.DataFrame({'x': [np.nan] * 10})
    engine = ImputationEngine({'strategy': 'mean'})
    df_imputed = engine.fit_transform(df_missing)
    imputation_mask = getattr(engine, 'imputation_mask', {})
    bias_df = imputation_bias_report(
        original_df=df_missing, imputed_df=df_imputed, imputation_mask=imputation_mask,
        columns=['x'], smd_threshold=0.05
    )
    assert isinstance(bias_df, pd.DataFrame)
    # Should be empty or have no rows for 'x'
    assert bias_df.empty or not (bias_df['column'] == 'x').any()


def test_imputation_bias_all_observed_column():
    # All values observed, no missing
    df_observed = pd.DataFrame({'x': np.arange(10)})
    engine = ImputationEngine({'strategy': 'mean'})
    df_imputed = engine.fit_transform(df_observed)
    imputation_mask = getattr(engine, 'imputation_mask', {})
    bias_df = imputation_bias_report(
        original_df=df_observed, imputed_df=df_imputed, imputation_mask=imputation_mask,
        columns=['x'], smd_threshold=0.05
    )
    assert isinstance(bias_df, pd.DataFrame)
    # Should be empty or have no rows for 'x'
    assert bias_df.empty or not (bias_df['column'] == 'x').any()
>>>>>>> REPLACE

</suggested_fix>

## Security Issues

### Issue 1
<location> `scripts/end_to_end_e2e_cli_test.py:222` </location>

<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.

*Source: opengrep*
</issue_to_address>


jorgeMFS and others added 9 commits August 10, 2025 01:01
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
…run-tests

Initialize alt ID mapping before loading ontologies
@jorgeMFS

@sourcery-ai dismiss @sourcery-ai review @sourcery-ai summary

@sourcery-ai sourcery-ai bot dismissed their stale review August 10, 2025 11:17

Automated Sourcery review dismissed.

@jorgeMFS

@sourcery-ai guide

@jorgeMFS jorgeMFS merged commit 0936d66 into main Aug 10, 2025
4 checks passed
@jorgeMFS jorgeMFS deleted the feat/imputation-improvements branch August 10, 2025 11:33