Conversation
…: track imputation mask; feat(batch): include bias results in *_qc_summary.json
…: controls for bias diag; test: add basic imputation bias tests
… all strategies in imputation bias tests
…taset, dedicated schema/config, and QC JSON validation
…_cli_test; use output/end_to_end and e2e_* filenames; improve naming across script
…e2e_medium_cli_test (medium), e2e_small_quality_metrics_cli_test (small); standardize output folder names; update imputation params script outputs
…rdized to end_to_end_e2e_cli_test.py
…update README to e2e_* scripts
….py wrapper to e2e_medium; keep standardized naming while preserving legacy entry points
…ts (superseded by output/end_to_end)
…e_medium_cli_test.py directly
…gnostics in PDF/MD and GUI; pass thresholds to report; fix PDF indentation issues
…as thresholds at top-level in session_state; avoid list-index TypeError when enabling bias
… Quality when enabled; pass enabled metrics from pipeline
…or quality_metrics and thresholds; fix report section gating; improve ontology mapping (ID indexing + direct ID pass-through)
…d config/schema under scripts/; scenarios now run end-to-end
Reviewer's Guide

This PR implements a comprehensive imputation-bias diagnostic workflow by introducing a dedicated reporting function, tracking per-cell imputation masks, enhancing mapping logic for ontology IDs, extending the QC report generation with a bias diagnostics section (including traffic-light status bars), and exposing new CLI/GUI controls and example scripts to configure and demonstrate end-to-end bias checks.

Sequence diagram for imputation-bias diagnostics in batch QC pipeline

sequenceDiagram
participant CLI
participant BatchProcessor
participant MissingDataEngine
participant QualityMetrics
participant ReportGenerator
CLI->>BatchProcessor: Start batch_process (with bias thresholds)
BatchProcessor->>MissingDataEngine: Impute missing data
MissingDataEngine->>BatchProcessor: Return imputed_df, imputation_mask
BatchProcessor->>QualityMetrics: imputation_bias_report(original_df, imputed_df, imputation_mask, thresholds)
QualityMetrics-->>BatchProcessor: Return bias diagnostics DataFrame
BatchProcessor->>ReportGenerator: generate_qc_report(..., bias_diagnostics, bias_thresholds)
ReportGenerator-->>BatchProcessor: QC report with bias diagnostics
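For orientation, here is a minimal sketch of the call pattern the diagram describes. The keyword arguments mirror the batch-processing snippet quoted under Comment 2 below; the wrapper function name and surrounding variables are illustrative assumptions, not the actual implementation.

```python
# Hedged sketch of the batch wiring shown above; argument names follow the
# diff quoted in Comment 2, everything else (helper name, defaults) is
# illustrative rather than the shipped code.
from phenoqc.quality_metrics import imputation_bias_report

def run_bias_diagnostics(engine, original_df, imputed_df, bias_thresholds):
    # The imputation engine exposes its per-cell mask after fit_transform;
    # fall back to an empty dict if the attribute is missing.
    return imputation_bias_report(
        original_df=original_df,
        imputed_df=imputed_df,
        imputation_mask=getattr(engine, 'imputation_mask', {}),
        smd_threshold=float(bias_thresholds.get('smd_threshold', 0.10)),
        var_ratio_low=float(bias_thresholds.get('var_ratio_low', 0.5)),
        var_ratio_high=float(bias_thresholds.get('var_ratio_high', 2.0)),
        ks_alpha=float(bias_thresholds.get('ks_alpha', 0.05)),
    )
```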
Class diagram for imputation bias diagnostics and mask tracking

classDiagram
class MissingDataEngine {
+imputation_mask: Dict[str, pd.Series]
+fit_transform(df: pd.DataFrame) pd.DataFrame
+_update_imputation_mask(before: pd.DataFrame, after: pd.DataFrame, cols: List[str])
_apply_mean(...)
_apply_median(...)
_apply_mode(...)
_apply_knn(...)
_apply_mice(...)
_apply_svd(...)
}
class BiasRow {
+column: str
+n_obs: int
+n_imp: int
+mean_diff: float
+smd: float
+var_ratio: float
+ks_stat: Optional[float]
+ks_p: Optional[float]
+warn: bool
}
class QualityMetrics {
+imputation_bias_report(...): pd.DataFrame
}
MissingDataEngine --> "imputation_mask" BiasRow
QualityMetrics --> "imputation_bias_report" BiasRow
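A minimal sketch of how one BiasRow could be derived for a single column, using the sqrt(observed variance) denominator discussed in the review comments below. The helper name and edge-case handling are assumptions; the real imputation_bias_report may differ in detail (minimum group sizes, NaN handling, output type).

```python
# Hedged per-column sketch mirroring the BiasRow fields above.
import numpy as np
from scipy import stats

def bias_row(observed, imputed, smd_threshold=0.10,
             var_ratio_low=0.5, var_ratio_high=2.0, ks_alpha=0.05):
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    mean_diff = imputed.mean() - observed.mean()
    obs_var = observed.var(ddof=1) if observed.size > 1 else np.nan
    # SMD uses sqrt(observed variance) as denominator, as flagged in the review.
    smd = mean_diff / np.sqrt(obs_var) if obs_var and obs_var > 0 else np.nan
    var_ratio = (imputed.var(ddof=1) / obs_var
                 if imputed.size > 1 and obs_var and obs_var > 0 else np.nan)
    if observed.size > 1 and imputed.size > 1:
        ks_stat, ks_p = stats.ks_2samp(observed, imputed)
    else:
        ks_stat, ks_p = np.nan, np.nan
    warn = ((not np.isnan(smd) and abs(smd) > smd_threshold)
            or (not np.isnan(var_ratio)
                and not (var_ratio_low <= var_ratio <= var_ratio_high))
            or (not np.isnan(ks_p) and ks_p < ks_alpha))
    return {'n_obs': int(observed.size), 'n_imp': int(imputed.size),
            'mean_diff': mean_diff, 'smd': smd, 'var_ratio': var_ratio,
            'ks_stat': ks_stat, 'ks_p': ks_p, 'warn': bool(warn)}
```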
Class diagram for enhanced OntologyMapper mapping logic

classDiagram
class OntologyMapper {
+PREFIX_ALIASES: dict
+DEFAULT_ID_REGEX: dict
+_id_regex: dict
+_alt_to_primary: Dict[str, Dict[str, str]]
+map_term(term, target_ontologies, custom_mappings) Dict[str, Optional[str]]
+_extract_direct_id(term_norm)
+_normalize_text(text: str)
+_build_id_regex_from_config(cfg: dict)
+_scan_alt_map_obo(ontology_file_path: str)
+_augment_alt_ids_from_obo(ontology_file_path: str, mapping: Dict[str, str])
}
OntologyMapper --> "_alt_to_primary" Ontology
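For context on _scan_alt_map_obo / _augment_alt_ids_from_obo, here is a sketch of the kind of alt_id-to-primary map they could build from an OBO file. The OBO stanza format (id:/alt_id: lines under [Term]) is standard; how the actual methods store or merge the result is an assumption here.

```python
# Hedged sketch of an alt_id -> primary id scan over an OBO ontology file.
from typing import Dict

def scan_alt_map_obo(ontology_file_path: str) -> Dict[str, str]:
    alt_to_primary: Dict[str, str] = {}
    current_id = None
    with open(ontology_file_path, encoding='utf-8') as fh:
        for line in fh:
            line = line.strip()
            if line == '[Term]':
                current_id = None          # start of a new term stanza
            elif line.startswith('id: '):
                current_id = line[4:].strip()
            elif line.startswith('alt_id: ') and current_id:
                alt_to_primary[line[8:].strip()] = current_id
    return alt_to_primary
```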
File-Level Changes
Hey @jorgeMFS - I've reviewed your changes - here's some feedback:
Blocking issues:
- Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)
General comments:
- I see the trigger‐reason logic (SMD, variance ratio, KS p‐value) is copied in the PDF report builder, Markdown exporter, and GUI—consider extracting it to a shared utility so any future tweaks stay in sync (a sketch of such a helper follows this list).
- The code for parsing whether bias diagnostics are enabled (list vs dict, top‐level vs nested under quality_metrics) is scattered across CLI, GUI, and batch processing—centralizing that parsing into one helper would reduce subtle inconsistencies.
- In `imputation_bias_report` you use sqrt(obs_var) as the pooled standard deviation—double-check this against the conventional two‐sample SMD formula or document this choice to avoid confusion about how SMD is calculated.
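A minimal sketch of the shared trigger-reason helper suggested in the first comment; the function name, signature, and default thresholds are assumptions, while the row fields come from BiasRow above.

```python
# Hedged sketch of a single helper the PDF, Markdown, and GUI renderers could
# all call; `row` is assumed to be a dict or pandas Series with BiasRow fields.
def bias_trigger_reasons(row, smd_threshold=0.10, var_ratio_low=0.5,
                         var_ratio_high=2.0, ks_alpha=0.05):
    reasons = []
    if abs(row['smd']) > smd_threshold:
        reasons.append(f"|SMD| {abs(row['smd']):.2f} > {smd_threshold}")
    if not (var_ratio_low <= row['var_ratio'] <= var_ratio_high):
        reasons.append(f"variance ratio {row['var_ratio']:.2f} outside "
                       f"[{var_ratio_low}, {var_ratio_high}]")
    if row.get('ks_p') is not None and row['ks_p'] < ks_alpha:
        reasons.append(f"KS p-value {row['ks_p']:.3g} < {ks_alpha}")
    return reasons
```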
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- I see the trigger‐reason logic (SMD, variance ratio, KS p‐value) is copied in the PDF report builder, Markdown exporter, and GUI—consider extracting it to a shared utility so any future tweaks stay in sync.
- The code for parsing whether bias diagnostics are enabled (list vs dict, top‐level vs nested under quality_metrics) is scattered across CLI, GUI, and batch processing—centralizing that parsing into one helper would reduce subtle inconsistencies.
- In `imputation_bias_report` you use sqrt(obs_var) as the pooled standard deviation—double-check this against the conventional two‐sample SMD formula or document this choice to avoid confusion about how SMD is calculated.
## Individual Comments
### Comment 1
<location> `src/phenoqc/mapping.py:185` </location>
<code_context>
mappings[ontology_id] = custom_id
return mappings
+ # If the input already looks like an ontology ID, accept it directly for the matching ontology
+ # e.g., 'hp:0001250', 'doid:9352', 'mp:0000001'
+ if term_lower.startswith('hp:'):
+ direct = {'HPO': term.replace('hp:', 'HP:')}
+ # fill non-targets with None below
+ elif term_lower.startswith('doid:'):
+ direct = {'DO': term.replace('doid:', 'DOID:')}
+ elif term_lower.startswith('mp:'):
+ direct = {'MPO': term.replace('mp:', 'MP:')}
+ else:
+ direct = {}
+
</code_context>
<issue_to_address>
Direct ID pass-through logic may not handle case variations or alternative prefixes.
The current logic only matches lowercase prefixes. To improve robustness, normalize input casing or allow for additional prefix formats.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
# If the input already looks like an ontology ID, accept it directly for the matching ontology
# e.g., 'hp:0001250', 'doid:9352', 'mp:0000001'
if term_lower.startswith('hp:'):
direct = {'HPO': term.replace('hp:', 'HP:')}
# fill non-targets with None below
elif term_lower.startswith('doid:'):
direct = {'DO': term.replace('doid:', 'DOID:')}
elif term_lower.startswith('mp:'):
direct = {'MPO': term.replace('mp:', 'MP:')}
else:
direct = {}
=======
# If the input already looks like an ontology ID, accept it directly for the matching ontology
# e.g., 'hp:0001250', 'doid:9352', 'mp:0000001', 'HP:0001250', 'DOID:9352', 'MP:0000001'
import re
direct = {}
hp_match = re.match(r'^(hp:|HP:)(\d+)$', term.strip())
doid_match = re.match(r'^(doid:|DOID:)(\d+)$', term.strip())
mp_match = re.match(r'^(mp:|MP:)(\d+)$', term.strip())
if hp_match:
direct = {'HPO': f'HP:{hp_match.group(2)}'}
elif doid_match:
direct = {'DO': f'DOID:{doid_match.group(2)}'}
elif mp_match:
direct = {'MPO': f'MP:{mp_match.group(2)}'}
else:
direct = {}
>>>>>>> REPLACE
</suggested_fix>
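A hedged alternative to the suggestion above that accepts any casing via one pattern and a prefix map. Only the hp/doid/mp prefixes from the snippet are covered, and the helper name is illustrative, not part of the existing mapper.

```python
# Hedged variant: one case-insensitive pattern plus a prefix -> (ontology key,
# canonical prefix) map, so 'Hp:0001250' and 'DOID:9352' both pass through.
import re

_DIRECT_ID = re.compile(r'^(hp|doid|mp):(\d+)$', re.IGNORECASE)
_PREFIX_TO_ONTOLOGY = {'hp': ('HPO', 'HP'), 'doid': ('DO', 'DOID'), 'mp': ('MPO', 'MP')}

def direct_id_passthrough(term: str) -> dict:
    match = _DIRECT_ID.match(term.strip())
    if not match:
        return {}
    ontology, canonical = _PREFIX_TO_ONTOLOGY[match.group(1).lower()]
    return {ontology: f'{canonical}:{match.group(2)}'}
```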
### Comment 2
<location> `src/phenoqc/batch_processing.py:788` </location>
<code_context>
+ original_df = pd.concat(all_chunks) if all_chunks else sample_df
+ except Exception:
+ original_df = sample_df
+ bias_df = imputation_bias_report(
+ original_df=original_df if not original_df.empty else chunk,
+ imputed_df=chunk,
+ imputation_mask=getattr(engine, 'imputation_mask', {}),
+ smd_threshold=float(bias_thresholds.get('smd_threshold', 0.10)),
+ var_ratio_low=float(bias_thresholds.get('var_ratio_low', 0.5)),
+ var_ratio_high=float(bias_thresholds.get('var_ratio_high', 2.0)),
+ ks_alpha=float(bias_thresholds.get('ks_alpha', 0.05)),
+ )
+ except Exception as _bias_ex:
+ log_activity(f"{file_path}: imputation-bias diagnostic failed: {_bias_ex}", level="warning")
</code_context>
<issue_to_address>
original_df fallback logic may not always reflect the true original data.
Please clarify or document the fallback to sample_df, or ensure sample_df is always a valid stand-in for the original data when all_chunks is empty.
</issue_to_address>
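One hedged way to make the fallback order explicit; variable names follow the diff above, but the helper itself and its documented preference order are assumptions, not the current code.

```python
# Hedged sketch: choose the pre-imputation frame used for bias diagnostics.
import pandas as pd

def resolve_original_df(all_chunks, sample_df, chunk):
    """Preference order: full concatenation of raw chunks; otherwise the
    sampled preview frame (diagnostics then cover only sampled rows);
    otherwise the imputed chunk itself, which makes the comparison
    degenerate and should be logged as such."""
    try:
        if all_chunks:
            return pd.concat(all_chunks)
    except Exception:
        pass
    if sample_df is not None and not sample_df.empty:
        return sample_df
    return chunk
```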
### Comment 3
<location> `tests/test_imputation_bias.py:8` </location>
<code_context>
+from phenoqc.quality_metrics import imputation_bias_report
+
+
+def test_imputation_bias_mean_imputer_warns_on_variance():
+    rng = np.random.RandomState(0)
+    n = 200
+    x = rng.normal(loc=0.0, scale=1.0, size=n)
+    # Inject 30% missing completely at random
+    mask = rng.rand(n) < 0.3
+    df = pd.DataFrame({'x': x})
+    df_missing = df.copy()
+    df_missing.loc[mask, 'x'] = np.nan
+
+    engine = ImputationEngine({'strategy': 'mean'})
+    df_imputed = engine.fit_transform(df_missing)
+    imputation_mask = getattr(engine, 'imputation_mask', {})
+
+    bias_df = imputation_bias_report(
+        original_df=df_missing, imputed_df=df_imputed, imputation_mask=imputation_mask,
+        columns=['x'], smd_threshold=0.05
+    )
+    assert isinstance(bias_df, pd.DataFrame)
+    assert not bias_df.empty
+    # With mean imputation, variance typically shrinks and warn flags can appear
+    assert bool(bias_df.iloc[0]['warn']) in (True, False)
+
+
</code_context>
<issue_to_address>
Good coverage of mean imputation bias, but missing explicit checks for edge cases like all missing or all observed.
Please add tests for cases where the input column contains only missing values or only observed values to verify that the bias report returns an empty DataFrame or handles these scenarios without errors.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
def test_imputation_bias_mean_imputer_warns_on_variance():
    rng = np.random.RandomState(0)
    n = 200
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    # Inject 30% missing completely at random
    mask = rng.rand(n) < 0.3
    df = pd.DataFrame({'x': x})
    df_missing = df.copy()
    df_missing.loc[mask, 'x'] = np.nan
    engine = ImputationEngine({'strategy': 'mean'})
    df_imputed = engine.fit_transform(df_missing)
    imputation_mask = getattr(engine, 'imputation_mask', {})
    bias_df = imputation_bias_report(
        original_df=df_missing, imputed_df=df_imputed, imputation_mask=imputation_mask,
        columns=['x'], smd_threshold=0.05
    )
    assert isinstance(bias_df, pd.DataFrame)
    assert not bias_df.empty
    # With mean imputation, variance typically shrinks and warn flags can appear
    assert bool(bias_df.iloc[0]['warn']) in (True, False)
=======
def test_imputation_bias_mean_imputer_warns_on_variance():
    rng = np.random.RandomState(0)
    n = 200
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    # Inject 30% missing completely at random
    mask = rng.rand(n) < 0.3
    df = pd.DataFrame({'x': x})
    df_missing = df.copy()
    df_missing.loc[mask, 'x'] = np.nan
    engine = ImputationEngine({'strategy': 'mean'})
    df_imputed = engine.fit_transform(df_missing)
    imputation_mask = getattr(engine, 'imputation_mask', {})
    bias_df = imputation_bias_report(
        original_df=df_missing, imputed_df=df_imputed, imputation_mask=imputation_mask,
        columns=['x'], smd_threshold=0.05
    )
    assert isinstance(bias_df, pd.DataFrame)
    assert not bias_df.empty
    # With mean imputation, variance typically shrinks and warn flags can appear
    assert bool(bias_df.iloc[0]['warn']) in (True, False)


def test_imputation_bias_all_missing_column():
    # All values missing
    df_missing = pd.DataFrame({'x': [np.nan] * 10})
    engine = ImputationEngine({'strategy': 'mean'})
    df_imputed = engine.fit_transform(df_missing)
    imputation_mask = getattr(engine, 'imputation_mask', {})
    bias_df = imputation_bias_report(
        original_df=df_missing, imputed_df=df_imputed, imputation_mask=imputation_mask,
        columns=['x'], smd_threshold=0.05
    )
    assert isinstance(bias_df, pd.DataFrame)
    # Should be empty or have no rows for 'x'
    assert bias_df.empty or not (bias_df['column'] == 'x').any()


def test_imputation_bias_all_observed_column():
    # All values observed, no missing
    df_observed = pd.DataFrame({'x': np.arange(10)})
    engine = ImputationEngine({'strategy': 'mean'})
    df_imputed = engine.fit_transform(df_observed)
    imputation_mask = getattr(engine, 'imputation_mask', {})
    bias_df = imputation_bias_report(
        original_df=df_observed, imputed_df=df_imputed, imputation_mask=imputation_mask,
        columns=['x'], smd_threshold=0.05
    )
    assert isinstance(bias_df, pd.DataFrame)
    # Should be empty or have no rows for 'x'
    assert bias_df.empty or not (bias_df['column'] == 'x').any()
>>>>>>> REPLACE
</suggested_fix>
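Once added, the edge-case tests should run alongside the existing one via `pytest tests/test_imputation_bias.py -q`.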
## Security Issues
### Issue 1
<location> `scripts/end_to_end_e2e_cli_test.py:222` </location>
<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
*Source: opengrep*
</issue_to_address>
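For the flagged call, the usual mitigation is to pass the command as an argument list with shell=False and to quote values only when rendering a command string for logs. A generic sketch follows; the script and CLI flags below are placeholders, not the actual e2e script.

```python
# Hedged, generic illustration of the mitigation for the opengrep finding:
# argv as a list avoids shell parsing entirely; shlex.quote() is only used to
# render a safe string for logging. Module name and flags are placeholders.
import shlex
import subprocess
import sys

def run_phenoqc(config_path: str, input_path: str) -> None:
    cmd = [sys.executable, '-m', 'phenoqc', '--config', config_path, input_path]
    print('Running:', ' '.join(shlex.quote(part) for part in cmd))
    subprocess.run(cmd, check=True, shell=False)
```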
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
…run-tests Initialize alt ID mapping before loading ontologies
…e imputation bias handling
@sourcery-ai dismiss @sourcery-ai review @sourcery-ai summary
Automated Sourcery review dismissed.
@sourcery-ai guide
Summary by Sourcery
Add comprehensive imputation‐bias diagnostics throughout the pipeline, including a new bias report function, CLI/GUI configuration options, report visualization enhancements, imputation mask tracking, mapping improvements, and accompanying example scripts and tests.
New Features:
Enhancements:
Documentation:
Tests: