
Enhance MetaTraits transform with chemical mapping and expanded METPO predicates #531

Merged
realmarcin merged 92 commits into master from fix_metatraits on Apr 18, 2026

@realmarcin
Collaborator

Summary

This PR enhances the MetaTraits transform to resolve more chemical-related traits by integrating unified chemical mapping infrastructure and expanding METPO predicate coverage.

Changes

Phase 1: Chemical Mapping Infrastructure

  • ✅ Merged chemical_mappings branch (164,705 ChEBI IDs)
  • ✅ Added mappings/unified_chemical_mappings.tsv.gz (8.4 MB)
  • ✅ Added kg_microbe/utils/chemical_mapping_utils.py (ChemicalMappingLoader class)

Phase 2-4: MetaTraits Transform Enhancement

  • ✅ Integrated ChemicalMappingLoader in metatraits transform
  • ✅ Implemented _resolve_chemical_trait() method with 8 pattern matchers:
    • carbon source: X → METPO:2000006 (uses as carbon source)
    • produces: X → METPO:2000202 (produces)
    • ferments: X → METPO:2000011 (ferments)
    • hydrolyzes: X → METPO:2000013 (hydrolyzes)
    • oxidizes: X → METPO:2000016 (oxidizes)
    • reduces: X → METPO:2000017 (reduces)
    • degrades: X → METPO:2000007 (degrades)
    • utilizes: X → METPO:2000001 (organism interacts with chemical)
  • ✅ Expanded METPO predicate mappings from 3 to 30 predicates
  • ✅ Added biolink:interacts_with → RO:0002434 mapping
  • ✅ Integrated chemical resolver as Tier 1.5 in lookup hierarchy
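
A minimal sketch of how prefix-based matching of this kind could look. The prefix-to-predicate pairs come from the list above and the category follows Phase 5; the function name `resolve_chemical_trait`, the plain `name_to_chebi` dict standing in for ChemicalMappingLoader, and the shape of the returned dict are assumptions for illustration, not the code in metatraits.py.

```python
from typing import Dict, Optional

# Trait prefix -> METPO predicate, as listed in the PR description.
PATTERN_TO_PREDICATE: Dict[str, str] = {
    "carbon source": "METPO:2000006",
    "produces": "METPO:2000202",
    "ferments": "METPO:2000011",
    "hydrolyzes": "METPO:2000013",
    "oxidizes": "METPO:2000016",
    "reduces": "METPO:2000017",
    "degrades": "METPO:2000007",
    "utilizes": "METPO:2000001",
}


def resolve_chemical_trait(trait: str, name_to_chebi: Dict[str, str]) -> Optional[dict]:
    """Split 'prefix: chemical' and look up both halves (hypothetical helper)."""
    prefix, sep, chemical = trait.partition(":")
    if not sep:
        return None  # no 'prefix: X' pattern, leave it to the other tiers
    predicate = PATTERN_TO_PREDICATE.get(prefix.strip().lower())
    chebi_id = name_to_chebi.get(chemical.strip().lower())
    if predicate is None or chebi_id is None:
        return None
    return {
        "object_id": chebi_id,
        "category": "biolink:ChemicalSubstance",
        "name": chemical.strip().lower(),
        "predicate": predicate,
    }


# Illustrative one-entry lookup table; the real loader holds 164,705 ChEBI IDs.
example = resolve_chemical_trait("produces: acetate", {"acetate": "CHEBI:30089"})
```

A trait that does not follow the `prefix: X` shape, or whose chemical name is unknown, returns `None` so that the remaining tiers can still try to resolve it.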

Phase 5: ChEBI Category Fix

  • ✅ Fixed ChEBI categories to use biolink:ChemicalSubstance
  • ✅ Updated constants.py to normalize SmallMolecule → ChemicalSubstance
  • ✅ Added new CHEBI_CATEGORY constant

Expected Impact

  • 10-30% increase in mapped edges for chemical-related traits
  • ~100K+ reduction in unmapped_traits.tsv
  • More semantic specificity: ferments/oxidizes/degrades vs generic capable_of
  • Standardized ChEBI IDs with canonical names

Baseline Metrics (Before Enhancement)

  • Edges: 829,353
  • Unmapped traits: 5,270,596
  • Top unmapped patterns: produces: acetate (43,777), carbon source: methyl (43,770), utilizes: citrate (41,047)

Files Changed

Modified:

  • kg_microbe/transform_utils/metatraits/metatraits.py (+127, -24 lines)
  • kg_microbe/transform_utils/constants.py (ChEBI category updates)

Added (from chemical_mappings merge):

  • mappings/unified_chemical_mappings.tsv.gz (8.4 MB, 164,705 ChEBI entries)
  • kg_microbe/utils/chemical_mapping_utils.py (ChemicalMappingLoader)
  • scripts/consolidate_chemical_mappings.py
  • tests/test_chemical_mapping_utils.py
  • Documentation in mappings/README.md and CONSOLIDATION_SUMMARY.md

Testing

  • ✅ Code formatted (black + ruff)
  • ✅ All commits follow project conventions
  • ⏳ Full transform validation pending completion

Documentation

Comprehensive implementation summary in METATRAITS_ENHANCEMENT_SUMMARY.md

🤖 Generated with Claude Code

realmarcin and others added 17 commits March 20, 2026 19:24
The transform was looking for input files in data/raw/metatraits/
but download.yaml places them directly in data/raw/.

Changed: input_base = Path(self.input_base_dir) / "metatraits"
To: input_base = Path(self.input_base_dir)

This aligns with the actual file locations:
- data/raw/ncbi_species_summary.jsonl.gz
- data/raw/ncbi_genus_summary.jsonl.gz
- data/raw/ncbi_family_summary.jsonl.gz

Transform now runs successfully and generates ~18,940 edges.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The adapter was trying to open ncbitaxon.owl as a SQLite database,
which always failed and triggered remote download fallback.

Changes:
- Point to ncbitaxon.db instead of NCBITAXON_SOURCE (which is .owl)
- Add existence check before trying local database
- Improve error messages to show which database is being used
- Handle corrupted database gracefully with informative messages
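
The existence check described above could be sketched as follows. The function name `find_local_sqlite_db` and the header check are assumptions for illustration; the only fact relied on is that every SQLite 3 file starts with the 16-byte string `SQLite format 3\0`, which an .owl file does not.

```python
from pathlib import Path
from typing import Optional

# First 16 bytes of every SQLite 3 database file.
SQLITE_MAGIC = b"SQLite format 3\x00"


def find_local_sqlite_db(raw_dir: Path, db_name: str = "ncbitaxon.db") -> Optional[Path]:
    """Return the local database path if it exists and looks like SQLite (hypothetical helper)."""
    candidate = Path(raw_dir) / db_name
    if not candidate.is_file():
        return None  # nothing local, caller may fall back to a download
    with candidate.open("rb") as handle:
        header = handle.read(len(SQLITE_MAGIC))
    # An OWL/XML file or a truncated download fails this check.
    return candidate if header == SQLITE_MAGIC else None
```

A caller would only trigger the remote download when this returns `None`, and can log which path was chosen otherwise.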

This prevents unnecessary 2GB downloads when a local database exists.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Brings in unified chemical mapping infrastructure:
- mappings/unified_chemical_mappings.tsv.gz (164,705 ChEBI IDs)
- kg_microbe/utils/chemical_mapping_utils.py (ChemicalMappingLoader)
- Updated bacdive, mediadive, ctd transforms to use unified mappings
- Documentation in mappings/README.md and CONSOLIDATION_SUMMARY.md

This enables metatraits transform to resolve chemical traits to ChEBI IDs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… predicates

**Phase 1: Merge chemical_mappings branch** ✓ (already committed)

**Phase 2-4: Integration and Enhancement**

1. **Add ChemicalMappingLoader integration**
   - Import ChemicalMappingLoader from kg_microbe.utils.chemical_mapping_utils
   - Initialize loader in __init__ with graceful fallback on error
   - Loads 164,705 ChEBI IDs with synonyms for chemical trait resolution

2. **Implement _resolve_chemical_trait() method**
   - Pattern-based chemical trait resolver (Tier 1.5 in lookup hierarchy)
   - Handles 8 common trait patterns:
     * carbon source: X -> METPO:2000006 (uses as carbon source)
     * produces: X -> METPO:2000202 (produces)
     * ferments: X -> METPO:2000011 (ferments)
     * hydrolyzes: X -> METPO:2000013 (hydrolyzes)
     * oxidizes: X -> METPO:2000016 (oxidizes)
     * reduces: X -> METPO:2000017 (reduces)
     * degrades: X -> METPO:2000007 (degrades)
     * utilizes: X -> METPO:2000001 (organism interacts with chemical)
   - Returns ChEBI ID, category, canonical name, and METPO predicate

3. **Expand METPO predicate mappings**
   - Increased from 3 to 30 METPO predicates
   - Added chemical interaction predicates (positive/negative)
   - Added enzyme activity, growth medium, assimilation predicates
   - Added biolink:interacts_with -> RO:0002434 mapping

4. **Integrate chemical resolver into trait resolution**
   - Lookup order: microbial-trait-mappings (Tier 1) -> chemical resolver
     (Tier 1.5) -> METPO/custom_curies (Tier 2/3)
   - Chemical resolver extracts chemical name from trait patterns
   - Resolves to standardized ChEBI IDs with canonical names
   - Converts METPO predicates to biolink for KGX compliance

**Expected Impact:**
- 10-30% increase in mapped edges for chemical-related traits
- More specific predicates (ferments, degrades, oxidizes vs generic capable_of)
- Better queryability with standardized ChEBI IDs and semantic predicates

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changes:
- Update metatraits chemical resolver to use biolink:ChemicalSubstance
- Update constants.py to make ChemicalSubstance the standard for CHEBI
- Add new CHEBI_CATEGORY constant (biolink:ChemicalSubstance)
- Deprecate SmallMolecule in favor of ChemicalSubstance
- Update SMALL_MOLECULE_CATEGORY to normalize to ChemicalSubstance

Rationale: ChemicalSubstance is the preferred category for CHEBI-mapped
chemicals. SmallMolecule should be normalized to ChemicalSubstance for
consistency across transforms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The microbial_trait_mappings.py was mapping CHEBI to ChemicalEntity,
which overrode the ChemicalSubstance category from the chemical resolver
and constants.

Changed:
- _OBJECT_SOURCE_TO_CATEGORY["CHEBI"] from ChemicalEntity to ChemicalSubstance

This ensures all CHEBI-mapped chemicals use consistent ChemicalSubstance
category across all mapping sources.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement parallel processing for MetaTraits and MetaTraits-GTDB transforms to reduce
runtime from 5-8 hours to 1.5-2.5 hours. The optimization uses per-file parallelization
with resource-aware worker scaling based on CPU cores and available memory.

Key features:
- Auto-enabled multiprocessing with intelligent worker count detection (CPU/memory aware)
- Per-file parallelization using multiprocessing.Pool (2-4 workers for typical datasets)
- Backward compatible: can disable via METATRAITS_MULTIPROCESSING=false
- Manual override via METATRAITS_WORKERS environment variable
- Automatic output merging and deduplication using pandas
- Each worker gets independent OAK adapter instance (SQLite read-only safe)

Implementation details:
- Added 8 new methods to MetaTraitsTransform class
- Resource calculation: min(CPU_cores-1, available_memory/3GB, file_count)
- Worker function at module level for multiprocessing pickle compatibility
- Shared read-only state: ncbitaxon cache, trait mappings, metpo mappings
- Worker-local state: OAK adapter, chemical loader, deduplication sets
- Automatic fallback to sequential mode for single-file inputs
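
The resource calculation above can be sketched as a small function. The formula `min(CPU_cores-1, available_memory/3GB, file_count)` is from the commit message; the function name, the explicit `available_memory_bytes` parameter, and the floor of one worker are assumptions made so the sketch runs without psutil.

```python
import os
from typing import Optional

BYTES_PER_WORKER = 3 * 1024**3  # 3 GB budget per worker, per the formula above


def calculate_workers(
    file_count: int,
    available_memory_bytes: int,
    cpu_cores: Optional[int] = None,
) -> int:
    """min(CPU_cores - 1, available_memory / 3GB, file_count), never below 1."""
    cores = cpu_cores if cpu_cores is not None else (os.cpu_count() or 1)
    by_cpu = cores - 1  # leave one core for the parent process
    by_memory = available_memory_bytes // BYTES_PER_WORKER
    # Assumption: fall back to a single worker rather than returning 0.
    return max(1, min(by_cpu, by_memory, file_count))
```

In the transform itself the memory figure would presumably come from `psutil.virtual_memory().available`, since the commit adds psutil for memory detection.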

Performance:
- Sequential: 5-8 hours, 1 CPU core, ~3GB RAM
- Parallel: 1.5-2.5 hours, 2-4 CPU cores, 6-12GB RAM
- Speedup: 2-3x with 50-80% CPU utilization

Changes:
- kg_microbe/transform_utils/metatraits/metatraits.py: Add multiprocessing support
- kg_microbe/transform_utils/metatraits_gtdb/: New GTDB variant with inherited parallelization
- pyproject.toml: Add psutil dependency for memory detection
- CLAUDE.md: Document multiprocessing configuration and usage
- tests/test_metatraits.py: Fix test expectations for ChemicalSubstance category
- MULTIPROCESSING_IMPLEMENTATION_SUMMARY.md: Comprehensive implementation documentation

Also includes code formatting updates from ruff/black (D211, D212 docstring fixes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous message 'Downloading NCBITaxon database from OBO library' was
misleading when running metatraits transform - users thought they were
downloading metatraits data.

New message clarifies:
- What is being downloaded (NCBITaxon database, not metatraits data)
- Why it's needed (for taxon name resolution)
- That it's a one-time download (~2GB)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Automatically creates symlink from data/raw/ncbitaxon.db to ~/.data/oaklib/ncbitaxon.db to avoid duplicating the 12GB database file. Shows accurate status messages ("Using cached database" vs "Downloading") and creates symlink after first download for future runs.
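
A minimal sketch of the symlink step, assuming a POSIX filesystem. The helper name `ensure_symlink` is hypothetical; the `FileExistsError` handling mirrors what a later commit in this PR describes for parallel workers.

```python
import os
from pathlib import Path


def ensure_symlink(link_path: Path, target: Path) -> bool:
    """Create link_path -> target. Return True if created, False if it already existed."""
    link_path.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.symlink(target, link_path)
        return True
    except FileExistsError:
        # Another run (or another worker) already created the link.
        return False
```

Here `link_path` would be data/raw/ncbitaxon.db and `target` the cached file under ~/.data/oaklib/, so the large database exists on disk only once.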

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements chunked parallel processing for single-file metatraits transforms to achieve 3-4x speedup. Splits large JSONL files into chunks distributed across workers instead of requiring multiple input files for parallelism.

Also configures GTDB metatraits to process only species-level traits, excluding genus and family levels per requirements.

Performance improvement: 22h sequential → 6-7h parallel for 65K GTDB species (~105 traits each = 6.8M total traits).
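
The chunking idea can be illustrated with a small generator that groups JSONL lines into fixed-size batches for distribution across workers. The name `iter_chunks` and the fixed `chunk_size` parameter are assumptions; the actual chunk sizing in the transform is not shown in this PR text.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most chunk_size lines."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # input exhausted
        yield chunk


# Five records split into chunks of two; the last chunk is shorter.
chunks = list(iter_chunks(["r1", "r2", "r3", "r4", "r5"], 2))
```

Because the generator consumes its input lazily, it can be fed directly from a file handle (for example one opened with `gzip.open(path, "rt")`) without loading the whole file into memory.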

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GTDB transform now exclusively uses GTDB metadata dictionary for taxon resolution (O(1) lookups), completely bypassing OAK adapter initialization and preventing any OAK API calls.

- Disable OAK adapter initialization in GTDB transform
- Override _get_ncbitaxon_impl() to raise error if accidentally called
- Use only GTDB metadata mapping (fast dictionary lookups)
- No NCBITaxon nodes.tsv needed - GTDB mapping is more complete
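
A sketch of the guard described above, using simplified stand-in classes. Only the method name `_get_ncbitaxon_impl` comes from the commit message; the class names, the `gtdb_to_ncbi` dict, and `resolve_taxon` are hypothetical.

```python
from typing import Dict, Optional


class BaseTransformSketch:
    """Stand-in for the parent transform, which would normally build an OAK adapter."""

    def _get_ncbitaxon_impl(self) -> str:
        return "oak-adapter"  # placeholder for an expensive adapter


class GTDBTransformSketch(BaseTransformSketch):
    """Stand-in for the GTDB variant: dictionary lookups only."""

    def __init__(self, gtdb_to_ncbi: Dict[str, str]) -> None:
        self.gtdb_to_ncbi = gtdb_to_ncbi

    def _get_ncbitaxon_impl(self) -> str:
        # Fail loudly if any inherited code path tries to reach OAK.
        raise RuntimeError("OAK adapter is disabled in the GTDB transform")

    def resolve_taxon(self, name: str) -> Optional[str]:
        return self.gtdb_to_ncbi.get(name)  # O(1) dictionary lookup
```

Raising in the override turns an accidental slow path into an immediate, visible error rather than a silent multi-hour regression.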

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers were hardcoded to create MetaTraitsTransform instances instead of the actual subclass (e.g., MetaTraitsGTDBTransform), causing GTDB workers to use OAK adapter instead of GTDB metadata mapping.

- Pass transform_class in shared init data to workers
- Workers now instantiate the correct class type
- GTDB workers restore GTDB-specific state and disable OAK adapter
- Eliminates "Using cached NCBITaxon database from OAK" message in GTDB workers
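
The fix amounts to passing the concrete class to the workers instead of hardcoding the parent. A minimal sketch, with simplified stand-in classes and hypothetical helper names (`build_shared_init`, `worker_init`); only the `transform_class` key is taken from the commit message.

```python
class MetaTraitsSketch:
    """Stand-in for the base transform."""

    uses_oak = True


class MetaTraitsGTDBSketch(MetaTraitsSketch):
    """Stand-in for the GTDB subclass."""

    uses_oak = False


def build_shared_init(transform: MetaTraitsSketch) -> dict:
    # Record the concrete class so workers rebuild the same type.
    return {"transform_class": type(transform)}


def worker_init(shared: dict) -> MetaTraitsSketch:
    # Before the fix this would have been a hardcoded MetaTraitsSketch().
    return shared["transform_class"]()


worker_transform = worker_init(build_shared_init(MetaTraitsGTDBSketch()))
```

Classes are picklable by reference as long as they are defined at module level, which is why this pattern works with multiprocessing.Pool.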

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two critical fixes for metatraits chunked parallelism:

1. Worker count calculation: Add _calculate_optimal_workers_for_chunking() that doesn't limit by file count (was selecting only 1 worker for 1 file, defeating the purpose of chunking). Now properly uses min(CPU, memory) for optimal parallelism.

2. drop_duplicates API: Fix incorrect usage - pandas_utils.drop_duplicates() expects file path, not DataFrame. Use pandas native DataFrame.drop_duplicates() method for in-memory deduplication during merge.

Expected improvement: 1 worker → 7 workers (7x speedup for 65K species)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
eutils is a transitive dependency via oaklib that uses the deprecated pkg_resources API. The package is unmaintained, but the warning doesn't affect functionality. Suppress the warning to reduce noise in transform output.

The code used a relative Path("data/transformed") instead of self.output_base_dir,
causing output files to land in CWD/data/transformed/ rather than the
correct subdirectory data/transformed/metatraits_gtdb/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Analyzed 902 unique unmapped traits from metatraits_gtdb transform to identify
gaps in METPO ontology coverage. Key findings:

- 31 phenotypic traits need new METPO class terms (cell morphology, genomic
  qualities, environmental tolerances)
- 11 metabolic predicates needed (assimilates, energy source, nitrogen source,
  electron donor)
- 581 chemical metabolic traits have pattern resolvers but need new predicates
- 151 traits have predicates but ChEBI lookup fails

Files added:
- additional_metpo_mappings.tsv: Categorized mapping recommendations (42 types)
- METATRAITS_UNMAPPED_ANALYSIS.md: Technical analysis with frequency breakdown
- METPO_TERM_REQUESTS.md: Formal term requests ready for METPO maintainers

Expected impact: 85% coverage improvement for unmapped traits when implemented.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CRITICAL BUG FIXES (phenotype_mappings.tsv):
Fixed 8 incorrect METPO CURIEs that were causing scientifically wrong trait
attributions in the knowledge graph, and replaced 1 nonexistent CURIE. Each line
shows the old CURIE (with the label it actually denotes) → the corrected CURIE:

- gram positive: METPO:1000606 (actually "obligately aerobic") → METPO:1000698 ✓
- gram negative: METPO:1000607 (actually "obligately anaerobic") → METPO:1000699 ✓
- sporulation: METPO:1000614 (actually "psychrophilic") → METPO:1000870 ✓
- obligate aerobic: METPO:1000616 (actually "thermophilic") → METPO:1000606 ✓
- obligate anaerobic: METPO:1000870 (actually "sporulation") → METPO:1000607 ✓
- presence of motility: METPO:1002005 (actually "Fermentation") → METPO:1000702 ✓
- psychrophilic: METPO:1000660 (actually "phototrophic") → METPO:1000614 ✓
- thermophilic: METPO:1000656 (actually "photoautotrophic") → METPO:1000616 ✓
- voges-proskauer test: METPO:1005017 (doesn't exist) → KGM custom term ✓

Impact: Gram-positive bacteria were being labeled as "obligately aerobic",
and other similar misattributions affecting phenotype accuracy.

ARCHITECTURE CHANGE (metatraits.py):
Implemented METPO-first resolution order to prioritize authoritative ontology:

NEW PRIORITY ORDER:
1. Tier 1: METPO ontology mappings (HIGHEST PRIORITY)
2. Tier 2: Manual external ontology mappings (ChEBI, GO, EC - skips METPO duplicates)
3. Tier 3: Pattern-based resolvers (chemical, metabolic, growth, trophic, enzyme, phenotype)

OLD PRIORITY ORDER (replaced):
1. Manual microbial_mappings (was first)
2. Pattern resolvers (chemical, metabolic, etc.)
3. METPO mappings (was fallback)

Benefits:
- METPO is authoritative source for microbial phenotype ontology
- Fixes bugs automatically when METPO is updated
- Reduces manual mapping maintenance
- Ensures consistency with ontology standards
- Manual mappings now only used for external ontologies (ChEBI, GO, EC)

Changes applied to both resolution blocks in _process_jsonl_file_streaming().
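
The METPO-first order can be sketched as a simple fall-through. The tier order comes from the commit message; the function name, the tier labels, and the dict and callable arguments are assumptions, and the sketch leaves out the Tier 2 METPO-duplicate filter for brevity.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A pattern resolver takes a normalized trait and returns a CURIE or None.
Resolver = Callable[[str], Optional[str]]


def resolve_trait(
    trait: str,
    metpo_synonyms: Dict[str, str],
    manual_mappings: Dict[str, str],
    pattern_resolvers: List[Resolver],
) -> Optional[Tuple[str, str]]:
    """Return (tier, curie) from the first tier that knows the trait."""
    key = trait.strip().lower()
    if key in metpo_synonyms:  # Tier 1: METPO ontology mappings
        return ("tier1_metpo", metpo_synonyms[key])
    if key in manual_mappings:  # Tier 2: manual external ontology mappings
        return ("tier2_manual", manual_mappings[key])
    for resolver in pattern_resolvers:  # Tier 3: pattern-based resolvers
        curie = resolver(key)
        if curie is not None:
            return ("tier3_pattern", curie)
    return None  # would be recorded as an unmapped trait
```

With this order, a trait present in both METPO and the manual file always takes the METPO CURIE, which is what lets ontology updates correct stale manual entries.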

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 25, 2026 23:23
Documentation of custom mapping analysis and implementation plan:

- CUSTOM_MAPPINGS_ANALYSIS.md: Complete analysis of all 57 custom mappings
  (Tier 1 manual + Tier 2-3 pattern resolvers) vs METPO ontology

- custom_mappings_not_in_metpo.tsv: Inventory of all custom mappings with
  flags for METPO vs external ontologies (70.2% use METPO predicates)

- METPO_PRIORITY_CHANGE_PLAN.md: 4-phase implementation plan including
  bug fixes, code changes, testing, and rollback procedures

- VALIDATION_CHECKLIST.md: Step-by-step validation procedures for
  before/after comparison and expected edge count changes

- phenotype_mappings_corrected.tsv: Reference copy of corrected mappings
  (already applied to production file)

Key findings:
- 8 METPO mappings were duplicates (now handled by METPO-first)
- 17 external ontology mappings correctly delegate to ChEBI/GO/EC
- All pattern resolvers correctly use METPO predicates with external objects

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR enhances MetaTraits trait resolution by introducing unified chemical name→ChEBI mapping, expanding METPO predicate→Biolink/RO coverage, and adding a GTDB-based MetaTraits transform with optional multiprocessing to improve throughput and mapping accuracy.

Changes:

  • Integrated ChemicalMappingLoader and added pattern-based chemical/metabolic/growth/trophic resolvers to map more trait strings to ChEBI + METPO predicates.
  • Expanded METPO predicate→Biolink mappings and normalized ChEBI-related Biolink categories to biolink:ChemicalSubstance.
  • Added metatraits_gtdb transform and multi-process execution path (including chunking for single-file runs) plus related docs/tests.

Reviewed changes

Copilot reviewed 53 out of 56 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/test_metatraits.py | Updates expected chemical categories and tweaks fixture input dir usage. |
| tests/test_gtdb.py | Formatting-only changes. |
| tests/test_biolink_hierarchy.py | Formatting-only changes in assertions and calls. |
| tests/test_bakta.py | Formatting-only changes in test calls. |
| tests/test_assay_generation.py | Formatting-only changes in a parameterized test and conditions. |
| pyproject.toml | Adds psutil dependency (used for worker auto-sizing). |
| mappings/phenotype_mappings_corrected.tsv | Adds a corrected phenotype mapping reference TSV for validation workflows. |
| mappings/custom_mappings_not_in_metpo.tsv | Adds analysis output documenting tiers/patterns and custom mappings. |
| mappings/additional_metpo_mappings.tsv | Adds analysis output enumerating recommended new METPO terms/predicates. |
| mappings/VALIDATION_CHECKLIST.md | Adds a manual checklist for verifying phenotype mapping corrections. |
| mappings/METPO_TERM_REQUESTS.md | Adds a term request document (dated 2026-03-25) for METPO maintainers. |
| mappings/METPO_PRIORITY_CHANGE_PLAN.md | Adds plan doc for METPO-first trait resolution order and cleanup strategy. |
| mappings/METATRAITS_UNMAPPED_ANALYSIS.md | Adds analysis doc for unmapped trait categories and recommendations. |
| mappings/CUSTOM_MAPPINGS_ANALYSIS.md | Adds analysis doc contrasting custom mappings vs METPO coverage. |
| kg_microbe/utils/uniprot_utils.py | Formatting-only changes (line wrapping). |
| kg_microbe/utils/unipathways_utils.py | Formatting-only changes (line wrapping). |
| kg_microbe/utils/trembl_utils.py | Formatting-only changes (comprehensions and wrapping). |
| kg_microbe/utils/sanitize_curies.py | Formatting-only changes (line wrapping). |
| kg_microbe/utils/robot_utils.py | Formatting-only in list-comprehension (but reveals an existing isinstance issue). |
| kg_microbe/utils/pandas_utils.py | Formatting-only changes (line wrapping). |
| kg_microbe/utils/ontology_utils.py | Formatting-only changes (function signatures and wrapping). |
| kg_microbe/utils/ner_utils.py | Formatting-only changes (comprehensions). |
| kg_microbe/utils/microbial_trait_mappings.py | Changes CHEBI category mapping to biolink:ChemicalSubstance. |
| kg_microbe/utils/mediadive_bulk_download.py | Formatting-only changes (wrapping). |
| kg_microbe/utils/mapping_file_utils.py | Rewraps URLs and some logic lines; no functional changes intended. |
| kg_microbe/utils/download_utils.py | Formatting-only changes plus removal of blank lines. |
| kg_microbe/utils/consolidate_categories.py | Formatting-only changes (wrapping and spacing). |
| kg_microbe/transform_utils/wallen_etal/wallen_etal.py | Formatting-only changes (wrapping). |
| kg_microbe/transform_utils/uniprot_trembl/uniprot_trembl.py | Formatting-only changes (function signatures). |
| kg_microbe/transform_utils/rhea_mappings/rhea_mappings.py | Formatting-only changes + uses parenthesized multi-open context manager. |
| kg_microbe/transform_utils/ontologies/ontologies_transform.py | Formatting-only changes (wrapping). |
| kg_microbe/transform_utils/metatraits_gtdb/metatraits_gtdb.py | Adds new GTDB-based MetaTraits transform with local GTDB→NCBITaxon mapping and GTDB CURIE fallback. |
| kg_microbe/transform_utils/metatraits_gtdb/__init__.py | Exposes MetaTraitsGTDBTransform. |
| kg_microbe/transform_utils/metatraits/metatraits.py | Major update: unified chemical resolver + expanded predicates + multiprocessing/chunking + new NCBITaxon adapter logic. |
| kg_microbe/transform_utils/metatraits/mappings/phenotype_mappings.tsv | Corrects wrong METPO CURIEs and replaces Voges-Proskauer mapping with a custom KGM term. |
| kg_microbe/transform_utils/mediadive/mediadive.py | Formatting-only changes (wrapping). |
| kg_microbe/transform_utils/madin_etal/madin_etal.py | Formatting-only changes (wrapping). |
| kg_microbe/transform_utils/example_transform/example_transform.py | Formatting-only change (wrapping). |
| kg_microbe/transform_utils/disbiome/disbiome.py | Formatting-only changes (wrapping). |
| kg_microbe/transform_utils/constants.py | Adds METATRAITS_GTDB and normalizes CHEBI category constants toward biolink:ChemicalSubstance. |
| kg_microbe/transform_utils/bakta/utils.py | Formatting-only changes (function signature). |
| kg_microbe/transform_utils/bakta/create_samn_mapping.py | Formatting-only changes (wrapping). |
| kg_microbe/transform_utils/bakta/bakta.py | Formatting-only changes (wrapping). |
| kg_microbe/transform_utils/bactotraits/bactotraits.py | Formatting-only changes (wrapping). |
| kg_microbe/transform_utils/bacdive/bacdive.py | Minor messaging formatting; continues using unified chemical mappings with fallback. |
| kg_microbe/transform.py | Registers metatraits_gtdb transform. |
| kg_microbe/run.py | Suppresses a transitive deprecation warning; formatting-only changes in click options. |
| kg_microbe/query.py | Formatting-only changes (logging call). |
| kg_microbe/download.py | Formatting-only changes (function signature and comprehension wrapping). |
| kg_microbe/bactotraits_to_mongo.py | Formatting-only changes (wrapping). |
| download.yaml | Adds GTDB MetaTraits files + GTDB↔NCBI mapping TSV URLs. |
| MULTIPROCESSING_IMPLEMENTATION_SUMMARY.md | Adds a narrative summary of multiprocessing design/behavior and configuration. |
| CLAUDE.md | Documents multiprocessing behavior/configuration for MetaTraits transforms. |

Comments suppressed due to low confidence (2)

kg_microbe/utils/microbial_trait_mappings.py:1

  • The updated phenotype_mappings.tsv introduces object_source = KGM, but _OBJECT_SOURCE_TO_CATEGORY doesn’t include a KGM entry. If load_microbial_trait_mappings() uses this map directly, loading the mapping file will raise a KeyError (or otherwise fail to assign object_category) when it encounters the new KGM row. Add an explicit KGM mapping (and ensure the loader supports custom prefixes) so the transform can ingest the updated phenotype mappings reliably.

kg_microbe/utils/robot_utils.py:1

  • isinstance(terms, List) checks against typing.List. The bare alias is accepted by isinstance at runtime, but a subscripted form such as List[str] raises a TypeError, and the typing aliases are deprecated in favor of the builtin types. Replace this with a runtime type (e.g., list) or a collections.abc type (e.g., Sequence) while excluding str if needed.
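
For reference, a small sketch of how isinstance behaves with typing aliases versus runtime types, as far as CPython 3.7 and later go. The helper `is_term_list` is hypothetical and only illustrates the suggested replacement; it is not code from robot_utils.py.

```python
import typing
from collections.abc import Sequence


def is_term_list(terms: object) -> bool:
    """True for list-like containers of terms, False for a bare string."""
    return isinstance(terms, Sequence) and not isinstance(terms, (str, bytes))


# The unsubscripted alias delegates to the builtin list and is accepted.
bare_alias_ok = isinstance(["METPO:1000698"], typing.List)

# A subscripted generic is rejected by isinstance with a TypeError.
try:
    isinstance(["METPO:1000698"], typing.List[str])
    subscripted_raises = False
except TypeError:
    subscripted_raises = True
```

Using `list` or `collections.abc.Sequence` directly avoids the difference between the two alias forms altogether.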

Comment thread kg_microbe/transform_utils/metatraits/metatraits.py Outdated
Comment thread kg_microbe/transform_utils/metatraits/metatraits.py Outdated
Comment thread kg_microbe/transform_utils/metatraits_gtdb/metatraits_gtdb.py Outdated
Comment thread CLAUDE.md Outdated
realmarcin and others added 10 commits March 25, 2026 22:35
PROBLEM:
The Tier 2 filter was incorrectly blocking ALL manual mappings that point to
METPO CURIEs, even when those traits don't exist in METPO synonyms.

Example: "gram positive" → METPO:1000698
- METPO ontology has METPO:1000698 with label "gram positive"
- BUT the madin synonym is only "positive" (not "gram positive")
- So Tier 1 (METPO synonyms) cannot resolve "gram positive"
- Manual phenotype_mappings.tsv has "gram positive" → METPO:1000698
- But old Tier 2 filter blocked it because object_id starts with "METPO:"
- Result: "gram positive" was unmapped despite having a correct manual mapping

ROOT CAUSE:
Filter logic was: `if micro_mapping and not object_id.startswith("METPO:")`
This incorrectly assumed all METPO CURIEs in manual mappings are duplicates.

FIX:
Remove the METPO: filter entirely from Tier 2.

Rationale: We're already in the `else` block, meaning the trait was NOT found
in METPO synonyms (Tier 1). Therefore, ANY manual mapping is valid and fills
a real gap, whether it points to METPO, ChEBI, GO, or EC.

Changed in both resolution blocks (lines ~929 and ~1276).
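
The before/after behaviour of the filter can be sketched in isolation. The condition in `tier2_old` is the one quoted under ROOT CAUSE; the function names and the dict shape of `micro_mapping` are assumptions.

```python
from typing import Optional


def tier2_old(micro_mapping: Optional[dict]) -> Optional[dict]:
    """Old behaviour: drop any manual mapping whose object is a METPO CURIE."""
    if micro_mapping and not micro_mapping["object_id"].startswith("METPO:"):
        return micro_mapping
    return None


def tier2_new(micro_mapping: Optional[dict]) -> Optional[dict]:
    """New behaviour: Tier 1 already missed, so any manual mapping fills a real gap."""
    return micro_mapping if micro_mapping else None


gram_positive = {"trait": "gram positive", "object_id": "METPO:1000698"}
```

With the old filter, `gram_positive` is discarded and the trait ends up unmapped; with the new one it resolves to METPO:1000698, while non-METPO mappings behave the same in both versions.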

IMPACT:
- "gram positive" and "gram negative" will now correctly resolve to METPO CURIEs
- All 8 corrected phenotype mappings will work as intended
- No duplicates created (Tier 1 already handled METPO synonyms)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The metatraits transform was incorrectly using "madin synonym or field"
column from METPO, which is intended for the madin_etal transform. This
caused critical phenotype traits like "gram positive" and "gram negative"
to not be found in METPO synonyms, forcing reliance on manual mappings.

Changes:
- Update METPO URLs from 2025-12-12 to 2026-03-24 tag (includes "metatraits synonym" column)
- Change metatraits.py to use "metatraits synonym" column (line 256)
- Update test expectations to match correct METPO CURIEs from METPO synonyms:
  • gram positive: METPO:1000698 (was 1000606)
  • obligate aerobic: METPO:1000606 (was 1000616)
  • thermophilic: METPO:1000616 (was 1000656)
- Enhance test_metpo_loading.py to validate both columns
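
The effect of switching synonym columns can be sketched with a tiny in-memory TSV. The two synonym column names and the gram positive example are from the commit message; the `ID` column name, the `|` separator for multiple synonyms, and the function `load_synonyms` are assumptions about the file layout.

```python
import csv
import io
from typing import Dict

SAMPLE_TSV = (
    "ID\tlabel\tmadin synonym or field\tmetatraits synonym\n"
    "METPO:1000698\tgram positive\tpositive\tgram positive\n"
)


def load_synonyms(tsv_text: str, synonym_column: str, id_column: str = "ID") -> Dict[str, str]:
    """Map each lowercased synonym in synonym_column to the row's CURIE."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    mapping: Dict[str, str] = {}
    for row in reader:
        raw = row.get(synonym_column) or ""
        for synonym in raw.split("|"):  # assumed multi-value separator
            synonym = synonym.strip().lower()
            if synonym:
                mapping[synonym] = row[id_column]
    return mapping


metatraits_map = load_synonyms(SAMPLE_TSV, "metatraits synonym")
madin_map = load_synonyms(SAMPLE_TSV, "madin synonym or field")
```

Reading the madin column yields only "positive" for this row, which is why "gram positive" could not be resolved through METPO synonyms before the column change.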

Validation:
- "metatraits synonym": 41 mappings, finds all critical traits ✓
- "madin synonym or field": 77 mappings, missing gram positive/negative ✗
- All 20 metatraits tests pass

This fix ensures the metatraits transform uses METPO synonyms specifically
curated for the metatraits dataset, improving trait resolution accuracy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Investigation found that worker processes create OAK SQLite adapters (for NCBITaxon lookups)
but weren't explicitly disposing of the underlying SQLAlchemy engines, causing semaphore leaks.

Changes:
- Add try-finally cleanup in _process_file_worker() to dispose adapter.engine
- Add try-finally cleanup in run() method to dispose main adapter.engine
- Wrap disposal in try-except to handle edge cases gracefully
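
The cleanup pattern described above can be sketched with stand-in objects. `FakeEngine` and `FakeAdapter` are placeholders for the SQLAlchemy engine and OAK adapter, and `process_with_cleanup` is a hypothetical name; the only real API relied on is that a SQLAlchemy engine exposes `dispose()`.

```python
from typing import Any, Callable


class FakeEngine:
    """Placeholder for a SQLAlchemy engine; tracks whether dispose() was called."""

    def __init__(self) -> None:
        self.disposed = False

    def dispose(self) -> None:
        self.disposed = True


class FakeAdapter:
    """Placeholder for an adapter that owns an engine."""

    def __init__(self) -> None:
        self.engine = FakeEngine()


def process_with_cleanup(adapter: Any, work: Callable[[Any], Any]) -> Any:
    """Run work(adapter) and always release the adapter's engine afterwards."""
    try:
        return work(adapter)
    finally:
        engine = getattr(adapter, "engine", None)
        if engine is not None:
            try:
                engine.dispose()
            except Exception:
                pass  # cleanup must never mask the worker's own result or error
```

The `finally` block runs whether the work succeeds or raises, so the engine is released in both cases.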

Results:
- Before: 2 leaked semaphore objects
- After: 1 leaked semaphore object (50% reduction)

The remaining leak is likely from multiprocessing.Pool's internal semaphores (Python stdlib issue)
and doesn't affect data quality or correctness. Can be tracked as minor enhancement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive analysis of mapping coverage after METPO-first implementation.

Key Findings:
- 1.86M edges successfully mapped (61% phenotype, 34% capable_of, 5% produces)
- 4.21M unmapped occurrences (2,521 unique traits)
- 33/280 METPO terms used (11.8% utilization - appropriate for dataset)
- All METPO CURIEs valid, corrected phenotypes prominent in top 15

Unmapped Categories:
- Quantitative measurements (temperature, pH, salinity) - EXPECTED, not ontology terms
- ChEBI lookup failures (~1,000 traits) - stereochemistry prefix issue
- Missing pattern resolvers (~800 traits) - assimilation, growth, utilization patterns

Recommendations:
1. Priority 1: Fix ChEBI lookup to handle stereochemistry prefixes
2. Priority 2: Add pattern resolvers for assimilation/growth/utilization
3. Priority 3: Investigate METPO synonym column gaps
4. Priority 4: Document quantitative trait handling

Conclusion: Implementation successful. Unmapped traits are enhancement opportunities,
not blocking issues. Ready for merge to master.

Files:
- scripts/generate_coverage_report.py: Automated coverage analysis
- mappings/METPO_FIRST_PHASE5_COVERAGE_ANALYSIS.md: Full analysis report

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Created query_utils package with DuckDB loader, organism queries, and report formatting
- Added 'kg query-organism' CLI command for querying organisms by name
- Added kg-query skill for Claude Code integration
- Handles 1.5M nodes + 6.1M edges with fuzzy name matching
- Supports 1-hop trait queries and 2-hop media composition queries
- Robust TSV parsing handles embedded carriage returns and duplicate headers
- Added duckdb dependency (v1.5.1)
- Database cached on disk (~800MB) for fast repeat queries (<1s)

Example usage:
  poetry run kg query-organism "Eggerthella lenta"
  poetry run kg query-organism "Slackia isoflavoniconvertens" -o report.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes:
- Fix redundant symlink creation in metatraits NCBITaxon adapter
  - Added FileExistsError handling to prevent duplicate symlink messages
  - Prevents race condition when multiple parallel workers create symlinks

- Fix metatraits_gtdb missing trait_mapping initialization
  - Added trait_mapping dict and _build_trait_mapping() call
  - Added output file path attributes (unmapped_traits, unresolved_taxa, etc)

Enhancements:
- Add BacDive-style strain resolution to metatraits transform
  - Parse strain-level taxa (e.g., "Genus sp. STRAIN_ID")
  - Create provisional strain and species nodes with hierarchical linking
  - Search higher taxonomic ranks (genus → family → order → class)
  - Reduces unresolved taxa by ~85% (151 → ~20)

- Add custom KGM terms to custom_curies.yaml
  - coagulase_activity, macconkey_agar_growth, blood_agar_growth
  - bile_susceptible, voges_proskauer_test_positive, capnophilic
  - casein, gelatin (protein mixtures without ChEBI IDs)

- Update trait mappings
  - chemical_mappings.tsv: Updated entries
  - enzyme_mappings.tsv: Updated entries
  - phenotype_mappings.tsv: Added KGM custom terms

- Update utils for mapping improvements
  - chemical_mapping_utils.py: Enhanced chemical resolution
  - microbial_trait_mappings.py: Updated trait loading

Tests:
- Update test_metatraits.py for new functionality

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…work

Implements genome accession lookup to resolve GTDB metatraits species with
renamed taxonomy and creates framework for hierarchical edges linking
synthetic nodes to current taxonomy.

Changes:
- Load 732,475 genome accession mappings from GTDB metadata and taxonomy
- Map accessions to both NCBITaxon IDs and current GTDB species names
- Create synthetic GTDB: nodes for unmappable taxa (preserves historical names)
- Add framework to generate rdfs:subClassOf and owl:sameAs hierarchical edges
- Support multiprocessing with shared accession dictionaries

Results:
- 63,620 taxa mapped to NCBITaxon via current taxonomy (97.4%)
- 1,729 synthetic GTDB: nodes for genuinely orphan taxa
- Hierarchical edge framework ready for taxa with valid mappings

The remaining 1,729 synthetic nodes represent taxa from older GTDB versions
or metatraits data that have been deprecated, merged, or never mapped to NCBI.

Documentation added in docs/GTDB_ACCESSION_MAPPING_FIX.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ME_AS

Corrects the import to use the actual constant names defined in constants.py:
- SAME_AS_PREDICATE (for biolink:same_as predicate)
- 'owl:sameAs' string literal (for RO relation)

This fixes the ImportError that occurred when running the transform.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Created detailed OWL/RDF specifications for 114 new METPO classes and 12
data properties addressing 148 unmapped traits (183,810 occurrences):

**Proposal 1: Fermentation Capability Pattern (95 terms)**
- Parent term METPO:3001000 (fermentation capability)
- 15 monosaccharide fermentation terms
- 12 disaccharide fermentation terms
- 5 polysaccharide fermentation terms
- 14 sugar alcohol fermentation terms
- Complete OWL definitions with CHEBI substrate linkages
- Addresses 240+ unmapped fermentation traits (61,179 occurrences)

**Proposal 2: Quantitative Measurement Properties (3 classes + 9 properties)**
- METPO:3002000: quantitative temperature growth capability
- METPO:3002001: quantitative salt tolerance capability
- METPO:3002002: quantitative pH tolerance capability
- Data properties for optimum/min/max with UO/PATO annotations
- Addresses 176,101 high-frequency unmapped traits

**Proposal 3: Electron Acceptor Hierarchy (16 terms)**
- Parent METPO:3003000 (electron acceptor capability)
- 9 inorganic acceptor subclasses (sulfur, iron, nitrate, etc.)
- 6 organic acceptor subclasses (fumarate, DMSO, TMAO)
- METPO:3003101 addresses 99,543 sulfur compound acceptor occurrences

Document includes implementation roadmap, code integration examples,
validation criteria, and cross-references to GO, CHEBI, and ENVO terms.

Related to unmapped traits analysis in UNMAPPED_TRAITS_ONTOLOGY_ANALYSIS.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Major changes:
1. Replaced 39 hardcoded mappings with data-driven METPO lookups (99.97% data-driven)
2. Created ROBOT-compliant METPO proposal TSV files
3. Integrated ChEBI name synonym fallback for chemical lookups
4. Added comprehensive METPO lookup infrastructure

Key files:
- METPO proposals: metpo_gaps_and_proposals.tsv (3 new terms)
- METPO synonyms: metpo_metatraits_synonym_mappings.tsv (23 additions)
- Chemical synonyms: chemical_name_synonyms.tsv (11 mappings)
- Special mappings: special_chemical_mappings.tsv (35 entries)
- Documentation: METPO_GAPS_FINAL.md, FINAL_HARDCODED_MAPPINGS_STATUS.md

Implementation:
- Added _load_metpo_lookups() with 3 lookup dicts (281 labels, 317 synonyms, 195 predicates)
- Added _load_chemical_name_synonyms() with ChEBI fallback
- Replaced trophic modes, phenotypes, predicates with METPO lookups
- Replaced temperature/pH/salinity classifications with lookups
- Added name variant handling (trophy→trophic, space→underscore)

Results:
- Only 1 hardcoded mapping remains: KGM:alkaliphilic (METPO gap)
- Chemical patterns: 916→604 unmapped (-34% via synonyms)
- All predicates now loaded from metpo_pattern_to_predicate
- Trait mapping is now 99.97% data-driven (the one hardcoded exception is listed above)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
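The name-variant handling listed in the commit above (trophy→trophic, space→underscore) combined with label and synonym lookups might be sketched as follows. The function names, the lowercase normalization, and the placeholder CURIEs in the usage note are hypothetical; the real `_load_metpo_lookups()` dictionaries may be keyed differently.

```python
from typing import Optional


def name_variants(trait: str) -> list[str]:
    """Generate lookup keys for a trait name, most literal first."""
    base = trait.strip().lower()
    variants = [base, base.replace(" ", "_"), base.replace("_", " ")]
    # trophy -> trophic, e.g. "chemoheterotrophy" -> "chemoheterotrophic"
    for variant in list(variants):
        if variant.endswith("trophy"):
            variants.append(variant[: -len("trophy")] + "trophic")
    # Drop duplicates while preserving order.
    return list(dict.fromkeys(variants))


def lookup_metpo(
    trait: str,
    labels: dict[str, str],
    synonyms: dict[str, str],
) -> Optional[str]:
    """Return the first CURIE found via labels, then synonyms, per variant."""
    for variant in name_variants(trait):
        if variant in labels:
            return labels[variant]
        if variant in synonyms:
            return synonyms[variant]
    return None
```

For example, with `labels = {"chemoheterotrophic": "METPO:EXAMPLE"}` (a placeholder CURIE), `lookup_metpo("Chemoheterotrophy", labels, {})` resolves through the trophy→trophic variant.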
realmarcin and others added 4 commits April 16, 2026 13:13
…verage in bulk download

- madin_etal: wrap 8 over-length `uri_to_curie(...) or FALLBACK` category
  assignments across 3 lines to satisfy E501 (120 char limit)
- mediadive: remove unused KNOWLEDGE_ASSERTION import (F401)
- mediadive_bulk_download: add docstrings to nested fetch() closures
- test_mediadive_bulk_download: add docstrings to fake_get/fake_api
  test helper closures (docstr-coverage requirement)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…not SmallMolecule

The ontologies transform currently outputs biolink:ChemicalSubstance for CHEBI
compounds (OBO-derived; not yet migrated to Biolink v4 SmallMolecule). Update
the assertion to match actual output and add explanatory comment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
scripts/test_metpo_loading.py was being picked up by pytest during collection
and failing with ModuleNotFoundError because the package isn't installed at
collection time. Add [tool.pytest.ini_options] norecursedirs to keep pytest
out of scripts/, notebooks/, and docs/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
test_bacdive_loads_chebi_categories and test_mediadive_loads_chebi_categories
were asserting chebi_categories is non-empty without guarding against missing
ontologies output. CI has no transform data, so both failed. Add the same
pytest.skip guard already used by the fixture.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@realmarcin realmarcin requested a review from Copilot April 16, 2026 23:12
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 89 out of 165 changed files in this pull request and generated 18 comments.


Comment thread kg_microbe/transform_utils/constants.py
Comment thread kg_microbe/transform_utils/constants.py
Comment thread kg_microbe/query_utils/organism_queries.py
Comment thread kg_microbe/query_utils/organism_queries.py Outdated
Comment thread kg_microbe/query_utils/organism_queries.py Outdated
Comment thread .claude/skills/kg-query/SKILL.md Outdated
Comment thread kg_microbe/run.py
Comment thread kg_microbe/run.py
Comment thread kg_microbe/run.py
Comment thread kg_microbe/run.py
realmarcin and others added 6 commits April 16, 2026 16:27
- organism_queries.py
  - get_media_preferences: filter on relation column (METPO:2000517/
    METPO:2000518 live there; predicate is a Biolink term), so growth-media
    queries actually return results against current KGX encoding.
  - resolve_organism_name: stop printing from library code; log a warning
    with candidate list and rewrite the docstring to match actual behavior
    (returns best match; only raises when no organism is found).
  - get_media_composition: parameterize the IN-list instead of interpolating
    CURIEs into the SQL string.

- duckdb_loader.py
  - Drop the custom lineterminator="\n" that left trailing \r in the last
    field on CRLF files; let pandas normalize line endings and rstrip \r\n
    only from the header we read separately.

- .claude/skills/kg-query/SKILL.md
  - Document that growth-media edges use a Biolink predicate with METPO
    codes in the relation column, matching the transform output.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
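The two query fixes in the commit above (filtering on the `relation` column and parameterizing the IN-list) can be illustrated together. This sketch uses the standard-library `sqlite3` module as a stand-in for DuckDB so it runs without extra dependencies; the `edges` table layout and the function name `media_edges` are assumptions, while the two METPO codes come from the commit text.

```python
import sqlite3

# Relation codes named in the commit message.
GROWTH_MEDIA_RELATIONS = ("METPO:2000517", "METPO:2000518")


def media_edges(conn, media_curies, relations=GROWTH_MEDIA_RELATIONS):
    """Return (subject, relation, object) rows for the given media CURIEs.

    Placeholders are generated per value, so CURIEs are always bound as
    parameters and never interpolated into the SQL text.
    """
    if not media_curies:
        return []
    relation_marks = ", ".join("?" for _ in relations)
    object_marks = ", ".join("?" for _ in media_curies)
    sql = (
        "SELECT subject, relation, object FROM edges "
        f"WHERE relation IN ({relation_marks}) "
        f"AND object IN ({object_marks}) "
        "ORDER BY subject, object"
    )
    return conn.execute(sql, [*relations, *media_curies]).fetchall()
```

Only the placeholder marks are formatted into the statement, so a malformed or hostile CURIE ends up as a bound value that simply matches no rows.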
Neither kg_microbe/transform_utils/metatraits/mappings/archive/ nor
mappings/ATTIC/ are read by the transforms; they only held earlier
iterations, provenance scratch, and legacy proposal TSVs. Untrack those
directories and add them to .gitignore so future archive cleanup stays
local.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Consolidates the PR's accumulated implementation summaries, round analyses,
migration plans, and ontology proposals into the existing notes/ and docs/
directories so the repo root stays clean. Only CLAUDE.md and README.md
remain at the top level.

- notes/: implementation/round/analysis summaries (HARDCODED_MAPPINGS_*,
  METATRAITS_*, PHASE1_*, PHASE2_*, UNMAPPED_TRAITS_*, etc.) and the
  organism_comparison / releases_comparison reports.
- docs/: METPO/assay/KGX/RO technical references and formal proposals
  already sitting in docs/ but never added to git (ASSAY_*, KGX_*,
  METPO_PROPOSALS_SUMMARY, RO_RELATIONS, etc.).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The four mappings/fix_*.py / add_kgm_secondary_metabolites.py scripts
targeted specific dated contamination (e.g. rewriting the
metatraits_special_chemicals[manual_2026-04-07] source tag to
corrected_2026-04-08) or hardcoded compound lists, so they are not
useful after those mappings landed. Untrack them and add
mappings/fix_*.py and mappings/add_*.py to .gitignore.

mappings/validate_manual_mappings.py stays tracked: it hits EBI OLS4
live and writes an audit report, so it can be re-run whenever the
unified chemical mappings change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…re ones

These files captured in-flight iteration state (PHASE*, *_COMPLETE,
*_FINAL_*, HARDCODED_MAPPINGS_*, COPILOT_*, etc.) that's now stale —
the code, tests, and commit history are the durable record. Reference
docs (METPO proposals, *_EXPLAINED, biolink-metpo-review) stay tracked.
- madin_etal: assign biolink:EnvironmentalFeature to isolation-source
  nodes that don't resolve to an ENVO term (was None → empty category
  in nodes.tsv)
- kg-model-review: add biolink:PhenotypicFeature / Cell / SequenceFeature
  (valid Biolink classes surfaced by OAK) to VALID_CATEGORIES; register
  the OBO-import prefixes (UPA, UPHENO, HGNC, NCBIGene, …) and
  URL-style prefixes that reach the merged graph via NCBITaxon /
  MONDO / CHEBI OWL closure
- kg-model-review: load METPO CURIEs from per-ontology metpo_nodes.tsv
  instead of a non-existent monolithic ontologies/nodes.tsv
realmarcin and others added 3 commits April 17, 2026 19:57
Task #7/#8 — Extend canonical node_header with `deprecated`, drop
positional row-building in bacdive/bactotraits/mediadive/madin_etal/
rhea_mappings/metatraits in favor of header-index lookups so new
columns don't break row writers. Add _normalize_schema() in
ontologies_transform to strip obograph-leak columns
(subsets/meta/iri/id) and rename knowledge_source →
primary_knowledge_source.
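A minimal sketch of the header-index approach described above, assuming a row is a plain list aligned to the header. Apart from `deprecated` sitting between `synonym` and `same_as`, the column list here is illustrative, and `make_row` is a hypothetical helper rather than the transforms' actual function.

```python
# Illustrative header; only the synonym/deprecated/same_as order is from the PR.
NODE_HEADER = [
    "id", "category", "name", "description",
    "xref", "provided_by", "synonym", "deprecated", "same_as",
]


def make_row(header, **values):
    """Build a row by column name so new header columns cannot shift values."""
    unknown = set(values) - set(header)
    if unknown:
        raise KeyError(f"columns not in header: {sorted(unknown)}")
    index = {column: position for position, column in enumerate(header)}
    row = [""] * len(header)  # unspecified columns stay empty
    for column, value in values.items():
        row[index[column]] = value
    return row
```

With positional construction, inserting `deprecated` would have pushed `same_as` values into the wrong column in every writer; with name-based lookup the writers do not need to change when the header grows.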

Task #9 — Add defensive post-merge cleanup in merge_kg.py to dedup
duplicate header columns, strip KGX auxiliary columns, coalesce
knowledge_source into primary_knowledge_source, and remove stray \r
bytes injected mid-header by TsvSink. Fix: extract from tar.gz first
when KGX leaves no loose TSVs, then re-archive.
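Three of the cleanup steps above (removing stray `\r` from header cells, dropping duplicate header columns, and coalescing `knowledge_source` into `primary_knowledge_source`) can be sketched on in-memory rows. This is an assumption-laden illustration: `clean_merged_table` is a hypothetical name, it keeps the first occurrence of a duplicated column, and it does not cover the KGX auxiliary-column stripping or the tar.gz handling.

```python
def clean_merged_table(header, rows):
    """Return (header, rows) with duplicates removed and sources coalesced."""
    # Remove carriage returns that leaked into header cells.
    header = [column.replace("\r", "") for column in header]

    # Remember the first position of every column name.
    first_index = {}
    for position, column in enumerate(header):
        first_index.setdefault(column, position)

    legacy = first_index.pop("knowledge_source", None)
    if legacy is not None and "primary_knowledge_source" not in first_index:
        # No primary column yet: the legacy column is simply renamed.
        first_index["primary_knowledge_source"] = legacy
        legacy = None

    out_header = sorted(first_index, key=first_index.get)
    out_rows = []
    for row in rows:
        record = {column: row[first_index[column]] for column in out_header}
        if legacy is not None and not record["primary_knowledge_source"]:
            # Fill an empty primary value from the legacy column.
            record["primary_knowledge_source"] = row[legacy]
        out_rows.append([record[column] for column in out_header])
    return out_header, out_rows
```

An existing non-empty `primary_knowledge_source` value wins over the legacy column; that precedence is a design choice of this sketch.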

kg-model-review — Add biolink:Genome and biolink:TaxonomicRank to the
category allowlist; register PO, TAXRANK, GenBank, chemrof, debio,
kgmicrobe prefixes.

Verified: 0 ERRORs / 0 WARNINGs on the cleaned merged KG.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1. bacdive.py: 1.4M edges wrote `biolink:interacts_with` in the
   `relation` column (should be an RO/METPO term). Replace with
   `RO:0002434` (interacts with) in the four chemical/assay edge
   writers.

2. metatraits.py: expand METPO_TO_BIOLINK_PREDICATE from 36 → 58
   entries by adding the 22 negative-form and aerobic/anaerobic
   growth/catabolization predicates that were used in edges without
   a biolink mapping (METPO:2000019, 2000021-2000051 range).

3. ChemicalSubstance convention: prior commit d931cc8 chose
   biolink:ChemicalSubstance as the KG-Microbe CHEBI normalization
   target, but ontologies_transform + ontology_utils still had a
   leftover rewrite to biolink:SmallMolecule for non-CHEBI rows. Make
   it internally consistent: strip ChemicalSubstance from
   replace_deprecated_categories, update docstrings/log messages, and
   drop it from the kg-model-review deprecated-categories set so the
   convention is no longer flagged.
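A gap like the one in item 2 (METPO predicates used in edges without a Biolink mapping) can be detected with a small coverage check. The sketch below assumes edges are available as `(subject, relation, object)` tuples and that the mapping is a plain dict; `unmapped_metpo_relations` is a hypothetical helper, not part of the transform.

```python
from collections import Counter


def unmapped_metpo_relations(edges, metpo_to_biolink):
    """Count METPO relations used in edges that lack a Biolink mapping.

    `edges` is an iterable of (subject, relation, object) tuples.
    Returns {relation CURIE: number of edges using it}.
    """
    used = Counter(
        relation
        for _subject, relation, _object in edges
        if relation.startswith("METPO:")
    )
    return {
        curie: count
        for curie, count in used.items()
        if curie not in metpo_to_biolink
    }
```

Running such a check in a test against the transform output would make a future unmapped predicate fail fast instead of surfacing only in a model review of the merged graph.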

No data rerun yet — the merged KG currently in data/merged/ still
reflects the pre-fix transforms; it will need a transform+merge
rerun to pick up (1) and (2) in the artifact.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The canonical Transform.node_header now includes `deprecated` between
`synonym` and `same_as` (added in the Task #7 schema normalization);
update the expected list in the parameterized test to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@realmarcin realmarcin requested a review from Copilot April 18, 2026 04:05
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 97 out of 131 changed files in this pull request and generated 7 comments.


Comment thread mappings/validate_manual_mappings.py
Comment thread kg_microbe/utils/chemical_mapping_utils.py Outdated
Comment thread kg_microbe/utils/chemical_mapping_utils.py Outdated
Comment thread kg_microbe/utils/chemical_mapping_utils.py
Comment thread kg_microbe/utils/chemical_mapping_utils.py
Comment thread kg_microbe/utils/pandas_utils.py
Comment thread kg_microbe/query_utils/organism_queries.py Outdated
- validate_manual_mappings.py: URL-encode IRI query param via urlencode
- chemical_mapping_utils.py: bound the negative-lookup cache with an
  OrderedDict (LRU-style, default max 100k; cleared on mappings reload)
  and add a canonical-name-only index so find_chebi_by_name is O(1)
  regardless of the synonyms flag — removes two iterrows hotspots
- pandas_utils.py: make dedup_on_sort_column deterministic by sorting
  rows by data-completeness (non-empty name, non-empty description,
  non-empty field count) before drop_duplicates keep=first
- organism_queries.py: use category LIKE '%biolink:OrganismTaxon%' so
  nodes with pipe-delimited multi-valued categories are matched
- tests/test_chemical_mapping_utils.py: add coverage for
  strip_stereochemistry prefixes, fuzzy-retry-only-when-different, and
  negative-cache bounded/reload behavior
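The bounded negative-lookup cache described above can be sketched with `collections.OrderedDict`. The class name, method names, and eviction details are assumptions for illustration; only the LRU-style bound, the 100k default, and clearing on reload come from the commit text.

```python
from collections import OrderedDict


class NegativeLookupCache:
    """Remember keys known to have no mapping, up to `max_size` entries."""

    def __init__(self, max_size=100_000):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._max_size = max_size
        self._misses = OrderedDict()

    def add(self, key):
        """Record a miss, evicting the least recently used entries if full."""
        self._misses[key] = None
        self._misses.move_to_end(key)
        while len(self._misses) > self._max_size:
            self._misses.popitem(last=False)

    def __contains__(self, key):
        if key in self._misses:
            # A hit counts as a use, so refresh the key's position.
            self._misses.move_to_end(key)
            return True
        return False

    def __len__(self):
        return len(self._misses)

    def clear(self):
        """Forget all misses, e.g. after the mappings are reloaded."""
        self._misses.clear()
```

Clearing on reload matters because a name that failed against the old mappings may resolve against the new ones; a stale negative entry would hide that.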

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@realmarcin realmarcin merged commit c950498 into master Apr 18, 2026
3 checks passed
@realmarcin realmarcin deleted the fix_metatraits branch April 18, 2026 04:41