@bruAristimunha (Collaborator) commented Jan 21, 2026

Summary

  • Replace dataset categorization with tag structure and update docs/summary tables; use computed titles from API.
  • Add dataset feedback section with GitHub issue button and doc build fixes/parallelization.
  • Improve chart rendering (ridgeline modality toggle, clinical breakdown colors, growth/scale fixes, DOI date fallback).
  • Add/refresh ingestion and parsing utilities (VHDR/SNIRF/MEF3 parsers, validation checks, MEG metadata fixes; consolidated plot_dataset helpers).
  • Standardize bids_relpath as canonical key and improve dataset utilities (registry caching gate, download of dataset-level files).
  • Align ntimes to sample counts, derive it from sidecar duration and sfreq, and update metadata duration math and tests.
  • Fix complexity SVD entropy for numpy versions without np.linalg.svdvals.
  • Refactors/maintenance: remove deprecated Paradigm, pin pandas<2.2, pre-commit and PEP8/nested-function cleanups, color mapping cleanup, and revert NEMAR CLI support to keep a GitHub-only approach.
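
The SVD entropy fix above concerns NumPy versions that do not provide `np.linalg.svdvals`. A minimal sketch of one possible fallback follows. It is not the code from this PR: the names `singular_values` and `svd_entropy` are hypothetical, and the entropy definition (Shannon entropy in bits of the normalized singular values) is an assumption made for illustration.

```python
import numpy as np


def singular_values(matrix):
    """Return singular values, using np.linalg.svdvals only when it exists."""
    svdvals = getattr(np.linalg, "svdvals", None)
    if svdvals is not None:
        return svdvals(matrix)
    # Fallback for NumPy versions without svdvals:
    # compute_uv=False makes svd return only the singular values.
    return np.linalg.svd(matrix, compute_uv=False)


def svd_entropy(matrix):
    """Shannon entropy (in bits) of the normalized singular values (assumed definition)."""
    s = singular_values(np.asarray(matrix, dtype=float))
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so log2 is defined
    return float(-(p * np.log2(p)).sum())
```

Looking the function up with `getattr` keeps a single code path for callers and avoids comparing version strings.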

Test Plan

  • NUMBA_DISABLE_JIT=1 python -m pytest tests/unit_tests/dataset/test_base.py tests/unit_tests/dataset/test_bids_dataset.py tests/unit_tests/dataset/test_dataset.py tests/unit_tests/dataset/test_dataset_maturity.py (111 passed, 1 skipped)
  • NUMBA_DISABLE_JIT=1 python -m pytest tests/unit_tests/features/feature_bank/test_complexity.py (9 passed)
  • NUMBA_DISABLE_JIT=1 python -m pytest tests/integration/test_benchmarks.py::TestMetadataAccessPerformance::test_num_times_query (skipped: dataset missing at .eegdash_cache/ds005509-bdf-mini)

bruAristimunha and others added 30 commits January 8, 2026 13:54
Co-authored-by: Arnaud Delorme <arnodelorme@gmail.com>
Co-authored-by: Kkuntal990 <kokatekuntal@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…d/html

- Fix artifact upload path
- Fix surge preview deployment path
- Fix gh-pages deployment for both main and develop branches
- Resolves documentation deployment failure
- Added interactive growth plots with consistent modality colors
- Added clinical breakdown stacked bar charts
- Updated dataset summary page with new tabs
- Added color definitions for new modalities (EEG, fNIRS, etc)
- Fixed linting issues (E701, F841)
- Add BIDS inheritance fallback for MEG sidecar files (searches parent
  directories and handles run/acq entity variations)
- Prefer channel_labels count from channels.tsv over partial sidecar counts
- Add CTF .ds directory detection for has_actual_files check
- Fix bug where 'name' variable was overwritten in root file scan loop
- Add data quality validation for nchans and sampling_frequency fields
- Make validation run automatically before injection (--skip-validation to bypass)
- Add --data-quality-threshold parameter to control acceptable missing data percentage
- Create _validate.py module for reusable validation functions
- Add data quality statistics to validation summary output
- Track datasets with missing channel counts or sampling frequency
- Add Tags TypedDict with pathology, modality, and type fields
- Update Dataset TypedDict to include tags field
- Update create_dataset function to accept tags_* parameters
- Update registry to consume tags from API with fallback to legacy clinical/paradigm
- Deprecate clinical and paradigm fields in favor of tags
- Update main_from_json to read from tags.pathology, tags.modality, tags.type
- Add fallback to legacy clinical/paradigm fields for backwards compatibility
- Update prepare_table to handle direct tags columns from API
- Add Description section with README content to dataset pages
- Convert markdown/RST headers to bold text to avoid document structure issues
- Handle box-style headers (em-dashes) in README content
- Use computed_title from API for proper dataset titles
- Use nchans_counts/sfreq_counts from datasets collection
- Add collapsible dropdown for long READMEs (>30 lines)
- Update plot scripts to use new tags structure for dataset classification
- Improve clinical breakdown, growth, sankey, and treemap visualizations
- Update prepare_summary_tables for tags-based data extraction
- Add _nemar.py helper module with CLI subprocess wrappers
- Update nemar.py to use CLI as primary method with GitHub fallback
- Add --use-cli and --use-github flags for method selection
- Update fetch-source.yml workflow with Bun/NEMAR CLI setup options
- Update 1-fetch-nemar.yml to enable CLI installation

The NEMAR CLI (nemar-cli via Bun) provides direct access to the NEMAR API
for listing datasets. GitHub API fallback is maintained for compatibility.
Currently using --skip-bids as GitHub repos don't exist yet.
- Update validate_record to only require bids_relpath (not bidspath)
- Add documentation explaining bidspath is computed from bids_relpath
- Document flatten_entities conflict resolution logic
- Document fingerprint fallback chain (storage.raw_key -> bids_relpath)
- Add toggle buttons to switch between Experiment Modality and Recording Modality views
- Extract trace building logic into reusable _build_ridgeline_traces function
- Add primary_recording_modality utility for canonical recording modality labels
- Filter out datasets with 0 or negative participants to avoid log10 issues
- Convert numpy arrays to Python lists in ridgeline chart to avoid
  Plotly binary encoding issues
- Add resting_state canonical mapping for consistency
- Add distinct colors for all pathology types (epilepsy, depression, alzheimer, etc.)
- Include both title case and lowercase variants for color matching
- Handle NaN/nan strings in modality and population_type columns
- Filter out unknown modalities from the chart
- Fix plot scaling by using explicit CSS pixel heights instead of height: 100%
- Increase plot heights to 1000px for better visibility (growth, clinical_breakdown)
- Make colors consistent across Sankey and Clinical Breakdown charts
- Use PATHOLOGY_PASTEL_OVERRIDES for all clinical conditions
- Improve Unspecified Clinical color visibility (#fda4af)
- Add fetch_chart_data_from_api() for optimized chart generation
- Simplify fetch_datasets_from_api() to single API call (stats embedded)
- Add parallel chart generation in prepare_summary_tables.py
- Add duration_hours_total field for treemap compatibility
- Add .env.example with environment variable documentation
- Add CLAUDE.md with project guide
- Add api_helper.py CLI tool for common API operations
@bruAristimunha bruAristimunha force-pushed the improving-the-description branch from b6f97fa to 1e2f3e2 Compare January 22, 2026 11:33
- Add DOI resolution fallback for publication dates in openneuro.py and nemar.py
- Simplify prepare_summary_tables.py (remove JSON mode, ~700 lines reduced)
- Filter EEG2025* datasets from chart data (HBN competition datasets)
- Fix empty "Type Subject" → "Unknown" in registry.py
- Adjust chart heights and margins for better rendering:
  - Sankey: height 900→1100, margins adjusted
  - Clinical: height 1000→650, bottom margin increased
  - Growth: height 1000→550, bottom margin increased
- growth.py: Move _normalize_modality to module level
- prepare_summary_tables.py: Move _normalise_tag and _strip_unknown to module level
- registry.py: Move _make_dataset_init, _clean_optional, and _clean_or_unknown to module level
Move nested functions to module level across the codebase:

- eegdash/features/inspect.py: Extract _is_feature, _is_feature_extractor,
  _is_feature_preprocessor, _is_feature_kind predicates
- docs/source/conf.py: Extract _stat_line, _is_decorative_line helpers
- docs/plot_dataset/ridgeline.py: Extract _convert_arrays helper
- scripts/create_metadata.py: Extract _s3_size_worker for parallel S3 queries
- scripts/ingestions/5_inject.py: Extract _sanitize_for_json and
  _inject_records_batch for parallel record injection
- scripts/ingestions/_file_utils.py: Extract _fetch_scidb_path and
  _propfind_datarn for recursive file listing

Add pre-commit hook (scripts/check_nested_functions.py) to prevent
nested function definitions in future code.
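
The contents of `scripts/check_nested_functions.py` are not shown in this conversation. As a rough sketch of how such a check could work, assuming it is built on the standard `ast` module (the function name `find_nested_functions` is hypothetical):

```python
import ast


def find_nested_functions(source):
    """Return (name, lineno) for each function defined inside another function."""
    tree = ast.parse(source)
    def_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    found = []
    # Iterative walk that tracks whether we are already inside a function body.
    stack = [(tree, False)]
    while stack:
        node, inside_function = stack.pop()
        for child in ast.iter_child_nodes(node):
            is_def = isinstance(child, def_types)
            if is_def and inside_function:
                found.append((child.name, child.lineno))
            stack.append((child, inside_function or is_def))
    return sorted(found, key=lambda item: item[1])
```

Methods defined directly in a module-level class are not reported, since they are not inside another function. Lambdas are ignored in this sketch.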
Add helper function _create_color_map_with_aliases() that automatically
generates lowercase, UPPERCASE, and Title Case variants for each color key.

Refactor PATHOLOGY_PASTEL_OVERRIDES from 38 lines of manual duplication
to a clean 22-line base dict with auto-generated aliases. This makes
adding new colors a single-line change instead of 2-3 lines.
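
A minimal sketch of the alias idea described above, assuming plain string keys. The helper here is a stand-in for `_create_color_map_with_aliases()`, whose real implementation is not shown in this conversation, and the one-entry base dict is illustrative (only the `#fda4af` value for Unspecified Clinical comes from the commit messages).

```python
def create_color_map_with_aliases(base):
    """Expand each key into lowercase, UPPERCASE, and Title Case aliases."""
    expanded = {}
    for key, color in base.items():
        for alias in (key, key.lower(), key.upper(), key.title()):
            # setdefault keeps the first color seen if two aliases collide.
            expanded.setdefault(alias, color)
    return expanded


# Illustrative base dict, not the real PATHOLOGY_PASTEL_OVERRIDES.
BASE_COLORS = {"Unspecified Clinical": "#fda4af"}
COLORS = create_color_map_with_aliases(BASE_COLORS)
```

With this shape, adding a color is a single new entry in the base dict, and lookups such as `COLORS["unspecified clinical"]` resolve without manual duplication.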
Reverts the NEMAR CLI changes from 7d215d0. Will revisit later.
Move inline imports from inside functions to the top of modules
to follow PEP8 guidelines and improve code clarity.
Add fallback metadata extraction from neurophysiology file formats:
- VHDR parser: Extract sampling_frequency, nchans, ch_names from BrainVision headers
- SNIRF parser: Extract metadata from fNIRS files using MNE or h5py fallback
- MEF3 parser: Extract metadata from MEF3 directory structures

These parsers provide metadata extraction when BIDS sidecar files are missing.

Also includes comprehensive test suite for VHDR parser functionality.
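
For readers unfamiliar with BrainVision headers, a rough sketch of the kind of extraction the VHDR parser performs. This is not the code in `_vhdr_parser.py`: the function name `parse_vhdr_text` is hypothetical, and the sketch assumes the usual INI-like layout with `[Common Infos]` and `[Channel Infos]` sections and `SamplingInterval` given in microseconds.

```python
def parse_vhdr_text(text):
    """Extract sampling_frequency, nchans, and ch_names from VHDR header text."""
    section = None
    common = {}
    ch_names = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if "=" not in line:
            continue  # e.g. the version banner or free text
        key, value = line.split("=", 1)
        if section == "Common Infos":
            common[key.strip()] = value.strip()
        elif section == "Channel Infos":
            # Value looks like "<name>,<reference>,<resolution>,<unit>".
            ch_names.append(value.split(",")[0].strip())
    interval_us = common.get("SamplingInterval")
    sfreq = 1e6 / float(interval_us) if interval_us else None
    if "NumberOfChannels" in common:
        nchans = int(common["NumberOfChannels"])
    else:
        nchans = len(ch_names)  # fall back to counting channel entries
    return {"sampling_frequency": sfreq, "nchans": nchans, "ch_names": ch_names}
```

A line-by-line scan is used instead of `configparser` because the header starts with a version banner and may contain a free-text comment section that an INI parser can reject.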
Remove the Paradigm TypedDict and related parameters from create_dataset.
Use the tags field with modality and type keys instead.
wfdb has a bug with pandas ArrowStringArray that causes import
failures. Pin pandas to <2.2 until wfdb releases a fix.
Add TestSchemaContract class with pytest.parametrize for automatic
testing of all Pydantic/TypedDict pairs:

- PYDANTIC_TYPEDDICT_PAIRS: exact field match (Storage, Entities)
- PYDANTIC_SUBSET_PAIRS: Pydantic subset of TypedDict (Record, Dataset)
- DOCUMENTED_TYPEDDICTS: all TypedDicts must have docstrings
- REMOVED_SCHEMAS: verify deprecated schemas are gone

Also fixes schema mismatch: adds missing ingestion_fingerprint field
to Dataset TypedDict (was only in DatasetModel).

To add new schemas, just add them to the lists and they're auto-tested.
- Add shared utilities to docs/plot_dataset/utils.py:
  - normalize_modality_string() for consistent modality formatting
  - detect_modality_column() for DataFrame column detection
  - read_dataset_csv() for standardized CSV loading
  - build_and_export_html() for Plotly HTML export with styling

- Create scripts/ingestions/_parser_utils.py:
  - validate_file_path() for git-annex symlink handling
  - read_with_encoding_fallback() for multi-encoding text file reading

- Create scripts/ingestions/_constants.py:
  - Centralize MODALITY_CANONICAL_MAP, NEURO_MODALITIES
  - Extract CTF_INTERNAL_EXTENSIONS, MEF3_INTERNAL_EXTENSIONS

- Refactor plot modules to use shared utilities:
  - Remove duplicate _normalize_modality() from growth.py, clinical_breakdown.py
  - Simplify HTML export in bubble.py, ridgeline.py, treemap.py, plot_sankey.py

- Refactor parsers to use shared utilities:
  - Update _vhdr_parser.py, _snirf_parser.py, _mef3_parser.py
  - Extract _build_bids_search_paths() helper in 3_digest.py
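
A small sketch of the multi-encoding read mentioned for `read_with_encoding_fallback()`. The real signature and encoding list are not shown in this conversation, so the name `read_text_with_fallback` and the default encodings below are assumptions.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode a file with the first encoding in the list that succeeds."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"Could not decode {path} with any of {encodings}")
```

Because `latin-1` maps every byte to a character, it never fails and acts as a last resort; the stricter encodings are tried first so that valid UTF-8 is not misread.
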
github-actions bot (Contributor) commented Jan 22, 2026

📚 Documentation Preview

📦 Download Documentation Artifact

Download the documentation-html artifact from the workflow run to view the docs locally.

💡 To enable live previews, add a SURGE_TOKEN secret to this repository. See surge.sh for setup instructions.

- Fix mock patch location for cache (use eegdash.dataset.registry.get_default_cache_dir)
- Update mock data to use proper tags structure (modality, type, pathology)
- Fix test_register_openneuro_datasets to use direct import instead of importlib.util
- Update assertions to match correct field mappings:
  - tags.modality → modality of exp
  - tags.type → type of exp
  - tags.pathology → Type Subject
  - recording_modality → record_modality
- Fix test_registry_docstring_generation to expect omitted Subjects when None
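
The field mappings asserted in these tests can be sketched as a small function. This is illustrative only: `summary_fields` is a hypothetical name and the record shape is assumed from the commit messages, including the earlier change that turns an empty "Type Subject" into "Unknown".

```python
def summary_fields(record):
    """Map an API record to the summary column names listed above."""
    tags = record.get("tags") or {}
    return {
        "modality of exp": tags.get("modality"),
        "type of exp": tags.get("type"),
        # An empty or missing pathology becomes "Unknown".
        "Type Subject": tags.get("pathology") or "Unknown",
        "record_modality": record.get("recording_modality"),
    }
```
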
codecov bot commented Jan 22, 2026

@bruAristimunha bruAristimunha merged commit 0281445 into develop Jan 22, 2026
4 checks passed