feat(docs): improve dataset documentation with tags, feedback section, and visualization updates #235

bruAristimunha · 2026-01-21T21:01:37Z

Summary

Replace dataset categorization with tag structure and update docs/summary tables; use computed titles from API.
Add dataset feedback section with GitHub issue button and doc build fixes/parallelization.
Improve chart rendering (ridgeline modality toggle, clinical breakdown colors, growth/scale fixes, DOI date fallback).
Add/refresh ingestion and parsing utilities (VHDR/SNIRF/MEF3 parsers, validation checks, MEG metadata fixes; consolidated plot_dataset helpers).
Standardize bids_relpath as canonical key and improve dataset utilities (registry caching gate, download of dataset-level files).
Align ntimes to sample counts, derive from sidecar duration/sfreq, and update metadata duration math and tests.
Fix complexity SVD entropy for numpy versions without np.linalg.svdvals.
Refactors/maintenance: remove deprecated Paradigm, pin pandas<2.2, pre-commit and PEP8/nested-function cleanups, color mapping cleanup, and revert NEMAR CLI support to keep a GitHub-only approach.
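For reviewers, a minimal sketch of the kind of fallback the SVD-entropy fix refers to, assuming the feature only needs singular values. The helper names `singular_values` and `svd_entropy` are hypothetical and are not taken from the PR; `np.linalg.svd(..., compute_uv=False)` is the long-standing NumPy call that returns singular values only.

```python
import numpy as np


def singular_values(matrix):
    """Return singular values, preferring np.linalg.svdvals when it exists."""
    svdvals = getattr(np.linalg, "svdvals", None)
    if svdvals is not None:
        return svdvals(matrix)
    # Older NumPy: compute_uv=False skips U and V and returns only the values.
    return np.linalg.svd(matrix, compute_uv=False)


def svd_entropy(matrix):
    """Shannon entropy (in nats) of the normalized singular values."""
    s = singular_values(np.asarray(matrix, dtype=float))
    total = s.sum()
    if total == 0:
        return 0.0
    p = s / total
    p = p[p > 0]  # drop zeros so the logarithm is defined
    return float(-(p * np.log(p)).sum())
```

Looking the attribute up with `getattr` keeps a single code path for both NumPy generations instead of branching on version strings.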

Test Plan

NUMBA_DISABLE_JIT=1 python -m pytest tests/unit_tests/dataset/test_base.py tests/unit_tests/dataset/test_bids_dataset.py tests/unit_tests/dataset/test_dataset.py tests/unit_tests/dataset/test_dataset_maturity.py (111 passed, 1 skipped)
NUMBA_DISABLE_JIT=1 python -m pytest tests/unit_tests/features/feature_bank/test_complexity.py (9 passed)
NUMBA_DISABLE_JIT=1 python -m pytest tests/integration/test_benchmarks.py::TestMetadataAccessPerformance::test_num_times_query (skipped: dataset missing at .eegdash_cache/ds005509-bdf-mini)

Co-authored-by: Arnaud Delorme <arnodelorme@gmail.com> Co-authored-by: Kkuntal990 <kokatekuntal@gmail.com>

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

- Added interactive growth plots with consistent modality colors
- Added clinical breakdown stacked bar charts
- Updated dataset summary page with new tabs
- Added color definitions for new modalities (EEG, fNIRS, etc.)
- Fixed linting issues (E701, F841)

- Add BIDS inheritance fallback for MEG sidecar files (searches parent directories and handles run/acq entity variations)
- Prefer channel_labels count from channels.tsv over partial sidecar counts
- Add CTF .ds directory detection for has_actual_files check
- Fix bug where 'name' variable was overwritten in root file scan loop
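A rough sketch of what a BIDS-inheritance lookup can look like, under the assumption that a sidecar applies to a recording when its entities are a subset of the recording's entities. The function name `find_inherited_sidecar` and its parameters are hypothetical; the PR's actual search logic may differ.

```python
from pathlib import Path


def _entities(name):
    """Return the key-value chunks of a BIDS file name, e.g. {'sub-01', 'task-rest'}."""
    stem = name.split(".")[0]
    return {part for part in stem.split("_") if "-" in part}


def find_inherited_sidecar(data_file, dataset_root, suffix="meg"):
    """Walk from the recording's directory up to the dataset root looking for a sidecar."""
    dataset_root = Path(dataset_root).resolve()
    data_file = Path(data_file).resolve()
    wanted = _entities(data_file.name)
    directory = data_file.parent
    while True:
        matches = [
            path
            for path in sorted(directory.glob(f"*{suffix}.json"))
            if _entities(path.name) <= wanted
        ]
        if matches:
            # Prefer the most specific sidecar (the one sharing the most entities).
            return max(matches, key=lambda path: len(_entities(path.name)))
        if directory == dataset_root or directory == directory.parent:
            return None
        directory = directory.parent
```

Because matching is by entity subset, a sidecar that omits `run-` or `acq-` still applies to a recording that carries those entities, which is one way to cover the run/acq variations mentioned above.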

- Add data quality validation for nchans and sampling_frequency fields
- Make validation run automatically before injection (--skip-validation to bypass)
- Add --data-quality-threshold parameter to control acceptable missing data percentage
- Create _validate.py module for reusable validation functions
- Add data quality statistics to validation summary output
- Track datasets with missing channel counts or sampling frequency
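To illustrate the threshold idea, here is a small sketch of a missing-data gate. The function names, the record layout (plain dicts), and the default threshold are assumptions made for this example and are not copied from `_validate.py`.

```python
def missing_field_stats(records, fields=("nchans", "sampling_frequency")):
    """Return the percentage of records in which each field is missing or empty."""
    total = len(records)
    stats = {}
    for field in fields:
        missing = sum(1 for rec in records if rec.get(field) in (None, "", 0))
        stats[field] = 100.0 * missing / total if total else 0.0
    return stats


def passes_quality_gate(records, threshold_pct=10.0):
    """Return (ok, stats); ok is False when any field exceeds the threshold."""
    stats = missing_field_stats(records)
    ok = all(pct <= threshold_pct for pct in stats.values())
    return ok, stats
```

Returning the statistics alongside the verdict lets the caller print them in a validation summary even when the gate passes.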

- Add Tags TypedDict with pathology, modality, and type fields
- Update Dataset TypedDict to include tags field
- Update create_dataset function to accept tags_* parameters
- Update registry to consume tags from API with fallback to legacy clinical/paradigm
- Deprecate clinical and paradigm fields in favor of tags

- Update main_from_json to read from tags.pathology, tags.modality, tags.type
- Add fallback to legacy clinical/paradigm fields for backwards compatibility
- Update prepare_table to handle direct tags columns from API

- Add Description section with README content to dataset pages
- Convert markdown/RST headers to bold text to avoid document structure issues
- Handle box-style headers (em-dashes) in README content
- Use computed_title from API for proper dataset titles
- Use nchans_counts/sfreq_counts from datasets collection
- Add collapsible dropdown for long READMEs (>30 lines)
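As an illustration of the header-to-bold conversion, the sketch below handles two common cases: markdown `#` headers and RST-style titles followed by an underline. The name `headers_to_bold` and the exact patterns are assumptions for this example; the box-style headers mentioned above are not covered.

```python
import re

# "# Title", "## Title ##", etc.
_MD_HEADER = re.compile(r"^\s{0,3}#{1,6}\s+(.*?)\s*#*\s*$")
# A line made of one repeated punctuation character, at least three long.
_UNDERLINE = re.compile(r"^\s*([=\-~^#*+])\1{2,}\s*$")


def headers_to_bold(text):
    """Rewrite header lines as **bold** so they do not create document sections."""
    lines = text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        md = _MD_HEADER.match(line)
        if md and md.group(1):
            out.append(f"**{md.group(1)}**")
        elif line.strip() and not _UNDERLINE.match(line) and _UNDERLINE.match(nxt):
            out.append(f"**{line.strip()}**")
            i += 1  # skip the underline that followed the title
        else:
            out.append(line)
        i += 1
    return "\n".join(out)
```

Demoting README headers this way keeps an embedded README from injecting extra section levels into the generated dataset page.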

- Update plot scripts to use new tags structure for dataset classification
- Improve clinical breakdown, growth, sankey, and treemap visualizations
- Update prepare_summary_tables for tags-based data extraction

- Add _nemar.py helper module with CLI subprocess wrappers
- Update nemar.py to use CLI as primary method with GitHub fallback
- Add --use-cli and --use-github flags for method selection
- Update fetch-source.yml workflow with Bun/NEMAR CLI setup options
- Update 1-fetch-nemar.yml to enable CLI installation

The NEMAR CLI (nemar-cli via Bun) provides direct access to the NEMAR API for listing datasets. GitHub API fallback is maintained for compatibility. Currently using --skip-bids as GitHub repos don't exist yet.

- Update validate_record to only require bids_relpath (not bidspath)
- Add documentation explaining bidspath is computed from bids_relpath
- Document flatten_entities conflict resolution logic
- Document fingerprint fallback chain (storage.raw_key -> bids_relpath)

- Add toggle buttons to switch between Experiment Modality and Recording Modality views
- Extract trace building logic into reusable _build_ridgeline_traces function
- Add primary_recording_modality utility for canonical recording modality labels
- Filter out datasets with 0 or negative participants to avoid log10 issues

- Convert numpy arrays to Python lists in ridgeline chart to avoid Plotly binary encoding issues
- Add resting_state canonical mapping for consistency
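A minimal sketch of the array-to-list conversion, assuming traces are plain dicts and lists that may contain NumPy arrays or scalars. The name `convert_arrays` echoes the `_convert_arrays` helper mentioned elsewhere in this PR, but the body here is illustrative only.

```python
import json

import numpy as np


def convert_arrays(obj):
    """Recursively replace NumPy arrays and scalars with plain Python values."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, dict):
        return {key: convert_arrays(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [convert_arrays(value) for value in obj]
    return obj


# Example trace with hypothetical values; after conversion it is JSON-serializable.
trace = {"x": np.array([1.0, 2.0]), "y": [np.float64(3.5)], "name": "EEG"}
plain = convert_arrays(trace)
serialized = json.dumps(plain)
```

Once every value is a plain list or number, the figure data is written as ordinary JSON arrays rather than an encoded binary payload.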

- Add distinct colors for all pathology types (epilepsy, depression, alzheimer, etc.)
- Include both title case and lowercase variants for color matching
- Handle NaN/nan strings in modality and population_type columns
- Filter out unknown modalities from the chart

- Fix plot scaling by using explicit CSS pixel heights instead of height: 100%
- Increase plot heights to 1000px for better visibility (growth, clinical_breakdown)
- Make colors consistent across Sankey and Clinical Breakdown charts
- Use PATHOLOGY_PASTEL_OVERRIDES for all clinical conditions
- Improve Unspecified Clinical color visibility (#fda4af)

- Add fetch_chart_data_from_api() for optimized chart generation
- Simplify fetch_datasets_from_api() to single API call (stats embedded)
- Add parallel chart generation in prepare_summary_tables.py
- Add duration_hours_total field for treemap compatibility
- Add .env.example with environment variable documentation
- Add CLAUDE.md with project guide
- Add api_helper.py CLI tool for common API operations
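One possible shape for parallel chart generation, sketched with a thread pool. Whether prepare_summary_tables.py uses threads or processes is not stated in this PR, so the executor choice, the function name `generate_charts`, and the builder mapping are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def generate_charts(builders, max_workers=4):
    """Run chart builders concurrently and return {name: result or exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(build): name for name, build in builders.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # keep going so one bad chart does not stop the rest
                results[name] = exc
    return results
```

Collecting exceptions instead of raising lets a docs build report every failing chart in one pass.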

- Add DOI resolution fallback for publication dates in openneuro.py and nemar.py
- Simplify prepare_summary_tables.py (remove JSON mode, ~700 lines reduced)
- Filter EEG2025* datasets from chart data (HBN competition datasets)
- Fix empty "Type Subject" → "Unknown" in registry.py
- Adjust chart heights and margins for better rendering:
  - Sankey: height 900→1100, margins adjusted
  - Clinical: height 1000→650, bottom margin increased
  - Growth: height 1000→550, bottom margin increased

- growth.py: Move _normalize_modality to module level
- prepare_summary_tables.py: Move _normalise_tag and _strip_unknown to module level
- registry.py: Move _make_dataset_init, _clean_optional, and _clean_or_unknown to module level

Move nested functions to module level across the codebase:

- eegdash/features/inspect.py: Extract _is_feature, _is_feature_extractor, _is_feature_preprocessor, _is_feature_kind predicates
- docs/source/conf.py: Extract _stat_line, _is_decorative_line helpers
- docs/plot_dataset/ridgeline.py: Extract _convert_arrays helper
- scripts/create_metadata.py: Extract _s3_size_worker for parallel S3 queries
- scripts/ingestions/5_inject.py: Extract _sanitize_for_json and _inject_records_batch for parallel record injection
- scripts/ingestions/_file_utils.py: Extract _fetch_scidb_path and _propfind_datarn for recursive file listing

Add pre-commit hook (scripts/check_nested_functions.py) to prevent nested function definitions in future code.
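For context, a check of this kind can be written with the standard `ast` module. The sketch below is an illustration and is not the contents of scripts/check_nested_functions.py; the function name and return format are assumptions.

```python
import ast

_FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def find_nested_functions(source, filename="<string>"):
    """Return sorted (line, name) pairs for functions defined inside another function."""
    tree = ast.parse(source, filename=filename)
    found = []
    # Each stack entry is (node, whether the node sits inside a function body).
    stack = [(tree, False)]
    while stack:
        node, inside = stack.pop()
        for child in ast.iter_child_nodes(node):
            is_func = isinstance(child, _FUNC_TYPES)
            if is_func and inside:
                found.append((child.lineno, child.name))
            stack.append((child, inside or is_func))
    return sorted(found)
```

Methods of a top-level class are not reported, because a class body does not count as being inside a function; a pre-commit wrapper would call this per file and exit non-zero when the list is non-empty.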

Add helper function _create_color_map_with_aliases() that automatically generates lowercase, UPPERCASE, and Title Case variants for each color key. Refactor PATHOLOGY_PASTEL_OVERRIDES from 38 lines of manual duplication to a clean 22-line base dict with auto-generated aliases. This makes adding new colors a single-line change instead of 2-3 lines.
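A sketch of how such an alias generator might work, based only on the description above (lowercase, UPPERCASE, and Title Case variants per key). The body is an assumption, and apart from `#fda4af`, which is quoted earlier in this PR, the color value is made up for the example.

```python
def create_color_map_with_aliases(base):
    """Expand each key into its original, lowercase, UPPERCASE, and Title Case forms."""
    expanded = {}
    for key, color in base.items():
        for variant in (key, key.lower(), key.upper(), key.title()):
            # setdefault keeps the first color seen if two keys collide on a variant.
            expanded.setdefault(variant, color)
    return expanded


BASE_COLORS = {
    "Unspecified Clinical": "#fda4af",
    "epilepsy": "#c4b5fd",  # hypothetical value, for illustration only
}
COLOR_MAP = create_color_map_with_aliases(BASE_COLORS)
```

With this helper, adding a color means adding one entry to the base dict; the case variants follow automatically.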

Reverts the NEMAR CLI changes from 7d215d0. Will revisit later.

Move inline imports from inside functions to the top of modules to follow PEP8 guidelines and improve code clarity.

Add fallback metadata extraction from neurophysiology file formats:

- VHDR parser: Extract sampling_frequency, nchans, ch_names from BrainVision headers
- SNIRF parser: Extract metadata from fNIRS files using MNE or h5py fallback
- MEF3 parser: Extract metadata from MEF3 directory structures

These parsers provide metadata extraction when BIDS sidecar files are missing. Also includes comprehensive test suite for VHDR parser functionality.
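To make the VHDR case concrete, here is a small stand-alone sketch that reads the three fields from header text. It assumes the usual BrainVision layout, where `SamplingInterval` under `[Common Infos]` is given in microseconds and channels are listed as `Ch<n>=<name>,...` under `[Channel Infos]`. The function name `parse_vhdr` is hypothetical and this is not the PR's `_vhdr_parser.py`.

```python
def parse_vhdr(text):
    """Extract sampling_frequency, nchans, and ch_names from BrainVision header text."""
    section = None
    common = {}
    ch_names = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if "=" not in line:
            continue  # e.g. the version banner on the first line
        key, value = (part.strip() for part in line.split("=", 1))
        if section == "Common Infos":
            common[key] = value
        elif section == "Channel Infos" and key.startswith("Ch"):
            ch_names.append(value.split(",")[0].strip())

    interval_us = common.get("SamplingInterval")
    sfreq = 1e6 / float(interval_us) if interval_us else None
    if "NumberOfChannels" in common:
        nchans = int(common["NumberOfChannels"])
    else:
        nchans = len(ch_names)
    return {"sampling_frequency": sfreq, "nchans": nchans, "ch_names": ch_names}
```

Parsing by hand rather than with `configparser` sidesteps the banner line that precedes the first section header.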

Remove the Paradigm TypedDict and related parameters from create_dataset. Use the tags field with modality and type keys instead.

wfdb has a bug with pandas ArrowStringArray that causes import failures. Pin pandas to <2.2 until wfdb releases a fix.

Add TestSchemaContract class with pytest.mark.parametrize for automatic testing of all Pydantic/TypedDict pairs:

- PYDANTIC_TYPEDDICT_PAIRS: exact field match (Storage, Entities)
- PYDANTIC_SUBSET_PAIRS: Pydantic subset of TypedDict (Record, Dataset)
- DOCUMENTED_TYPEDDICTS: all TypedDicts must have docstrings
- REMOVED_SCHEMAS: verify deprecated schemas are gone

Also fixes schema mismatch: adds missing ingestion_fingerprint field to Dataset TypedDict (was only in DatasetModel).

To add new schemas, just add them to the lists and they're auto-tested.

- Add shared utilities to docs/plot_dataset/utils.py:
  - normalize_modality_string() for consistent modality formatting
  - detect_modality_column() for DataFrame column detection
  - read_dataset_csv() for standardized CSV loading
  - build_and_export_html() for Plotly HTML export with styling
- Create scripts/ingestions/_parser_utils.py:
  - validate_file_path() for git-annex symlink handling
  - read_with_encoding_fallback() for multi-encoding text file reading
- Create scripts/ingestions/_constants.py:
  - Centralize MODALITY_CANONICAL_MAP, NEURO_MODALITIES
  - Extract CTF_INTERNAL_EXTENSIONS, MEF3_INTERNAL_EXTENSIONS
- Refactor plot modules to use shared utilities:
  - Remove duplicate _normalize_modality() from growth.py, clinical_breakdown.py
  - Simplify HTML export in bubble.py, ridgeline.py, treemap.py, plot_sankey.py
- Refactor parsers to use shared utilities:
  - Update _vhdr_parser.py, _snirf_parser.py, _mef3_parser.py
  - Extract _build_bids_search_paths() helper in 3_digest.py
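As an illustration of multi-encoding reading, the sketch below tries a list of encodings in order. The function name matches the helper listed above, but the signature, the encoding order, and the final replacement step are assumptions rather than the PR's implementation.

```python
from pathlib import Path


def read_with_encoding_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode a text file with the first encoding that succeeds."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters instead of failing.
    return data.decode(encodings[0], errors="replace")
```

Reading the bytes once and decoding in memory avoids reopening the file for every attempt, which matters when the path is a git-annex symlink to remote storage.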

github-actions · 2026-01-22T14:50:45Z

📚 Documentation Preview

📦 Download Documentation Artifact

Download the documentation-html artifact from the workflow run to view the docs locally.

💡 To enable live previews, add a SURGE_TOKEN secret to this repository. See surge.sh for setup instructions.

- Fix mock patch location for cache (use eegdash.dataset.registry.get_default_cache_dir)
- Update mock data to use proper tags structure (modality, type, pathology)
- Fix test_register_openneuro_datasets to use direct import instead of importlib.util
- Update assertions to match correct field mappings:
  - tags.modality → modality of exp
  - tags.type → type of exp
  - tags.pathology → Type Subject
  - recording_modality → record_modality
- Fix test_registry_docstring_generation to expect omitted Subjects when None

codecov · 2026-01-22T15:31:00Z

Codecov Report

❌ Patch coverage is 68.15476% with 107 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
eegdash/dataset/registry.py	60.38%	61 Missing ⚠️
eegdash/http_api_client.py	30.43%	16 Missing ⚠️
eegdash/dataset/io.py	87.50%	8 Missing ⚠️
eegdash/schemas.py	63.63%	8 Missing ⚠️
eegdash/dataset/dataset.py	68.42%	6 Missing ⚠️
eegdash/features/inspect.py	82.35%	3 Missing ⚠️
eegdash/api.py	50.00%	2 Missing ⚠️
eegdash/dataset/bids_dataset.py	83.33%	2 Missing ⚠️
eegdash/downloader.py	0.00%	1 Missing ⚠️

bruAristimunha and others added 30 commits January 8, 2026 13:54

Sync the documentation (#223)

b3481d3

Co-authored-by: Arnaud Delorme <arnodelorme@gmail.com> Co-authored-by: Kkuntal990 <kokatekuntal@gmail.com>

Use API dataset summary for CI docs (#224)

c9bd6e0

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

Fix warning spam during EEGChallengeDataset download (#226)

3c3dc6b

feat: Mass Ingestion Fixes and Optimization (#227)

9fee4b7

Co-authored-by: Arnaud Delorme <arnodelorme@gmail.com> Co-authored-by: Kkuntal990 <kokatekuntal@gmail.com>

Merge branch 'develop' into main

4d623aa

double release

d185e2c

fix(docs): remove circular dependency in Makefile catch-all rule

f78f3bd

fix(docs): add explicit no-op rule for Makefile to prevent catch-all match

2e3e7dd

fix(docs): explicitly define html target recipe to fix silent failure

7d8ef53

updating the Makefile

0a936f2

[MNT] Deploying the webpage (#230)

b3157b8

Co-authored-by: Arnaud Delorme <arnodelorme@gmail.com> Co-authored-by: Kkuntal990 <kokatekuntal@gmail.com>

fix: correct documentation build output path from build/html to _build/html

eccb62a

- Fix artifact upload path
- Fix surge preview deployment path
- Fix gh-pages deployment for both main and develop branches
- Resolves documentation deployment failure

docs: parallelize dataset rst generation with ThreadPoolExecutor

9bfd70a

Docs: Parallelize build, improve aesthetics, and fix quickstart (#232)

1e4c685

Co-authored-by: Arnaud Delorme <arnodelorme@gmail.com> Co-authored-by: Kkuntal990 <kokatekuntal@gmail.com>

Update README.md

781829a

Merge main into develop to sync branches

d1a45a2

Merge remote-tracking branch 'origin/develop' into improving-the-description

19bbb75

docs: Move citation to page top and enhance metadata mapping

3c57ddd

feat(docs): update prepare_summary_tables to consume tags structure

94a26aa

fix(docs): use computed_title from API for dataset pages

46b459e

feat(docs): update plot scripts and summary tables for tags structure

bddf54d

updating details about the documentation

37faa8a

docs: add feedback section with GitHub issue button for dataset pages

2a9781a

bruAristimunha added 6 commits January 21, 2026 17:57

fix(plots): remove redundant titles from growth and clinical breakdown charts

1b56c80

fix(plots): convert numpy arrays to lists for Plotly compatibility

a6a2a52

bruAristimunha force-pushed the improving-the-description branch from b6f97fa to 1e2f3e2 Compare January 22, 2026 11:33

bruAristimunha added 2 commits January 22, 2026 12:38

chore: gitignore CLAUDE.md and .claude folder for privacy

a147277

bruAristimunha force-pushed the improving-the-description branch from 9c33dd4 to a03c100 Compare January 22, 2026 13:20

bruAristimunha force-pushed the develop branch from 9241aa3 to 75acb74 Compare January 22, 2026 13:20

bruAristimunha added 14 commits January 22, 2026 14:22

Delete .env.example

f0d687a

revert: remove NEMAR CLI support, restore GitHub-only approach

305da16

refactor: move imports to module level for PEP8 compliance

f68ed4e

refactor(schemas): remove deprecated Paradigm class

a91a712

fix(deps): pin pandas<2.2 for wfdb compatibility

7502cef

fix(pre-commit): remove PLC0415 rule for lazy imports

4891578

test(schemas): remove test for deprecated Paradigm class

3af33b9

Fix ntimes handling and dataset utilities

35a40bb

bruAristimunha added 2 commits January 22, 2026 16:05

Fix SVD entropy for older numpy

4de92e8

bruAristimunha merged commit 0281445 into develop Jan 22, 2026
4 checks passed
