can pipeline be more robust to handling missing data? #992

drammock · 2024-09-09T21:09:00Z

another "I have a longitudinal study" question here. I have some subjects with missing timepoints (AKA missing sessions). I want the pipeline to look at the config file (which says subjects = "all" and sessions = ["a", "b", "c"]) and then look at the bids data tree and process all the subjectXsession combinations that are present, without crashing because subject 101 is missing session "b". In practice, such errors take the form of:

┌────────┬ init/_02_find_empty_room ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│13:31:56│ 🐛 sub-101 ses-b A critical error occurred. The error message was: Missing file: /redacted/redacted/bids-data/sub-101/ses-b/meg/sub-101_ses-b_task-redacted_split-01

which is confusing; it errors when looking for a split partial file, when the real problem is that there isn't even a .../sub-101/ses-b/ directory.

Proposed solutions

just make the error message better, and do something complicated in the config where I figure out which subject-session combos I have, and run the pipeline multiple times (e.g. once for each session, with different lists of subjects each time)
blithely add an allow_missing=True here, and hope that it doesn't have catastrophic downstream effects:

mne-bids-pipeline/mne_bids_pipeline/steps/init/_02_find_empty_room.py

Line 45 in 61f2961

_update_for_splits(in_files, f"raw_run-{run}", single=True)

it seems like there are a lot of places where we do for subject in subjects and for session in sessions where subjects and sessions are taken straight from the config file each time. So it won't work to do a simple Path(...).exists() type of check to see if the entire session folder is missing, right about here:

mne-bids-pipeline/mne_bids_pipeline/steps/init/_01_init_derivatives_dir.py

Lines 79 to 81 in 61f2961

    
           for subject in get_subjects(config): 
        
               for session in get_sessions(config): 
        
                   init_subject_dirs(

But we could do that check, but then store in a private config location the subjectXsession combos that are actually present, and use that info to get subjects/sessions to loop over, throughout the codebase.

Open to other ideas!

The text was updated successfully, but these errors were encountered:

agramfort · 2024-09-10T07:04:17Z

I've also run into this issue... I remember that BIDS was considering this as problematic datasets but it's such a prevalent usecase...

…

Message ID: ***@***.***>

hoechenberger · 2024-09-10T10:17:16Z

@sappelhoff Do you know if and how BIDS supports data with missing sessions?

hoechenberger · 2024-09-10T10:24:13Z

Reading the specs:
https://bids-specification.readthedocs.io/en/stable/longitudinal-and-multi-site-studies.html
https://bids-specification.readthedocs.io/en/stable/modality-agnostic-files.html#sessions-file

I don't see anything that would disallow missing sessions. But this use case isn't explicitly mentioned anywhere, either…

sappelhoff · 2024-09-10T12:07:40Z

To the best of my knowledge, it is not disallowed. I also remember preparing a mixed design experiment once, where half of participants had a session that the other half did not have, and that led to a warning (to check whether this was intended), but not an error.

hoechenberger · 2024-09-10T12:16:57Z

5. But we could do that check, but then store in a private config location the subjectXsession combos that are actually present, and use that info to get subjects/sessions to loop over, throughout the codebase.

Yep that would be my proposal. However we also have runs within sessions and then stuff gets really complicated, at least I remember seeing some rather convoluted code handling runs within subjects and sessions … maybe we should re-think all of this and create a clean, clear lookup-table early on in the pipeline run and then just refer to this downstream.

agramfort · 2024-09-10T19:06:19Z

+1 it's great if it's just a warning. I have definitely seen this warning in the past.

…

Message ID: ***@***.***>

drammock · 2024-09-10T20:35:08Z

To the best of my knowledge, it is not disallowed. I also remember preparing a mixed design experiment once, where half of participants had a session that the other half did not have, and that led to a warning (to check whether this was intended), but not an error.

was that warning from the BIDS validator, or from MNE-BIDS-Pipeline though? If the standard allows it, IMO the pipeline shouldn't error out.

sappelhoff · 2024-09-10T20:48:05Z

It was a warning from the BIDS validator.

If the standard allows it, IMO the pipeline shouldn't error out.

agreed

larsoner · 2024-09-13T13:25:38Z

maybe we should re-think all of this and create a clean, clear lookup-table early on in the pipeline run and then just refer to this downstream.

Sure this would work. Or (equivalently) something like get_subjects_sessions that returns list[tuple[str, str]] of the subject-session pairs on the fly just like our get_runs_tasks that returns a list[tuple[str, str]] of the run-task pairs. This would be a bit of an intermediate solution but I think it would work?

And I think it would be good to have a config.allow_missing_sessions = False (by default) that you would have to set to True to allow it to just skip missing sessions when not present. That way people won't get any unpleasant surprises thinking they're analyzing a complete dataset when really they're missing some stuff.

drammock mentioned this issue Sep 26, 2024

allow missing sessions #1000

Merged

1 task

larsoner closed this as completed in #1000 Oct 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

can pipeline be more robust to handling missing data? #992

can pipeline be more robust to handling missing data? #992

drammock commented Sep 9, 2024

agramfort commented Sep 10, 2024 via email

hoechenberger commented Sep 10, 2024

hoechenberger commented Sep 10, 2024 •

edited

Loading

sappelhoff commented Sep 10, 2024

hoechenberger commented Sep 10, 2024

agramfort commented Sep 10, 2024 via email

drammock commented Sep 10, 2024

sappelhoff commented Sep 10, 2024

larsoner commented Sep 13, 2024

can pipeline be more robust to handling missing data? #992

can pipeline be more robust to handling missing data? #992

Comments

drammock commented Sep 9, 2024

Proposed solutions

agramfort commented Sep 10, 2024 via email

hoechenberger commented Sep 10, 2024

hoechenberger commented Sep 10, 2024 • edited Loading

sappelhoff commented Sep 10, 2024

hoechenberger commented Sep 10, 2024

agramfort commented Sep 10, 2024 via email

drammock commented Sep 10, 2024

sappelhoff commented Sep 10, 2024

larsoner commented Sep 13, 2024

hoechenberger commented Sep 10, 2024 •

edited

Loading