Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

can pipeline be more robust to handling missing data? #992

Closed
drammock opened this issue Sep 9, 2024 · 9 comments · Fixed by #1000
Closed

can pipeline be more robust to handling missing data? #992

drammock opened this issue Sep 9, 2024 · 9 comments · Fixed by #1000

Comments

@drammock
Copy link
Member

drammock commented Sep 9, 2024

another "I have a longitudinal study" question here. I have some subjects with missing timepoints (AKA missing sessions). I want the pipeline to look at the config file (which says subjects = "all" and sessions = ["a", "b", "c"]) and then look at the bids data tree and process all the subjectXsession combinations that are present, without crashing because subject 101 is missing session "b". In practice, such errors take the form of:

┌────────┬ init/_02_find_empty_room ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│13:31:56│ 🐛 sub-101 ses-b A critical error occurred. The error message was: Missing file: /redacted/redacted/bids-data/sub-101/ses-b/meg/sub-101_ses-b_task-redacted_split-01

which is confusing; it errors when looking for a split partial file, when the real problem is that there isn't even a .../sub-101/ses-b/ directory.

Proposed solutions

  1. just make the error message better, and do something complicated in the config where I figure out which subject-session combos I have, and run the pipeline multiple times (e.g. once for each session, with different lists of subjects each time)

  2. blithely add an allow_missing=True here, and hope that it doesn't have catastrophic downstream effects:

_update_for_splits(in_files, f"raw_run-{run}", single=True)

  1. it seems like there are a lot of places where we do for subject in subjects and for session in sessions where subjects and sessions are taken straight from the config file each time. So it won't work to do a simple Path(...).exists() type of check to see if the entire session folder is missing, right about here:

    for subject in get_subjects(config):
    for session in get_sessions(config):
    init_subject_dirs(

    But we could do that check, but then store in a private config location the subjectXsession combos that are actually present, and use that info to get subjects/sessions to loop over, throughout the codebase.

Open to other ideas!

@agramfort
Copy link
Member

agramfort commented Sep 10, 2024 via email

@hoechenberger
Copy link
Member

@sappelhoff Do you know if and how BIDS supports data with missing sessions?

@hoechenberger
Copy link
Member

hoechenberger commented Sep 10, 2024

Reading the specs:
https://bids-specification.readthedocs.io/en/stable/longitudinal-and-multi-site-studies.html
https://bids-specification.readthedocs.io/en/stable/modality-agnostic-files.html#sessions-file

I don't see anything that would disallow missing sessions. But this use case isn't explicitly mentioned anywhere, either…

@sappelhoff
Copy link
Member

To the best of my knowledge, it is not disallowed. I also remember preparing a mixed design experiment once, where half of participants had a session that the other half did not have, and that led to a warning (to check whether this was intended), but not an error.

@hoechenberger
Copy link
Member

5. But we could do that check, but then store in a private config location the subjectXsession combos that are actually present, and use that info to get subjects/sessions to loop over, throughout the codebase.

Yep that would be my proposal. However we also have runs within sessions and then stuff gets really complicated, at least I remember seeing some rather convoluted code handling runs within subjects and sessions … maybe we should re-think all of this and create a clean, clear lookup-table early on in the pipeline run and then just refer to this downstream.

@agramfort
Copy link
Member

agramfort commented Sep 10, 2024 via email

@drammock
Copy link
Member Author

To the best of my knowledge, it is not disallowed. I also remember preparing a mixed design experiment once, where half of participants had a session that the other half did not have, and that led to a warning (to check whether this was intended), but not an error.

was that warning from the BIDS validator, or from MNE-BIDS-Pipeline though? If the standard allows it, IMO the pipeline shouldn't error out.

@sappelhoff
Copy link
Member

It was a warning from the BIDS validator.

If the standard allows it, IMO the pipeline shouldn't error out.

agreed

@larsoner
Copy link
Member

maybe we should re-think all of this and create a clean, clear lookup-table early on in the pipeline run and then just refer to this downstream.

Sure this would work. Or (equivalently) something like get_subjects_sessions that returns list[tuple[str, str]] of the subject-session pairs on the fly just like our get_runs_tasks that returns a list[tuple[str, str]] of the run-task pairs. This would be a bit of an intermediate solution but I think it would work?

And I think it would be good to have a config.allow_missing_sessions = False (by default) that you would have to set to True to allow it to just skip missing sessions when not present. That way people won't get any unpleasant surprises thinking they're analyzing a complete dataset when really they're missing some stuff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants