Conversation

@CloseChoice (Contributor) commented Oct 13, 2025

Add support for NIfTI.

supports #7804

This PR follows #7325 very closely

I am a bit unsure what we need to add to the document_dataset.mdx and document_load.mdx. I should probably create a dataset on the hub first to create this guide instead of copy+pasting from PDF.

Open todos:

  • [ ] create NIfTI dataset on the hub
  • [ ] update document_dataset.mdx and document_load.mdx

EDIT:
I tested with two datasets I created on the hub, one with gzipped files (extension .nii.gz) and one with unzipped files (.nii), and both seem to work fine. Loading locally works as well.
Here is the script that I ran against the hub:

from datasets import load_dataset
import nibabel as nib


dataset = load_dataset(
    "TobiasPitters/test-nifti-unzipped",
    split="test",  # load as a single Dataset, not a DatasetDict
)

print("length dataset unzipped:", len(dataset))
for item in dataset:
    assert isinstance(item["nifti"], nib.nifti1.Nifti1Image)

dataset = load_dataset(
    "TobiasPitters/test-nifti",
    split="train",  # load as a single Dataset, not a DatasetDict
)
print("length dataset zipped:", len(dataset))
for item in dataset:
    assert isinstance(item["nifti"], nib.nifti1.Nifti1Image)

@CloseChoice CloseChoice marked this pull request as ready for review October 14, 2025 17:51
@lhoestq (Member) left a comment


Wow this is awesome ! the code looks all good to me

I am a bit unsure what we need to add to the document_dataset.mdx and document_load.mdx. I should probably create a dataset on the hub first to create this guide instead of copy+pasting from PDF.

imo you could get some inspiration from the PDF docs indeed, but showcase how it works for an actual dataset, and ideally what the main usages of Nifti1Image are, both in general and in a training setting (converting to PIL.Image or a torch tensor, for example)
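For illustration, here is a minimal sketch (not part of this PR) of what that kind of usage could look like, reusing the test dataset and split from the script above; the conversions themselves are plain nibabel / NumPy / PIL / torch calls:

# Minimal sketch (not from this PR): common ways to consume a decoded
# Nifti1Image in a training setting.
import numpy as np
import torch
from PIL import Image

from datasets import load_dataset

dataset = load_dataset("TobiasPitters/test-nifti-unzipped", split="test")

nifti_img = dataset[0]["nifti"]    # nibabel.nifti1.Nifti1Image
volume = nifti_img.get_fdata()     # float64 voxel array, e.g. shape (H, W, D)

# whole volume as a torch tensor, e.g. for a 3D model
tensor = torch.from_numpy(volume).float()

# middle axial slice as a PIL image, rescaled to 0-255 for visualization
slice_2d = volume[:, :, volume.shape[2] // 2]
slice_2d = ((slice_2d - slice_2d.min()) / (slice_2d.max() - slice_2d.min() + 1e-8) * 255).astype(np.uint8)
pil_image = Image.fromarray(slice_2d)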

return test_case


def require_nibabel(test_case):
Member

don't forget to add nibabel to setup.py in the test dependencies

Contributor Author

Done, thanks!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq (Member) commented Oct 20, 2025

Btw I couldn't resist but share your PR with the community online on twitter already, I hope this is fine !

@CloseChoice (Contributor Author) left a comment


Alright, docs are updated, code in the docs is tested as well and works. Happy for another review round.

EDIT: I created a proper NIfTI dataset on the hub: https://huggingface.co/datasets/TobiasPitters/NIfTI-SIRF-exercises-geometry, but I thought it's not good practice to reference personal (even if public) datasets in the docs.


@CloseChoice CloseChoice requested a review from lhoestq October 21, 2025 15:53
@CloseChoice (Contributor Author)

Btw I couldn't resist but share your PR with the community online on twitter already, I hope this is fine !

Wow, that was quick! Thanks, already liked your comment, I appreciate it!

@lhoestq (Member) left a comment


lgtm !

@lhoestq lhoestq merged commit 5138876 into huggingface:main Oct 24, 2025
2 of 14 checks passed
@CloseChoice CloseChoice deleted the add-nifti-support branch October 24, 2025 14:32
@CloseChoice CloseChoice mentioned this pull request Oct 28, 2025
@lhoestq (Member) commented Nov 4, 2025

NIfTI support is out in datasets==4.4.0 ! 🥳

Btw do you know a good NIfTI visualizer in HTML/JS or in Python? We could add something like .to_html() (or equivalent) to view data in a notebook, and enable the Dataset Viewer on HF if it can be useful

cc @cfahlgren1 @georgiachanning for viz

@JINAILAB commented Nov 5, 2025

Hi, I have a quick question while testing the new NIfTI support.

I cloned the latest main branch, installed it locally using pip install -e ., and ran the following:

from datasets import load_dataset

dataset = load_dataset(
    "TobiasPitters/NIfTI-SIRF-exercises-geometry",
    split="train"
)
dataset[0]['nifti'].get_fdata()[0].shape

However, I’m getting the following error:

FileNotFoundError: No such file or no access: 'data/nifti/OBJECT_phantom_T2W_TSE_Sag_18_1.nii'

When I manually place the NIfTI file in that local path, it works fine.
But I assume the intended behavior is for the .nii file to be included in the dataset hosted on the Hub, so that load_dataset() automatically loads it without relying on a local file path.

Interestingly, after adding the embed_storage method, it started working properly.
Could you please confirm whether this is the expected behavior, or if my previous setup was missing something?

@CloseChoice (Contributor Author)

from datasets import load_dataset

dataset = load_dataset(
    "TobiasPitters/NIfTI-SIRF-exercises-geometry",
    split="train"
)
dataset[0]['nifti'].get_fdata()[0].shape

Thanks for the report, I can confirm this. It's a problem with the dataset. Can you try this:

from datasets import load_dataset
import nibabel as nib

dataset = load_dataset(
        "TobiasPitters/test-nifti-unzipped",
        split="train"  # Load as single Dataset, not DatasetDict
)

print("length dataset:", len(dataset))
for item in dataset:
    assert isinstance(item["nifti"], nib.nifti1.Nifti1Image)

If you're interested in "TobiasPitters/NIfTI-SIRF-exercises-geometry" I can give it a shot and reupload it correctly, otherwise I'd take it down.
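For reference, a rough sketch of how such a dataset could be (re)uploaded so that the NIfTI bytes are embedded instead of referenced by local paths. It assumes the Nifti feature is exposed as datasets.Nifti and that embed_storage support is in place (see the discussion further down); file paths and the repo id are placeholders:

from datasets import Dataset, Nifti

# local .nii / .nii.gz files; placeholder paths for illustration
files = ["scans/volume_000.nii", "scans/volume_001.nii.gz"]

ds = Dataset.from_dict({"nifti": files}).cast_column("nifti", Nifti())

# push_to_hub should embed the file bytes into the Parquet shards so that
# load_dataset() works without relying on local file paths
ds.push_to_hub("your-username/your-nifti-dataset")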

@CloseChoice (Contributor Author) commented Nov 5, 2025

NIfTI

I would suggest https://github.com/rii-mango/Papaya, just tested it and it looks quite good. How would it work to add that to the dataset-viewer?

And I assume you'd like to have the to_html feature on the Nifti class?

EDIT: do we have anything like this already for other features? Couldn't find anything. We could do this in different ways: simply inlining Papaya, or building custom components (like e.g. SHAP does). If we decide on the latter, we'll need to build JS components in datasets, so we'll need a bundler etc., but that provides the highest flexibility. If that's of interest, I can take a look into this.
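To sketch the to_html idea (purely illustrative, not an interactive viewer like Papaya; the helper name and approach are assumptions, not part of this PR), something like the following could render the middle slice of a Nifti1Image inline in a notebook:

import base64
import io

import numpy as np
from PIL import Image


def nifti_slice_to_html(nifti_img, axis=2):
    # take the middle slice along the given axis and rescale to 0-255
    volume = nifti_img.get_fdata()
    slice_2d = np.take(volume, volume.shape[axis] // 2, axis=axis)
    lo, hi = slice_2d.min(), slice_2d.max()
    slice_2d = ((slice_2d - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)
    # encode as an inline base64 PNG <img> tag
    buffer = io.BytesIO()
    Image.fromarray(slice_2d).save(buffer, format="PNG")
    encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
    return f'<img src="data:image/png;base64,{encoded}" alt="NIfTI slice"/>'

In a notebook this could then be displayed with IPython.display.HTML(nifti_slice_to_html(item["nifti"])).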

@JINAILAB commented Nov 5, 2025

Hi, thanks for your help earlier. I tested the dataset you shared (TobiasPitters/test-nifti-unzipped), and it works perfectly — all NIfTI files load correctly and get_fdata() returns valid arrays.

However, when I upload my own dataset to the Hugging Face Hub using my code, it doesn’t work properly. The NIfTI files seem not to decode correctly. can you check it?

from datasets import ClassLabel, Dataset, Nifti

# train_df is a pandas DataFrame with a "nifti" column of file paths and a "label" column
train_dataset = Dataset.from_pandas(train_df)

def cast_dataset(dataset):
    dataset = dataset.cast_column("nifti", Nifti(decode=True))
    dataset = dataset.cast_column("label", ClassLabel(num_classes=10, names=[str(i) for i in range(10)]))
    return dataset

train_dataset = cast_dataset(train_dataset)

@CloseChoice (Contributor Author) commented Nov 6, 2025

Hi, thanks for your help earlier. I tested the dataset you shared (TobiasPitters/test-nifti-unzipped), and it works perfectly — all NIfTI files load correctly and get_fdata() returns valid arrays.

However, when I upload my own dataset to the Hugging Face Hub using my code, it doesn’t work properly. The NIfTI files seem not to decode correctly. can you check it?

from datasets import ClassLabel, Dataset, Nifti

# train_df is a pandas DataFrame with a "nifti" column of file paths and a "label" column
train_dataset = Dataset.from_pandas(train_df)

def cast_dataset(dataset):
    dataset = dataset.cast_column("nifti", Nifti(decode=True))
    dataset = dataset.cast_column("label", ClassLabel(num_classes=10, names=[str(i) for i in range(10)]))
    return dataset

train_dataset = cast_dataset(train_dataset)

Are you using gzipped NIfTI files? It seems like there is an issue with those. I found that in decode_example the path ends up looking like 'gzip://T1.nii::/home/tobias/programming/github/datasets/nitest-balls1/NIFTI/T1.nii.gz', so we go down the remote-path branch, which results in a KeyError since repo_id is not specified. The root cause is the DownloadManager.extract method, where we extract compressed files.

@lhoestq : what do you suggest here? We could probably do something like this in the decode_example:

if path.startswith("gzip:"):
    path = path.split("::")[-1]

Though I would need to test if this is actually OS agnostic.
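For illustration, applying that snippet to the example path from above:

# illustration only: what the proposed split does to the chained gzip path
path = "gzip://T1.nii::/home/tobias/programming/github/datasets/nitest-balls1/NIFTI/T1.nii.gz"
if path.startswith("gzip:"):
    path = path.split("::")[-1]
print(path)  # -> /home/tobias/programming/github/datasets/nitest-balls1/NIFTI/T1.nii.gz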

@CloseChoice CloseChoice mentioned this pull request Nov 6, 2025
@lhoestq (Member) commented Nov 6, 2025

I think the issue with gzip can be fixed using the same code as in Image() imo:

- try:
-     repo_id = string_to_dict(source_url, pattern)["repo_id"]
-     token = token_per_repo_id.get(repo_id)
- except ValueError:
-     token = None
+ source_url_fields = string_to_dict(source_url, pattern)
+ token = (
+     token_per_repo_id.get(source_url_fields["repo_id"]) if source_url_fields is not None else None
+ )

@CloseChoice (Contributor Author)

Hi, thanks for your help earlier. I tested the dataset you shared (TobiasPitters/test-nifti-unzipped), and it works perfectly — all NIfTI files load correctly and get_fdata() returns valid arrays.

However, when I upload my own dataset to the Hugging Face Hub using my code, it doesn’t work properly. The NIfTI files seem not to decode correctly. can you check it?

from datasets import ClassLabel, Dataset, Nifti

# train_df is a pandas DataFrame with a "nifti" column of file paths and a "label" column
train_dataset = Dataset.from_pandas(train_df)

def cast_dataset(dataset):
    dataset = dataset.cast_column("nifti", Nifti(decode=True))
    dataset = dataset.cast_column("label", ClassLabel(num_classes=10, names=[str(i) for i in range(10)]))
    return dataset

train_dataset = cast_dataset(train_dataset)

Can you please try with this branch:

pip install git+https://github.com/CloseChoice/datasets.git@fix-embed-storage-nifti

This should fix the existing problems with NIfTI.

@JINAILAB commented Nov 7, 2025

I checked and it looks like the fix-embed-storage-nifti branch has already been merged into main, and it works fine. Thanks!
