@mturk24
Contributor

@mturk24 mturk24 commented Dec 18, 2025

Summary

Fixing the tutorials to read datasets from HuggingFace Hub instead of from S3.

The relevant Dataset cards are:

S3 datasets for each of these tutorials have been deleted.

I tested this on my local machine with Python 3.11.

mturk24 and others added 4 commits December 18, 2025 17:25
Replace S3 URLs with HuggingFace Hub dataset loading across all 4 tutorials:
- improving_ml_performance.ipynb: Use load_dataset("Cleanlab/student-grades")
- object_detection.ipynb: Use hf_hub_download for labels, predictions, and images
- segmentation.ipynb: Use hf_hub_download for given_masks and predicted_masks
- token_classification.ipynb: Use hf_hub_download for pred_probs

All tutorials now load data from HuggingFace Hub instead of S3, with proper
imports and dependencies added (datasets, huggingface_hub).

🤖 Generated with Claude Code
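
For reference, the loading pattern this commit describes might look like the sketch below. It is a minimal sketch under stated assumptions, not the notebooks' exact code: the helper names `load_student_grades` and `load_segmentation_masks` are hypothetical, taking the first available split is an assumption (the PR does not name one), and `repo_type="dataset"` is the argument that the follow-up commit in this PR adds. Both functions need network access and the `datasets`, `huggingface_hub`, and `numpy` packages.

```python
def load_student_grades():
    """Load the student-grades tutorial data as a pandas DataFrame."""
    # Imported lazily so this module can be imported without the package installed.
    from datasets import load_dataset

    dataset_dict = load_dataset("Cleanlab/student-grades")
    # Assumption: the PR does not name a split, so use the first one available.
    first_split = next(iter(dataset_dict))
    return dataset_dict[first_split].to_pandas()


def load_segmentation_masks():
    """Download both mask files from the Hub and load them as NumPy arrays."""
    import numpy as np
    from huggingface_hub import hf_hub_download

    repo_id = "Cleanlab/segmentation-tutorial"
    # repo_type="dataset" is required because the files live in a dataset repo.
    given_path = hf_hub_download(repo_id, "given_masks.npy", repo_type="dataset")
    predicted_path = hf_hub_download(repo_id, "predicted_masks.npy", repo_type="dataset")
    return np.load(given_path), np.load(predicted_path)
```

`hf_hub_download` returns a local path into the Hub cache, so repeated notebook runs reuse the downloaded file instead of fetching it again.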
The tutorial notebooks were failing with 404 errors because hf_hub_download()
defaults to looking for model repositories, not dataset repositories.

Fixed by adding repo_type="dataset" parameter to all hf_hub_download() calls in:
- object_detection.ipynb (3 downloads: labels.pkl, predictions.pkl, example_images.zip)
- segmentation.ipynb (2 downloads: given_masks.npy, predicted_masks.npy)
- token_classification.ipynb (1 download: pred_probs.npz)

This ensures the downloads use the correct URL format:
https://huggingface.co/datasets/Cleanlab/... instead of
https://huggingface.co/Cleanlab/...

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
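
To make the 404 explanation above concrete, here is a small standard-library sketch of how the repository type changes the URL prefix. The helper `hub_file_url` is hypothetical and only mimics the URL shape quoted in the commit message; it is not the library's implementation (the library ships its own `hf_hub_url` for this), and the `main` revision is an assumption.

```python
HUB_ENDPOINT = "https://huggingface.co"


def hub_file_url(repo_id, filename, repo_type="model", revision="main"):
    """Build an illustrative file URL (hypothetical helper, not the library API)."""
    if repo_type == "model":
        prefix = ""  # model repos sit directly under the endpoint
    elif repo_type == "dataset":
        prefix = "datasets/"  # dataset repos get a "datasets/" path prefix
    else:
        # Other repo types exist on the Hub but are outside this sketch.
        raise ValueError(f"unsupported repo_type: {repo_type!r}")
    return f"{HUB_ENDPOINT}/{prefix}{repo_id}/resolve/{revision}/{filename}"


repo = "Cleanlab/segmentation-tutorial"
print(hub_file_url(repo, "given_masks.npy"))                       # default: model repo
print(hub_file_url(repo, "given_masks.npy", repo_type="dataset"))  # dataset repo
```

The first URL points at a model repository that does not exist, which is why the default lookup failed; the second carries the `datasets/` prefix.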
@mturk24 mturk24 changed the title from "[WIP]: Migrate tutorials to hf hub" to "Migrate CLOS tutorials to hf hub" Dec 19, 2025
@mturk24 mturk24 requested review from elisno and jwmueller December 19, 2025 14:59
"source": [
"# Package installation (hidden on docs website).\n",
-"dependencies = [\"cleanlab\", \"matplotlib\"]\n",
+"dependencies = [\"cleanlab\", \"matplotlib\", \"huggingface_hub\"]\n",
Member

Could you also cap the version of huggingface_hub in our developer/build dependencies at the latest version (assuming that is the version you tested with)?

huggingface_hub==0.25.2 # TODO: uncap version

It's currently set to an old version: huggingface_hub==0.25.2
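
One way to check whether a local environment matches a pinned line like the one quoted above is sketched here. The helpers `parse_pin` and `installed_matches_pin` are hypothetical names, and the sketch assumes the simple `name==version  # comment` form; it does not handle other specifier styles.

```python
from importlib import metadata


def parse_pin(requirement_line):
    """Return (package, version) for a 'name==version  # comment' line."""
    spec = requirement_line.split("#", 1)[0].strip()  # drop any trailing comment
    name, separator, version = spec.partition("==")
    if not separator or not name.strip() or not version.strip():
        raise ValueError(f"not an exact pin: {requirement_line!r}")
    return name.strip(), version.strip()


def installed_matches_pin(requirement_line):
    """True or False for an installed package, None if it is not installed."""
    name, pinned_version = parse_pin(requirement_line)
    try:
        installed_version = metadata.version(name)
    except metadata.PackageNotFoundError:
        return None
    return installed_version == pinned_version


print(parse_pin("huggingface_hub==0.25.2 # TODO: uncap version"))
```

Running `installed_matches_pin` in the environment used to test the tutorials would show whether the tested version and the pinned one agree before the pin is updated.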

-"%%capture\n",
-"!wget -nc 'https://cleanlab-public.s3.amazonaws.com/ImageSegmentation/predicted_masks.npy' "
-]
+"source": "from huggingface_hub import hf_hub_download\n\n# Download from HuggingFace Hub\ngiven_masks_path = hf_hub_download('Cleanlab/segmentation-tutorial', 'given_masks.npy', repo_type=\"dataset\")\npredicted_masks_path = hf_hub_download('Cleanlab/segmentation-tutorial', 'predicted_masks.npy', repo_type=\"dataset\")"
Member

In case this prints a lot of output (which we don't want to show on our live docs site), keep the %%capture statement at the top of the cell (same for all other tutorials).
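
In the notebooks, `%%capture` is the mechanism for hiding cell output. As a plain-Python stand-in (a swapped-in technique, not what the notebooks use), the sketch below shows the same idea with `contextlib`: output written during a call is collected instead of displayed. `run_quietly` and `noisy_download` are hypothetical names, and `noisy_download` only imitates a chatty download.

```python
import contextlib
import io
import sys


def run_quietly(func, *args, **kwargs):
    """Call func while collecting anything it writes to stdout or stderr."""
    captured_out, captured_err = io.StringIO(), io.StringIO()
    with contextlib.redirect_stdout(captured_out), contextlib.redirect_stderr(captured_err):
        result = func(*args, **kwargs)
    return result, captured_out.getvalue(), captured_err.getvalue()


def noisy_download():
    """Stand-in for a download call that prints status and progress."""
    print("Downloading given_masks.npy ...")
    print("progress: 100%", file=sys.stderr)
    return "given_masks.npy"


path, out_text, err_text = run_quietly(noisy_download)
print(path)  # only the result is shown; the chatter is held in out_text and err_text
```

Both streams are redirected because progress output is commonly written to stderr rather than stdout.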

@mturk24
Contributor Author

mturk24 commented Jan 7, 2026

Fixes are in this PR: #1274

@codecov

codecov bot commented Jan 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.91%. Comparing base (98b3ef0) to head (cbd0e7f).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1271      +/-   ##
==========================================
- Coverage   94.97%   94.91%   -0.07%     
==========================================
  Files          81       81              
  Lines        5913     5916       +3     
  Branches      791      791              
==========================================
- Hits         5616     5615       -1     
- Misses        217      219       +2     
- Partials       80       82       +2     
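
The figures in this table can be cross-checked with a few lines of arithmetic. The sketch below assumes coverage is computed as hits divided by total lines, with total lines equal to hits plus misses plus partials; that assumption matches the numbers reported here, but how Codecov rounds the displayed values is not stated in the report.

```python
def coverage_percent(hits, misses, partials):
    """Return (total_lines, coverage in percent), assuming coverage = hits / lines."""
    lines = hits + misses + partials
    return lines, 100.0 * hits / lines


base_lines, base_cov = coverage_percent(hits=5616, misses=217, partials=80)  # master
head_lines, head_cov = coverage_percent(hits=5615, misses=219, partials=82)  # this PR

print(base_lines, head_lines)
print(f"{base_cov:.3f} {head_cov:.3f} {head_cov - base_cov:+.2f}")
```

The line totals reproduce the 5913 and 5916 in the table, and the unrounded drop of roughly 0.065 percentage points is consistent with the displayed -0.07.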


@ulya-tkch
Collaborator

Thank you for all this work! Everything was very helpful and has been committed in this PR :D
#1278

Super appreciative!

@ulya-tkch ulya-tkch closed this Jan 8, 2026
4 participants