
fix(deps): update dependency datasets to v4 #13502


Merged: 1 commit merged into GoogleCloudPlatform:main on Jul 10, 2025

Conversation

renovate-bot (Contributor)

This PR contains the following updates:

| Package | Change |
|---|---|
| datasets | `==3.0.1` -> `==4.0.0` |

Release Notes

huggingface/datasets (datasets)

v4.0.0

Compare Source

New Features

Build streaming data pipelines in a few lines of code!

```python
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
```


* Add `num_proc=` to `.push_to_hub()` (Dataset and IterableDataset) by @lhoestq in https://github.com/huggingface/datasets/pull/7606

  ```python
  # Faster push to Hub! Available for both Dataset and IterableDataset
  ds.push_to_hub(..., num_proc=8)
  ```

* New column syntax: `ds["column_name"]` returns a `datasets.Column([...])` or `datasets.IterableColumn(...)`.

  Iterate on a column:

  ```python
  for text in ds["text"]:
      ...
  ```

  Load one cell without bringing the full column in memory:

  ```python
  first_text = ds["text"][0]  # equivalent to ds[0]["text"]
  ```

* Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616

  * Enables streaming only the ranges you need!

    ```python
    # Don't download full audios/videos when it's not necessary
    # Now with torchcodec it only streams the required ranges/frames:
    from datasets import load_dataset

    ds = load_dataset(..., streaming=True)
    for example in ds:
        video = example["video"]
        frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
    ```

  * Requires torch>=2.7.0 and FFmpeg >= 4
  * Not available for Windows yet but it is coming soon; in the meantime please use datasets<4.0
  * Load audio data with AudioDecoder:

    ```python
    audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
    samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
    samples.data  # tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.3447e-06, -1.9127e-04, -5.3330e-05]])
    samples.sample_rate  # 16000

    # old syntax is still supported
    array, sr = audio["array"], audio["sampling_rate"]
    ```

  * Load video data with VideoDecoder:

    ```python
    video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
    first_frame = video.get_frame_at(0)
    first_frame.data.shape  # (3, 240, 320)
    first_frame.pts_seconds  # 0.0
    frames = video.get_frames_in_range(0, 6, 1)
    frames.data.shape  # torch.Size([5, 3, 240, 320])
    ```

Breaking changes

  • Remove scripts altogether by @​lhoestq in https://github.com/huggingface/datasets/pull/7592

    • trust_remote_code is no longer supported
  • Torchcodec decoding by @​TyTodd in https://github.com/huggingface/datasets/pull/7616

    • torchcodec replaces soundfile for audio decoding
    • torchcodec replaces decord for video decoding
  • Replace Sequence by List by @lhoestq in https://github.com/huggingface/datasets/pull/7634

    • Introduction of the List type:

      ```python
      from datasets import Features, List, Value

      features = Features({
          "texts": List(Value("string")),
          "four_paragraphs": List(Value("string"), length=4)
      })
      ```
    • Sequence was a legacy type from TensorFlow Datasets that converted lists of dicts to dicts of lists. It is no longer a type; it is now a utility that returns a List or a dict depending on the subfeature:

      ```python
      from datasets import Sequence

      Sequence(Value("string"))  # List(Value("string"))
      Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
      ```
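To make the legacy conversion concrete, here is a plain-Python sketch (illustrative only, not the `datasets` implementation) of the list-of-dicts to dict-of-lists behavior that `Sequence` used to apply:

```python
def to_dict_of_lists(rows):
    """Convert [{'texts': 'x'}, {'texts': 'y'}] -> {'texts': ['x', 'y']},
    mimicking the legacy Sequence-of-dicts behavior."""
    return {key: [row[key] for row in rows] for key in rows[0]}

converted = to_dict_of_lists([{"texts": "x"}, {"texts": "y"}])
print(converted)  # {'texts': ['x', 'y']}
```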

Other improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@3.6.0...4.0.0

v3.6.0

Compare Source

Dataset Features

Other improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@3.5.1...3.6.0

v3.5.1

Compare Source

Bug fixes

Other improvements

New Contributors

Full Changelog: huggingface/datasets@3.5.0...3.5.1

v3.5.0

Compare Source

Datasets Features

```python
>>> from datasets import load_dataset, Pdf
>>> repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
>>> dataset = load_dataset(repo, split="train")
>>> dataset[0]["pdf"]
<pdfplumber.pdf.PDF at 0x1075bc320>
>>> dataset[0]["pdf"].pages[0].extract_text()
...
```

What's Changed

New Contributors

Full Changelog: huggingface/datasets@3.4.1...3.5.0

v3.4.1

Compare Source

Bug Fixes

Full Changelog: huggingface/datasets@3.4.0...3.4.1

v3.4.0

Compare Source

Dataset Features

  • Faster folder based builder + parquet support + allow repeated media + use torchvideo by @lhoestq in https://github.com/huggingface/datasets/pull/7424

    • /!\ Breaking change: we replaced decord with torchvision to read videos, since decord is no longer maintained and isn't available for recent Python versions; see the video dataset loading documentation for more details. The Video type is still marked as experimental in this version.

      ```python
      from datasets import load_dataset, Video

      dataset = load_dataset("path/to/video/folder", split="train")
      dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
      ```
    • faster streaming for image/audio/video folders from Hugging Face
    • support for metadata.parquet in addition to metadata.csv or metadata.jsonl for the metadata of the image/audio/video files
  • Add IterableDataset.decode with multithreading by @lhoestq in https://github.com/huggingface/datasets/pull/7450

    • even faster streaming for image/audio/video folders from Hugging Face if you enable multithreading to decode image/audio/video data:

      ```python
      dataset = dataset.decode(num_threads=num_threads)
      ```
  • Add with_split to DatasetDict.map by @jp1924 in https://github.com/huggingface/datasets/pull/7368
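Why the multithreaded decoding above helps: media decoding is largely I/O-bound, so overlapping work across threads speeds it up. A rough stand-alone sketch of that pattern (`decode_one` is a hypothetical placeholder, not the `datasets` API):

```python
from concurrent.futures import ThreadPoolExecutor

def decode_one(path: str) -> str:
    # Hypothetical placeholder for I/O-bound media decoding.
    return f"decoded:{path}"

paths = [f"img_{i}.jpg" for i in range(4)]
with ThreadPoolExecutor(max_workers=2) as pool:
    # Threads overlap the I/O waits, like decode(num_threads=...) does.
    decoded = list(pool.map(decode_one, paths))
print(decoded)
```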

General improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@3.3.2...3.4.0

v3.3.2

Compare Source

Bug fixes

Other general improvements

New Contributors

Full Changelog: huggingface/datasets@3.3.1...3.3.2

v3.3.1

Compare Source

Bug fixes

Full Changelog: huggingface/datasets@3.3.0...3.3.1

v3.3.0

Compare Source

Dataset Features

  • Support async functions in map() by @lhoestq in https://github.com/huggingface/datasets/pull/7384

    • Especially useful to download content like images or call inference APIs:

      ```python
      prompt = "Answer the following question: {question}. You should think step by step."

      async def ask_llm(example):
          return await query_model(prompt.format(question=example["question"]))

      ds = ds.map(ask_llm)
      ```
  • Add repeat method to datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7198

      ```python
      ds = ds.repeat(10)
      ```
  • Support faster processing using pandas or polars functions in IterableDataset.map() by @lhoestq in https://github.com/huggingface/datasets/pull/7370

    • Add support for "pandas" and "polars" formats in IterableDatasets
    • This enables optimized data processing using pandas or polars functions with zero-copy, e.g.

      ```python
      import polars as pl

      ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
      ds = ds.with_format("polars")
      expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
      ds = ds.map(lambda df: df.with_columns(expr), batched=True)
      ```
  • Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7207

    • IterableDatasets with "numpy" format are now much faster
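The async `map()` support above matters because calls like `query_model` are I/O-bound. A plain-asyncio sketch (with a hypothetical stand-in for the inference API) shows how awaiting lets many requests overlap:

```python
import asyncio

async def query_model(prompt: str) -> str:
    # Hypothetical stand-in for a network call to an inference API.
    await asyncio.sleep(0.01)  # simulate request latency
    return f"answer to: {prompt}"

async def main():
    # The three "requests" run concurrently instead of one after another,
    # which is the speedup async map() exploits.
    return await asyncio.gather(*(query_model(q) for q in ["q1", "q2", "q3"]))

answers = asyncio.run(main())
print(answers)  # ['answer to: q1', 'answer to: q2', 'answer to: q3']
```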

What's Changed

New Contributors

Full Changelog: huggingface/datasets@3.2.0...3.3.0

v3.2.0

Compare Source

Dataset Features

  • Faster parquet streaming + filters with predicate pushdown by @lhoestq in https://github.com/huggingface/datasets/pull/7309

    • Up to +100% streaming speed
    • Fast filtering via predicate pushdown (skip files/row groups based on predicate instead of downloading the full data), e.g.

      ```python
      from datasets import load_dataset

      filters = [('date', '>=', '2023')]
      ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
      ```
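The `filters=` argument takes pyarrow-style predicate tuples. A plain-Python sketch of what a predicate like `[('date', '>=', '2023')]` means when applied row by row (illustrative only; the real implementation skips whole files and row groups rather than checking each row):

```python
import operator

OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def matches(row, filters):
    # A row passes when every (column, op, value) predicate holds.
    return all(OPS[op](row[col], val) for col, op, val in filters)

rows = [{"date": "2022-12"}, {"date": "2023-05"}]
kept = [r for r in rows if matches(r, [("date", ">=", "2023")])]
print(kept)  # [{'date': '2023-05'}]
```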

Other improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@3.1.0...3.2.0

v3.1.0

Compare Source

Dataset Features

What's Changed

New Contributors

Full Changelog: huggingface/datasets@3.0.2...3.1.0

v3.0.2

Compare Source

Main bug fixes

What's Changed

New Contributors

Full Changelog: huggingface/datasets@3.0.1...3.0.2


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Never, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

@renovate-bot renovate-bot requested review from a team as code owners July 9, 2025 15:13
@trusted-contributions-gcf trusted-contributions-gcf bot added kokoro:force-run Add this label to force Kokoro to re-run the tests. owlbot:run Add this label to trigger the Owlbot post processor. labels Jul 9, 2025
@product-auto-label product-auto-label bot added samples Issues that are directly related to samples. api: people-and-planet-ai labels Jul 9, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @renovate-bot, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request aims to upgrade the datasets library to its latest major version, 4.0.0. This update integrates significant new features and performance enhancements, particularly around data streaming and media handling, while also incorporating several breaking changes that may require downstream adjustments.

Highlights

  • Dependency Update: This pull request updates the datasets library dependency from version 3.0.1 to 4.0.0 in pyproject.toml.
  • New Features in datasets v4.0.0: The updated datasets library introduces IterableDataset.push_to_hub() for faster uploads, a new Column object for efficient column iteration and lazy access, and torchcodec for optimized audio/video streaming with range-based decoding.
  • Breaking Changes in datasets v4.0.0: Key breaking changes include the removal of scripts (and trust_remote_code), the replacement of soundfile and decord with torchcodec for media decoding, and the introduction of the List type as a replacement for the legacy Sequence type.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@kokoro-team kokoro-team removed kokoro:force-run Add this label to force Kokoro to re-run the tests. labels Jul 9, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request updates the datasets dependency from version 3.0.1 to 4.0.0 in the pyproject.toml file. It's crucial to verify the compatibility of the codebase with the new version and address any potential breaking changes.

```diff
@@ -17,7 +17,7 @@
 name = "weather-model"
 version = "1.0.0"
 dependencies = [
-    "datasets==3.0.1",
+    "datasets==4.0.0",
```

Severity: high

Updating the datasets dependency to version 4.0.0. Ensure that all functionalities and APIs used from the datasets library are compatible with this new version. Review the release notes to identify any breaking changes or deprecations that may affect the code. If there are breaking changes, make sure to update the code accordingly.

```toml
    "datasets==4.0.0", # Ensure compatibility with all functionalities used
```

@glasnt glasnt merged commit 538cd37 into GoogleCloudPlatform:main Jul 10, 2025
11 checks passed
@renovate-bot renovate-bot deleted the renovate/datasets-4.x branch July 10, 2025 05:50