fix(deps): update dependency datasets to v4 #13502
Summary of Changes
Hello @renovate-bot, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request aims to upgrade the `datasets` library to its latest major version, `4.0.0`. This update integrates significant new features and performance enhancements, particularly around data streaming and media handling, while also incorporating several breaking changes that may require downstream adjustments.
Highlights
- Dependency Update: This pull request updates the `datasets` library dependency from version `3.0.1` to `4.0.0` in `pyproject.toml`.
- New Features in `datasets` v4.0.0: The updated `datasets` library introduces `IterableDataset.push_to_hub()` for faster uploads, a new `Column` object for efficient column iteration and lazy access, and `torchcodec` for optimized audio/video streaming with range-based decoding.
- Breaking Changes in `datasets` v4.0.0: Key breaking changes include the removal of scripts (and `trust_remote_code`), the replacement of `soundfile` and `decord` with `torchcodec` for media decoding, and the introduction of the `List` type as a replacement for the legacy `Sequence` type.
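The release notes quoted further down in this PR describe the legacy `Sequence` type as converting a list of dicts into a dict of lists. As a rough illustration of that reshaping, here is a plain-Python sketch. It does not call the `datasets` library, and the function name and sample data are invented for the example.

```python
# Pure-Python illustration only; it does not import or call the `datasets` library.

def list_of_dicts_to_dict_of_lists(rows):
    """Transpose a list of dicts into a dict of lists (the legacy Sequence layout)."""
    if not rows:
        return {}
    keys = rows[0].keys()
    return {key: [row[key] for row in rows] for key in keys}


# Hypothetical nested field: a list of answers, each with a text and a start offset.
answers = [
    {"text": "Paris", "start": 12},
    {"text": "Lyon", "start": 40},
]

legacy_layout = list_of_dicts_to_dict_of_lists(answers)
print(legacy_layout)
```

Code that relied on the transposed layout is worth re-checking after the upgrade, since the notes say `Sequence` is no longer a type of its own.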
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
The pull request updates the `datasets` dependency from version 3.0.1 to 4.0.0 in the `pyproject.toml` file. It's crucial to verify the compatibility of the codebase with the new version and address any potential breaking changes.
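One way to start that compatibility check is to scan the project's Python sources for call sites that pass `trust_remote_code`, which the v4.0.0 notes list as no longer supported. The sketch below uses only the standard-library `ast` module. The helper name, the `FLAGGED_KEYWORDS` set, and the sample snippet are assumptions made for illustration, not part of this repository.

```python
import ast

# Keyword arguments to flag. `trust_remote_code` is named in the v4.0.0
# breaking changes; extend this set as needed.
FLAGGED_KEYWORDS = {"trust_remote_code"}


def find_flagged_keywords(source, filename="<string>"):
    """Return sorted (line, keyword) pairs for calls that pass a flagged keyword."""
    tree = ast.parse(source, filename=filename)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            for kw in node.keywords:
                # kw.arg is None for **kwargs, which never matches the set.
                if kw.arg in FLAGGED_KEYWORDS:
                    hits.append((node.lineno, kw.arg))
    return sorted(hits)


# Hypothetical snippet standing in for project code; it is parsed, never executed.
sample = '''
from datasets import load_dataset

ds = load_dataset("some/dataset", trust_remote_code=True)
other = load_dataset("another/dataset")
'''

print(find_flagged_keywords(sample))
```

The scan only catches keywords written literally at the call site; arguments forwarded through `**kwargs` or set in configuration files would need a separate check.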
@@ -17,7 +17,7 @@
 name = "weather-model"
 version = "1.0.0"
 dependencies = [
-    "datasets==3.0.1",
+    "datasets==4.0.0",
This line updates the `datasets` dependency to version 4.0.0. Ensure that every `datasets` API used in the code is compatible with the new version. Review the release notes for breaking changes or deprecations that affect the code, and update the code accordingly.
"datasets==4.0.0", # Ensure compatibility with all functionalities used
This PR contains the following updates:
datasets: `==3.0.1` -> `==4.0.0`
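Because the first version component changes, this counts as a major bump, the kind that usually carries breaking changes. A minimal sketch of that check follows. It assumes exact `==` pins with purely numeric components, and the helper names are invented for illustration.

```python
def parse_pin(spec):
    """Parse an exact pin such as '==3.0.1' into a tuple of ints."""
    if not spec.startswith("=="):
        raise ValueError(f"expected an exact '==' pin, got {spec!r}")
    return tuple(int(part) for part in spec[2:].split("."))


def is_major_bump(old_spec, new_spec):
    """True when the first version component increases."""
    return parse_pin(new_spec)[0] > parse_pin(old_spec)[0]


print(is_major_bump("==3.0.1", "==4.0.0"))  # True
print(is_major_bump("==3.0.1", "==3.0.2"))  # False
```

Pre-release tags or range specifiers are outside what this sketch handles; a dedicated version-parsing library would be the safer choice for those.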
Release Notes
huggingface/datasets (datasets)
v4.0.0
Compare Source
New Features
Add `IterableDataset.push_to_hub()` by @lhoestq in https://github.com/huggingface/datasets/pull/7595
Build streaming data pipelines in a few lines of code!
from datasets import load_dataset
ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
New `Column` object
Syntax:
ds["column_name"] # datasets.Column([...]) or datasets.IterableColumn(...)
Iterate on a column:
for text in ds["text"]:
    ...
Load one cell without bringing the full column in memory:
first_text = ds["text"][0] # equivalent to ds[0]["text"]
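To make the equivalence between `ds["text"][0]` and `ds[0]["text"]` concrete, here is a toy model of lazy column access in plain Python. `ToyDataset` and `ToyColumn` are invented for this sketch and are not the library's classes. They only mimic the access pattern described above.

```python
class ToyColumn:
    """Toy stand-in for a lazily accessed column; not the library's Column class."""

    def __init__(self, rows, name):
        self._rows = rows
        self._name = name

    def __getitem__(self, index):
        # Reads one cell from one row; no full-column list is built.
        return self._rows[index][self._name]

    def __iter__(self):
        # Yields cells one at a time.
        return (row[self._name] for row in self._rows)

    def __len__(self):
        return len(self._rows)


class ToyDataset:
    """Row-oriented toy dataset supporting ds[int] and ds[str]."""

    def __init__(self, rows):
        self._rows = rows

    def __getitem__(self, key):
        if isinstance(key, str):
            return ToyColumn(self._rows, key)
        return self._rows[key]


ds = ToyDataset([{"text": "hello", "label": 0}, {"text": "world", "label": 1}])
first_text = ds["text"][0]
print(first_text == ds[0]["text"])
print(list(ds["text"]))
```

The point of the design is that indexing by column name returns a lightweight view, so a single cell can be read without materializing every value in the column.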
`torch>=2.7.0` and FFmpeg >= 4
`datasets<4.0`
`AudioDecoder`:
`VideoDecoder`:
Breaking changes
Remove scripts altogether by @lhoestq in https://github.com/huggingface/datasets/pull/7592
`trust_remote_code` is no longer supported
Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
Replace Sequence by List by @lhoestq in https://github.com/huggingface/datasets/pull/7634
`List` type
`Sequence` was a legacy type from TensorFlow Datasets which converted a list of dicts to a dict of lists. It is no longer a type; it is now a utility that returns a `List` or a `dict` depending on the subfeature
Other improvements and bug fixes
`Dataset.map` to reuse cache files mapped with different `num_proc` by @ringohoffman in https://github.com/huggingface/datasets/pull/7434
`RepeatExamplesIterable` by @SilvanCodes in https://github.com/huggingface/datasets/pull/7581
`_dill.py` to use `co_linetable` for Python 3.10+ in place of `co_lnotab` by @qgallouedec in https://github.com/huggingface/datasets/pull/7609
New Contributors
Full Changelog: huggingface/datasets@3.6.0...4.0.0
v3.6.0
Compare Source
Dataset Features
Other improvements and bug fixes
`aiohttp` from direct dependencies by @akx in https://github.com/huggingface/datasets/pull/7294
New Contributors
Full Changelog: huggingface/datasets@3.5.1...3.6.0
v3.5.1
Compare Source
Bug fixes
TypeError: ArrayExtensionArray.to_pylist() got an unexpected keyword argument 'maps_as_pydicts'
Other improvements
New Contributors
Full Changelog: huggingface/datasets@3.5.0...3.5.1
v3.5.0
Compare Source
Datasets Features
What's Changed
New Contributors
Full Changelog: huggingface/datasets@3.4.1...3.5.0
v3.4.1
Compare Source
Bug Fixes
Full Changelog: huggingface/datasets@3.4.0...3.4.1
v3.4.0
Compare Source
Dataset Features
Faster folder based builder + parquet support + allow repeated media + use torchvideo by @lhoestq in https://github.com/huggingface/datasets/pull/7424
`decord` with `torchvision` to read videos, since `decord` is not maintained anymore and isn't available for recent Python versions; see the video dataset loading documentation here for more details. The `Video` type is still marked as experimental in this version
`metadata.parquet` in addition to `metadata.csv` or `metadata.jsonl` for the metadata of the image/audio/video files
Add IterableDataset.decode with multithreading by @lhoestq in https://github.com/huggingface/datasets/pull/7450
Add with_split to DatasetDict.map by @jp1924 in https://github.com/huggingface/datasets/pull/7368
General improvements and bug fixes
`string_to_dict` to return `None` if there is no match instead of raising `ValueError` by @ringohoffman in https://github.com/huggingface/datasets/pull/7435
`ds.set_epoch(new_epoch)` by @lhoestq in https://github.com/huggingface/datasets/pull/7451
New Contributors
Full Changelog: huggingface/datasets@3.3.2...3.4.0
v3.3.2
Compare Source
Bug fixes
Other general improvements
New Contributors
Full Changelog: huggingface/datasets@3.3.1...3.3.2
v3.3.1
Compare Source
Bug fixes
Full Changelog: huggingface/datasets@3.3.0...3.3.1
v3.3.0
Compare Source
Dataset Features
Support async functions in map() by @lhoestq in https://github.com/huggingface/datasets/pull/7384
Add repeat method to datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7198
Support faster processing using pandas or polars functions in `IterableDataset.map()` by @lhoestq in https://github.com/huggingface/datasets/pull/7370
Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7207
What's Changed
New Contributors
Full Changelog: huggingface/datasets@3.2.0...3.3.0
v3.2.0
Compare Source
Dataset Features
Other improvements and bug fixes
`ClassLabel` by @sergiopaniego in https://github.com/huggingface/datasets/pull/7293
New Contributors
Full Changelog: huggingface/datasets@3.1.0...3.2.0
v3.1.0
Compare Source
Dataset Features
What's Changed
New Contributors
Full Changelog: huggingface/datasets@3.0.2...3.1.0
v3.0.2
Compare Source
Main bug fixes
What's Changed
New Contributors
Full Changelog: huggingface/datasets@3.0.1...3.0.2
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Never, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.