Releases · huggingface/datasets
3.5.0
Dataset Features
- Introduce PDF support (#7318) by @yabramuvdi in #7325
>>> from datasets import load_dataset, Pdf
>>> repo = "path/to/pdf/folder" # or username/dataset_name on Hugging Face
>>> dataset = load_dataset(repo, split="train")
>>> dataset[0]["pdf"]
<pdfplumber.pdf.PDF at 0x1075bc320>
>>> dataset[0]["pdf"].pages[0].extract_text()
...
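Each example's "pdf" column is decoded into a pdfplumber PDF object, so you can for instance pull the text of every page with map(). A minimal sketch under the same assumptions as the snippet above (the folder path is a placeholder and the extract_all_text helper is hypothetical, not part of the release):
from datasets import load_dataset

dataset = load_dataset("path/to/pdf/folder", split="train")

def extract_all_text(example):
    # join the text of every page of the decoded pdfplumber PDF object
    example["text"] = "\n".join(page.extract_text() or "" for page in example["pdf"].pages)
    return example

dataset = dataset.map(extract_all_text)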
What's Changed
- Fix local pdf loading by @lhoestq in #7466
- Minor fix for metadata files in extension counter by @lhoestq in #7464
- Prioritize json by @lhoestq in #7476
New Contributors
- @yabramuvdi made their first contribution in #7325
Full Changelog: 3.4.1...3.5.0
3.4.1
3.4.0
Dataset Features
- Faster folder based builder + parquet support + allow repeated media + use torchvision by @lhoestq in #7424
- /!\ Breaking change: we replaced decord with torchvision to read videos, since decord is not maintained anymore and isn't available for recent Python versions, see the video dataset loading documentation here for more details. The Video type is still marked as experimental in this version
from datasets import load_dataset, Video
dataset = load_dataset("path/to/video/folder", split="train")
dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
- faster streaming for image/audio/video folder from Hugging Face
- support for metadata.parquet in addition to metadata.csv or metadata.jsonl for the metadata of the image/audio/video files
- Add IterableDataset.decode with multithreading by @lhoestq in #7450
- even faster streaming for image/audio/video folder from Hugging Face if you enable multithreading to decode image/audio/video data:
dataset = dataset.decode(num_threads=num_threads)
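For example, with a streaming image folder you might enable multithreaded decoding like this (a minimal sketch; the folder path and thread count are placeholders, and the decoded column name assumes an image dataset):
from datasets import load_dataset

dataset = load_dataset("path/to/image/folder", split="train", streaming=True)
dataset = dataset.decode(num_threads=8)  # decode media in 8 background threads while streaming
for example in dataset:
    image = example["image"]  # already decoded (assumes an image folder dataset)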
General improvements and bug fixes
- fix: None default with bool type on load creates typing error by @stephantul in #7426
- Use pyupgrade --py39-plus by @cyyever in #7428
- Refactor string_to_dict to return None if there is no match instead of raising ValueError by @ringohoffman in #7435
- Fix small bugs with async map by @lhoestq in #7445
- Fix resuming after ds.set_epoch(new_epoch) by @lhoestq in #7451
- minor docs changes by @lhoestq in #7452
New Contributors
- @stephantul made their first contribution in #7426
- @cyyever made their first contribution in #7428
- @jp1924 made their first contribution in #7368
Full Changelog: 3.3.2...3.4.0
3.3.2
Bug fixes
- Attempt to fix multiprocessing hang by closing and joining the pool before termination by @dakinggg in #7411
- Gracefully cancel async tasks by @lhoestq in #7414
Other general improvements
- Update use_with_pandas.mdx: to_pandas() correction in last section by @ibarrien in #7407
- Fix a typo in arrow_dataset.py by @jingedawang in #7402
New Contributors
- @dakinggg made their first contribution in #7411
- @ibarrien made their first contribution in #7407
- @jingedawang made their first contribution in #7402
Full Changelog: 3.3.1...3.3.2
3.3.1
3.3.0
Dataset Features
- Support async functions in map() by @lhoestq in #7384
- Especially useful to download content like images or call inference APIs (a sketch of one possible query_model helper follows this feature list)
prompt = "Answer the following question: {question}. You should think step by step."
async def ask_llm(example):
    return await query_model(prompt.format(question=example["question"]))
ds = ds.map(ask_llm)
- Add repeat method to datasets by @alex-hh in #7198
ds = ds.repeat(10)
- Support faster processing using pandas or polars functions in IterableDataset.map() by @lhoestq in #7370
- Add support for "pandas" and "polars" formats in IterableDatasets
- This enables optimized data processing using pandas or polars functions with zero-copy, e.g.
import polars as pl
from datasets import load_dataset

ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
ds = ds.with_format("polars")
expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
ds = ds.map(lambda df: df.with_columns(expr), batched=True)
- Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in #7207
- IterableDatasets with "numpy" format are now much faster
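The async map() example above assumes an async query_model helper. One possible implementation, sketched here with huggingface_hub's AsyncInferenceClient (the client setup, model choice and max_tokens are assumptions, not part of the release):
from huggingface_hub import AsyncInferenceClient

client = AsyncInferenceClient()  # optionally pass model="..." or an inference provider

async def query_model(prompt):
    # send the prompt as a single-turn chat completion and return the generated text
    response = await client.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return response.choices[0].message.content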
What's Changed
- don't import soundfile in tests by @lhoestq in #7340
- minor video docs on how to install by @lhoestq in #7341
- Fix typo in arrow_dataset by @AndreaFrancis in #7328
- remove filecheck to enable symlinks by @fschlatt in #7133
- Webdataset special columns in last position by @lhoestq in #7349
- Bump hfh to 0.24 to fix ci by @lhoestq in #7350
- fsspec 2024.12.0 by @lhoestq in #7352
- changes to MappedExamplesIterable to resolve #7345 by @vttrifonov in #7353
- Catch OSError for arrow by @lhoestq in #7348
- Remove .h5 from imagefolder extensions by @lhoestq in #7374
- Add Pandas, PyArrow and Polars docs by @lhoestq in #7382
- Optimized sequence encoding for scalars by @lukasgd in #7393
- Update docs by @lhoestq in #7395
- Update README.md by @lhoestq in #7396
- Release: 3.3.0 by @lhoestq in #7398
New Contributors
- @AndreaFrancis made their first contribution in #7328
- @vttrifonov made their first contribution in #7353
- @lukasgd made their first contribution in #7393
Full Changelog: 3.2.0...3.3.0
3.2.0
Dataset Features
- Faster parquet streaming + filters with predicate pushdown by @lhoestq in #7309
- Up to +100% streaming speed
- Fast filtering via predicate pushdown (skip files/row groups based on predicate instead of downloading the full data), e.g.
from datasets import load_dataset

filters = [('date', '>=', '2023')]
ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
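Filters use the pyarrow-style (column, op, value) tuple syntax, and a flat list of tuples is ANDed together (as in pyarrow), so you can narrow the selection further, e.g. (a sketch reusing the same dataset; the date bounds are just an illustration):
from datasets import load_dataset

# keep only rows whose date string falls within 2023 (both conditions must hold)
filters = [('date', '>=', '2023'), ('date', '<', '2024')]
ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)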
Other improvements and bug fixes
- fix conda release workflow by @lhoestq in #7272
- Add link to video dataset by @NielsRogge in #7277
- Raise error for incorrect JSON serialization by @varadhbhatnagar in #7273
- support for custom feature encoding/decoding by @alex-hh in #7284
- update load_dataset doctring by @lhoestq in #7301
- Let server decide default repo visibility by @Wauplin in #7302
- fix: update elasticsearch version by @ruidazeng in #7300
- Fix typing in iterable_dataset.py by @lhoestq in #7304
- Updated inconsistent output in documentation examples for ClassLabel by @sergiopaniego in #7293
- More docs to from_dict to mention that the result lives in RAM by @lhoestq in #7316
- Release: 3.2.0 by @lhoestq in #7317
New Contributors
- @ruidazeng made their first contribution in #7300
- @sergiopaniego made their first contribution in #7293
Full Changelog: 3.1.0...3.2.0
3.1.0
Dataset Features
- Video support by @lhoestq in #7230
>>> from datasets import Dataset, Video, load_dataset
>>> ds = Dataset.from_dict({"video": ["path/to/Screen Recording.mov"]}).cast_column("video", Video())
>>> # or from the hub
>>> ds = load_dataset("username/dataset_name", split="train")
>>> ds[0]["video"]
<decord.video_reader.VideoReader at 0x105525c70>
- Add IterableDataset.shard() by @lhoestq in #7252
>>> from datasets import load_dataset
>>> full_ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)
>>> full_ds.num_shards
2360
>>> ds = full_ds.shard(num_shards=full_ds.num_shards, index=0)
>>> ds.num_shards
1
>>> ds = full_ds.shard(num_shards=8, index=0)
>>> ds.num_shards
295
- Basic XML support by @lhoestq in #7250
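A typical use of IterableDataset.shard() above is giving each process or worker its own subset of shards, e.g. (a minimal sketch; the rank and world size are placeholders from your own setup):
from datasets import load_dataset

world_size, rank = 8, 0  # placeholders, e.g. from your distributed training setup
full_ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)
ds = full_ds.shard(num_shards=world_size, index=rank)  # this process streams roughly 1/8 of the shards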
What's Changed
- (Super tiny doc update) Mention to_polars by @fzyzcjy in #7232
- [MINOR:TYPO] Update arrow_dataset.py by @cakiki in #7236
- Missing video docs by @lhoestq in #7251
- fix decord import by @lhoestq in #7255
- fix ci for pyarrow 18 by @lhoestq in #7257
- Retry all requests timeouts by @lhoestq in #7256
- Always set non-null writer batch size by @lhoestq in #7258
- Don't embed videos by @lhoestq in #7259
- Allow video with disabled decoding without decord by @lhoestq in #7262
- Small addition to video docs by @lhoestq in #7263
- fix docs relative links by @lhoestq in #7264
- Disallow video push_to_hub by @lhoestq in #7265
Full Changelog: 3.0.2...3.1.0
3.0.2
Main bug fixes
- fix unbatched arrow map for iterable datasets by @alex-hh in #7204
- Support features in metadata configs by @albertvillanova in #7182
- Preserve features in iterable dataset.filter by @alex-hh in #7209
- Pin dill<0.3.9 to fix CI by @albertvillanova in #7184
- this should also fix cache issues
What's Changed
- Fix release instructions by @albertvillanova in #7177
- Pin multiprocess<0.70.1 to align with dill<0.3.9 by @albertvillanova in #7188
- with_format docstring by @lhoestq in #7203
- fix ci benchmark by @lhoestq in #7205
- Fix the environment variable for huggingface cache by @torotoki in #7200
- Support Python 3.11 by @albertvillanova in #7179
- bump fsspec by @lhoestq in #7219
- Fix typo in image dataset docs by @albertvillanova in #7231
- No need for dataset_info by @lhoestq in #7234
- use huggingface_hub offline mode by @lhoestq in #7244
Full Changelog: 3.0.1...3.0.2
3.0.1
What's Changed
- Modify add_column() to optionally accept a FeatureType as param by @varadhbhatnagar in #7143
- Align filename prefix splitting with WebDataset library by @albertvillanova in #7151
- Support ndjson data files by @albertvillanova in #7154
- Support JSON lines with missing struct fields by @albertvillanova in #7160
- Support JSON lines with empty struct by @albertvillanova in #7162
- fix increase_load_count by @lhoestq in #7165
- fix docstring code example for distributed shuffle by @lhoestq in #7166
- Support JSON lines with missing columns by @albertvillanova in #7170
- Add torchdata as a regular test dependency by @albertvillanova in #7172
New Contributors
- @varadhbhatnagar made their first contribution in #7143
Full Changelog: 3.0.0...3.0.1