Releases · huggingface/datasets
3.5.0
Dataset Features
- Introduce PDF support (#7318) by @yabramuvdi in #7325
>>> from datasets import load_dataset, Pdf
>>> repo = "path/to/pdf/folder" # or username/dataset_name on Hugging Face
>>> dataset = load_dataset(repo, split="train")
>>> dataset[0]["pdf"]
<pdfplumber.pdf.PDF at 0x1075bc320>
>>> dataset[0]["pdf"].pages[0].extract_text()
...
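Each example's "pdf" column is decoded into a pdfplumber PDF object, so you can for instance pull the text of every page with map(). A minimal sketch under the same assumptions as the snippet above (the folder path is a placeholder and the extract_all_text helper is hypothetical, not part of the release):
from datasets import load_dataset

dataset = load_dataset("path/to/pdf/folder", split="train")

def extract_all_text(example):
    # join the text of every page of the decoded pdfplumber PDF object
    example["text"] = "\n".join(page.extract_text() or "" for page in example["pdf"].pages)
    return example

dataset = dataset.map(extract_all_text)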
What's Changed
- Fix local pdf loading by @lhoestq in #7466
- Minor fix for metadata files in extension counter by @lhoestq in #7464
- Prioritize json by @lhoestq in #7476
New Contributors
- @yabramuvdi made their first contribution in #7325
Full Changelog: 3.4.1...3.5.0
3.4.1
3.4.0
Dataset Features
- Faster folder based builder + parquet support + allow repeated media + use torchvision by @lhoestq in #7424
- /!\ Breaking change: we replaced decord with torchvision to read videos, since decord is not maintained anymore and isn't available for recent Python versions, see the video dataset loading documentation here for more details. The Video type is still marked as experimental in this version
from datasets import load_dataset, Video
dataset = load_dataset("path/to/video/folder", split="train")
dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
- faster streaming for image/audio/video folder from Hugging Face
- support for metadata.parquet in addition to metadata.csv or metadata.jsonl for the metadata of the image/audio/video files
- Add IterableDataset.decode with multithreading by @lhoestq in #7450
- even faster streaming for image/audio/video folder from Hugging Face if you enable multithreading to decode image/audio/video data:
dataset = dataset.decode(num_threads=num_threads)
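For example, with a streaming image folder you might enable multithreaded decoding like this (a minimal sketch; the folder path and thread count are placeholders, and the decoded column name assumes an image dataset):
from datasets import load_dataset

dataset = load_dataset("path/to/image/folder", split="train", streaming=True)
dataset = dataset.decode(num_threads=8)  # decode media in 8 background threads while streaming
for example in dataset:
    image = example["image"]  # already decoded (assumes an image folder dataset)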
General improvements and bug fixes
- fix: None default with bool type on load creates typing error by @stephantul in #7426
- Use pyupgrade --py39-plus by @cyyever in #7428
- Refactor string_to_dict to return None if there is no match instead of raising ValueError by @ringohoffman in #7435
- Fix small bugs with async map by @lhoestq in #7445
- Fix resuming after ds.set_epoch(new_epoch) by @lhoestq in #7451
- minor docs changes by @lhoestq in #7452
New Contributors
- @stephantul made their first contribution in #7426
- @cyyever made their first contribution in #7428
- @jp1924 made their first contribution in #7368
Full Changelog: 3.3.2...3.4.0
3.3.2
Bug fixes
- Attempt to fix multiprocessing hang by closing and joining the pool before termination by @dakinggg in #7411
- Gracefully cancel async tasks by @lhoestq in #7414
Other general improvements
- Update use_with_pandas.mdx: to_pandas() correction in last section by @ibarrien in #7407
- Fix a typo in arrow_dataset.py by @jingedawang in #7402
New Contributors
- @dakinggg made their first contribution in #7411
- @ibarrien made their first contribution in #7407
- @jingedawang made their first contribution in #7402
Full Changelog: 3.3.1...3.3.2
3.3.1
3.3.0
Dataset Features
- Support async functions in map() by @lhoestq in #7384
- Especially useful to download content like images or call inference APIs (a sketch of one possible query_model helper follows this feature list)
prompt = "Answer the following question: {question}. You should think step by step."
async def ask_llm(example):
    return await query_model(prompt.format(question=example["question"]))
ds = ds.map(ask_llm)
- Add repeat method to datasets by @alex-hh in #7198
ds = ds.repeat(10)
- Support faster processing using pandas or polars functions in IterableDataset.map() by @lhoestq in #7370
- Add support for "pandas" and "polars" formats in IterableDatasets
- This enables optimized data processing using pandas or polars functions with zero-copy, e.g.
import polars as pl
from datasets import load_dataset

ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
ds = ds.with_format("polars")
expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
ds = ds.map(lambda df: df.with_columns(expr), batched=True)
- Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in #7207
- IterableDatasets with "numpy" format are now much faster
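The async map() example above assumes an async query_model helper. One possible implementation, sketched here with huggingface_hub's AsyncInferenceClient (the client setup, model choice and max_tokens are assumptions, not part of the release):
from huggingface_hub import AsyncInferenceClient

client = AsyncInferenceClient()  # optionally pass model="..." or an inference provider

async def query_model(prompt):
    # send the prompt as a single-turn chat completion and return the generated text
    response = await client.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return response.choices[0].message.content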
What's Changed
- don't import soundfile in tests by @lhoestq in #7340
- minor video docs on how to install by @lhoestq in #7341
- Fix typo in arrow_dataset by @AndreaFrancis in #7328
- remove filecheck to enable symlinks by @fschlatt in #7133
- Webdataset special columns in last position by @lhoestq in #7349
- Bump hfh to 0.24 to fix ci by @lhoestq in #7350
- fsspec 2024.12.0 by @lhoestq in #7352
- changes to MappedExamplesIterable to resolve #7345 by @vttrifonov in #7353
- Catch OSError for arrow by @lhoestq in #7348
- Remove .h5 from imagefolder extensions by @lhoestq in #7374
- Add Pandas, PyArrow and Polars docs by @lhoestq in #7382
- Optimized sequence encoding for scalars by @lukasgd in #7393
- Update docs by @lhoestq in #7395
- Update README.md by @lhoestq in #7396
- Release: 3.3.0 by @lhoestq in #7398
New Contributors
- @AndreaFrancis made their first contribution in #7328
- @vttrifonov made their first contribution in #7353
- @lukasgd made their first contribution in #7393
Full Changelog: 3.2.0...3.3.0
3.2.0
Dataset Features
- Faster parquet streaming + filters with predicate pushdown by @lhoestq in #7309
- Up to +100% streaming speed
- Fast filtering via predicate pushdown (skip files/row groups based on predicate instead of downloading the full data), e.g.
from datasets import load_dataset

filters = [('date', '>=', '2023')]
ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
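Filters use the pyarrow-style (column, op, value) tuple syntax, and a flat list of tuples is ANDed together (as in pyarrow), so you can narrow the selection further, e.g. (a sketch reusing the same dataset; the date bounds are just an illustration):
from datasets import load_dataset

# keep only rows whose date string falls within 2023 (both conditions must hold)
filters = [('date', '>=', '2023'), ('date', '<', '2024')]
ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)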
Other improvements and bug fixes
- fix conda release workflow by @lhoestq in #7272
- Add link to video dataset by @NielsRogge in #7277
- Raise error for incorrect JSON serialization by @varadhbhatnagar in #7273
- support for custom feature encoding/decoding by @alex-hh in #7284
- update load_dataset doctring by @lhoestq in #7301
- Let server decide default repo visibility by @Wauplin in #7302
- fix: update elasticsearch version by @ruidazeng in #7300
- Fix typing in iterable_dataset.py by @lhoestq in #7304
- Updated inconsistent output in documentation examples for ClassLabel by @sergiopaniego in #7293
- More docs to from_dict to mention that the result lives in RAM by @lhoestq in #7316
- Release: 3.2.0 by @lhoestq in #7317
New Contributors
- @ruidazeng made their first contribution in #7300
- @sergiopaniego made their first contribution in #7293
Full Changelog: 3.1.0...3.2.0
3.1.0
Dataset Features
- Video support by @lhoestq in #7230
>>> from datasets import Dataset, Video, load_dataset
>>> ds = Dataset.from_dict({"video": ["path/to/Screen Recording.mov"]}).cast_column("video", Video())
>>> # or from the hub
>>> ds = load_dataset("username/dataset_name", split="train")
>>> ds[0]["video"]
<decord.video_reader.VideoReader at 0x105525c70>
- Add IterableDataset.shard() by @lhoestq in #7252
>>> from datasets import load_dataset
>>> full_ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)
>>> full_ds.num_shards
2360
>>> ds = full_ds.shard(num_shards=full_ds.num_shards, index=0)
>>> ds.num_shards
1
>>> ds = full_ds.shard(num_shards=8, index=0)
>>> ds.num_shards
295
- Basic XML support by @lhoestq in #7250
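A typical use of IterableDataset.shard() above is giving each process or worker its own subset of shards, e.g. (a minimal sketch; the rank and world size are placeholders from your own setup):
from datasets import load_dataset

world_size, rank = 8, 0  # placeholders, e.g. from your distributed training setup
full_ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)
ds = full_ds.shard(num_shards=world_size, index=rank)  # this process streams roughly 1/8 of the shards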
What's Changed
- (Super tiny doc update) Mention to_polars by @fzyzcjy in #7232
- [MINOR:TYPO] Update arrow_dataset.py by @cakiki in #7236
- Missing video docs by @lhoestq in #7251
- fix decord import by @lhoestq in #7255
- fix ci for pyarrow 18 by @lhoestq in #7257
- Retry all requests timeouts by @lhoestq in #7256
- Always set non-null writer batch size by @lhoestq in #7258
- Don't embed videos by @lhoestq in #7259
- Allow video with disabled decoding without decord by @lhoestq in #7262
- Small addition to video docs by @lhoestq in #7263
- fix docs relative links by @lhoestq in #7264
- Disallow video push_to_hub by @lhoestq in #7265
Full Changelog: 3.0.2...3.1.0
3.0.2
Main bug fixes
- fix unbatched arrow map for iterable datasets by @alex-hh in #7204
- Support features in metadata configs by @albertvillanova in #7182
- Preserve features in iterable dataset.filter by @alex-hh in #7209
- Pin dill<0.3.9 to fix CI by @albertvillanova in #7184
- this should also fix cache issues
What's Changed
- Fix release instructions by @albertvillanova in #7177
- Pin multiprocess<0.70.1 to align with dill<0.3.9 by @albertvillanova in #7188
- with_format docstring by @lhoestq in #7203
- fix ci benchmark by @lhoestq in #7205
- Fix the environment variable for huggingface cache by @torotoki in #7200
- Support Python 3.11 by @albertvillanova in #7179
- bump fsspec by @lhoestq in #7219
- Fix typo in image dataset docs by @albertvillanova in #7231
- No need for dataset_info by @lhoestq in #7234
- use huggingface_hub offline mode by @lhoestq in #7244
Full Changelog: 3.0.1...3.0.2
3.0.1
What's Changed
- Modify add_column() to optionally accept a FeatureType as param by @varadhbhatnagar in #7143
- Align filename prefix splitting with WebDataset library by @albertvillanova in #7151
- Support ndjson data files by @albertvillanova in #7154
- Support JSON lines with missing struct fields by @albertvillanova in #7160
- Support JSON lines with empty struct by @albertvillanova in #7162
- fix increase_load_count by @lhoestq in #7165
- fix docstring code example for distributed shuffle by @lhoestq in #7166
- Support JSON lines with missing columns by @albertvillanova in #7170
- Add torchdata as a regular test dependency by @albertvillanova in #7172
New Contributors
- @varadhbhatnagar made their first contribution in #7143
Full Changelog: 3.0.0...3.0.1