Releases: huggingface/datasets

3.5.0

27 Mar 16:38
0b5998a

Datasets Features

>>> from datasets import load_dataset, Pdf
>>> repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
>>> dataset = load_dataset(repo, split="train")
>>> dataset[0]["pdf"]
<pdfplumber.pdf.PDF at 0x1075bc320>
>>> dataset[0]["pdf"].pages[0].extract_text()
...

What's Changed

New Contributors

Full Changelog: 3.4.1...3.5.0

3.4.1

17 Mar 16:00
f742152

Bug Fixes

Full Changelog: 3.4.0...3.4.1

3.4.0

14 Mar 16:46
14fb15a

Dataset Features

  • Faster folder based builder + parquet support + allow repeated media + use torchvideo by @lhoestq in #7424

    • /!\ Breaking change: we replaced decord with torchvision to read videos, since decord is no longer maintained and isn't available for recent Python versions. See the video dataset loading documentation for more details. The Video type is still marked as experimental in this version.
    from datasets import load_dataset, Video
    
    dataset = load_dataset("path/to/video/folder", split="train")
    dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
    • faster streaming for image/audio/video folder from Hugging Face
    • support for metadata.parquet in addition to metadata.csv or metadata.jsonl for the metadata of the image/audio/video files
  • Add IterableDataset.decode with multithreading by @lhoestq in #7450

    • even faster streaming for image/audio/video folder from Hugging Face if you enable multithreading to decode image/audio/video data:
    dataset = dataset.decode(num_threads=num_threads)
  • Add with_split to DatasetDict.map by @jp1924 in #7368
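
The speed-up from multithreaded decoding comes from running several decodes at once while still yielding items in their original order. Below is a minimal standard-library sketch of that idea. It is not the library's implementation: `fake_decode` and `decode_with_threads` are hypothetical names introduced only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_decode(raw: bytes) -> str:
    # Hypothetical stand-in for an image/audio/video decoder.
    return raw.decode("utf-8").upper()


def decode_with_threads(raw_items, num_threads):
    # executor.map runs the decodes concurrently in a thread pool
    # but yields the results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        yield from executor.map(fake_decode, raw_items)


decoded = list(decode_with_threads([b"a", b"b", b"c"], num_threads=2))
print(decoded)
```

Threads help here because real decoders spend much of their time in I/O or in native code that releases the GIL; a pure-Python decoder like the stand-in above would gain little.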

General improvements and bug fixes

New Contributors

Full Changelog: 3.3.2...3.4.0

3.3.2

20 Feb 17:44
b37230c

Bug fixes

  • Attempt to fix multiprocessing hang by closing and joining the pool before termination by @dakinggg in #7411
  • Gracefully cancel async tasks by @lhoestq in #7414

Other general improvements

New Contributors

Full Changelog: 3.3.1...3.3.2

3.3.1

17 Feb 14:53
4ead6ec

Bug fixes

Full Changelog: 3.3.0...3.3.1

3.3.0

14 Feb 10:15
e9dae36

Dataset Features

  • Support async functions in map() by @lhoestq in #7384

    • Especially useful to download content like images or call inference APIs
    prompt = "Answer the following question: {question}. You should think step by step."
    # query_model is a placeholder for your own async inference call
    async def ask_llm(example):
        return await query_model(prompt.format(question=example["question"]))
    ds = ds.map(ask_llm)
  • Add repeat method to datasets by @alex-hh in #7198

    ds = ds.repeat(10)
  • Support faster processing using pandas or polars functions in IterableDataset.map() by @lhoestq in #7370

    • Add support for "pandas" and "polars" formats in IterableDatasets
    • This enables optimized data processing using pandas or polars functions with zero-copy, e.g.
    import polars as pl
    from datasets import load_dataset

    ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
    ds = ds.with_format("polars")
    expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
    ds = ds.map(lambda df: df.with_columns(expr), batched=True)
  • Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in #7207

    • IterableDatasets with "numpy" format are now much faster
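
Async functions pay off in map() because many I/O-bound calls can wait at the same time. Below is a minimal standard-library sketch of that idea, independent of how datasets schedules the calls internally. `query_model` is a hypothetical stand-in for an inference API, and `map_concurrently` and `max_concurrency` are names assumed for illustration.

```python
import asyncio


async def query_model(text: str) -> str:
    # Hypothetical stand-in for a network call to an inference API.
    await asyncio.sleep(0.01)
    return f"answer to: {text}"


async def ask_llm(example: dict) -> dict:
    return {"answer": await query_model(example["question"])}


async def map_concurrently(examples, fn, max_concurrency=4):
    # Limit the number of in-flight calls so the API is not flooded.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(example):
        async with semaphore:
            return await fn(example)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(e) for e in examples))


examples = [{"question": f"q{i}"} for i in range(8)]
results = asyncio.run(map_concurrently(examples, ask_llm))
print(results[0])
```

With eight examples and at most four concurrent calls, the waits overlap in two waves rather than running eight times in sequence.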

What's Changed

New Contributors

Full Changelog: 3.2.0...3.3.0

3.2.0

10 Dec 17:00
fba4758

Dataset Features

  • Faster parquet streaming + filters with predicate pushdown by @lhoestq in #7309
    • Up to twice the streaming speed (+100%)
    • Fast filtering via predicate pushdown (skip files/row groups based on predicate instead of downloading the full data), e.g.
      from datasets import load_dataset
      filters = [('date', '>=', '2023')]
      ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
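
Predicate pushdown relies on per-row-group statistics: if a group's maximum value cannot satisfy the predicate, the group is skipped without being read. Below is a minimal pure-Python sketch of that idea. It is not the Parquet reader's implementation: the `RowGroup` class, `scan_with_pushdown`, and the sample dates are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_date: str  # statistics stored alongside the data
    max_date: str
    rows: list


def scan_with_pushdown(row_groups, threshold):
    """Return rows with date >= threshold and how many groups were read."""
    matched, groups_read = [], 0
    for group in row_groups:
        if group.max_date < threshold:
            continue  # no row in this group can match: skip it unread
        groups_read += 1
        matched.extend(row for row in group.rows if row["date"] >= threshold)
    return matched, groups_read


row_groups = [
    RowGroup("2021-01-05", "2021-12-30", [{"date": "2021-01-05"}, {"date": "2021-12-30"}]),
    RowGroup("2022-06-01", "2023-02-14", [{"date": "2022-06-01"}, {"date": "2023-02-14"}]),
    RowGroup("2023-03-01", "2023-09-09", [{"date": "2023-03-01"}, {"date": "2023-09-09"}]),
]
rows, groups_read = scan_with_pushdown(row_groups, "2023")
print(groups_read, [row["date"] for row in rows])
```

The first group is skipped from its statistics alone, which is where the saved download and decode time comes from.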

Other improvements and bug fixes

New Contributors

Full Changelog: 3.1.0...3.2.0

3.1.0

31 Oct 15:21
dfb52e2

Dataset Features

  • Video support by @lhoestq in #7230
    >>> from datasets import Dataset, Video, load_dataset
    >>> ds = Dataset.from_dict({"video":["path/to/Screen Recording.mov"]}).cast_column("video", Video())
    >>> # or from the hub
    >>> ds = load_dataset("username/dataset_name", split="train")
    >>> ds[0]["video"]
    <decord.video_reader.VideoReader at 0x105525c70>
  • Add IterableDataset.shard() by @lhoestq in #7252
    >>> from datasets import load_dataset
    >>> full_ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)
    >>> full_ds.num_shards
    2360
    >>> ds = full_ds.shard(num_shards=full_ds.num_shards, index=0)
    >>> ds.num_shards
    1
    >>> ds = full_ds.shard(num_shards=8, index=0)
    >>> ds.num_shards
    295
  • Basic XML support by @lhoestq in #7250
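
The shard counts in the IterableDataset.shard() example above follow from dividing the 2360 source shards among the requested number of shards. Below is a small sketch of that arithmetic. The contiguous assignment used here is an assumption for illustration (the library may assign shards differently), and `contiguous_shard_indices` is a hypothetical helper, not a datasets function.

```python
def contiguous_shard_indices(total_shards, num_shards, index):
    # Assumed scheme: split the source shards into num_shards contiguous
    # blocks, giving one extra shard to the first `total_shards % num_shards` blocks.
    div, mod = divmod(total_shards, num_shards)
    start = div * index + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    return range(start, end)


# 2360 source shards split 8 ways: each resulting shard holds 295 of them.
print(len(contiguous_shard_indices(2360, 8, 0)))
# Split 2360 ways: each resulting shard holds exactly one source shard.
print(len(contiguous_shard_indices(2360, 2360, 0)))
```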

What's Changed

New Contributors

Full Changelog: 3.0.2...3.1.0

3.0.2

22 Oct 15:03
97e5e17

Main bug fixes

What's Changed

New Contributors

Full Changelog: 3.0.1...3.0.2

3.0.1

26 Sep 08:27
679562d

What's Changed

New Contributors

Full Changelog: 3.0.0...3.0.1