[NEUTRAL] Update dependency datasets to v3 #8

Open
wants to merge 1 commit into
base: main
from

Conversation

@mend-for-github-com mend-for-github-com bot commented Nov 5, 2024

This PR contains the following updates:

Package   Change                         Age  Adoption  Passing  Confidence
datasets  >=2.14.6,<2.17 -> >=3.2,<3.3   age  adoption  passing  confidence
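The range change above can be sanity-checked without extra packages. The following is a minimal sketch, assuming plain dotted numeric versions with no pre-release or post-release suffixes; `parse` and `in_range` are hypothetical helpers written for this illustration, not part of pip or any packaging library.

```python
def parse(version: str) -> tuple:
    """Turn a dotted numeric version such as '3.2.0' into a comparable tuple.

    Short versions are padded with zeros so that '3.2' and '3.2.0' compare equal.
    """
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def in_range(version: str, lower: str, upper: str) -> bool:
    """Return True when lower <= version < upper."""
    return parse(lower) <= parse(version) < parse(upper)


OLD_RANGE = ("2.14.6", "2.17")  # >=2.14.6,<2.17
NEW_RANGE = ("3.2", "3.3")      # >=3.2,<3.3

for candidate in ("2.16.1", "3.1.0", "3.2.0"):
    print(candidate, in_range(candidate, *OLD_RANGE), in_range(candidate, *NEW_RANGE))
```

No version satisfies both ranges, so any environment pinned to the old constraint has to be re-resolved after this PR is merged.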

Release Notes

huggingface/datasets (datasets)

v3.2.0

Compare Source

Dataset Features
  • Faster parquet streaming + filters with predicate pushdown by @​lhoestq in https://github.com/huggingface/datasets/pull/7309
    • Up to +100% streaming speed
    • Fast filtering via predicate pushdown (skip files/row groups based on predicate instead of downloading the full data), e.g.
      from datasets import load_dataset
      filters = [('date', '>=', '2023')]
      ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
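To illustrate the idea behind predicate pushdown, here is a minimal pure-Python sketch. It is not how `datasets` or any Parquet reader implements the feature; it only assumes that each row group carries min/max statistics for the filtered column. `RowGroup`, `may_match`, and `scan` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group with min/max column statistics."""
    min_date: str
    max_date: str
    rows: list


def may_match(group: RowGroup, lower_bound: str) -> bool:
    """A group can hold rows with date >= lower_bound only if its max date does."""
    return group.max_date >= lower_bound


def scan(groups, lower_bound):
    """Yield matching rows, skipping whole groups that the statistics rule out."""
    for group in groups:
        if not may_match(group, lower_bound):
            continue  # skipped without looking at its rows
        for row in group.rows:
            if row["date"] >= lower_bound:
                yield row


groups = [
    RowGroup("2021-01-01", "2022-12-31", [{"date": "2021-06-01"}, {"date": "2022-12-31"}]),
    RowGroup("2022-11-01", "2023-03-01", [{"date": "2022-11-01"}, {"date": "2023-03-01"}]),
]
matches = list(scan(groups, "2023"))
print(matches)
```

In this toy data the first group is skipped entirely because its maximum date sorts before `"2023"`, which is the saving the release note describes for files and row groups.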
Other improvements and bug fixes
New Contributors

Full Changelog: huggingface/datasets@3.1.0...3.2.0

v3.1.0

Compare Source

Dataset Features
What's Changed
New Contributors

Full Changelog: huggingface/datasets@3.0.2...3.1.0

v3.0.2

Compare Source

Main bug fixes
What's Changed
New Contributors

Full Changelog: huggingface/datasets@3.0.1...3.0.2

v3.0.1

Compare Source

What's Changed
New Contributors

Full Changelog: huggingface/datasets@3.0.0...3.0.1

v3.0.0

Compare Source

Dataset Features
  • Use Polars functions in .map()
    • Allow Polars as valid output type by @​psmyth94 in https://github.com/huggingface/datasets/pull/6762

    • Example:

      >>> import polars as pl
      >>> from datasets import load_dataset
      >>> ds = load_dataset("lhoestq/CudyPokemonAdventures", split="train").with_format("polars")
      >>> cols = [pl.col("content").str.len_bytes().alias("length")]
      >>> ds_with_length = ds.map(lambda df: df.with_columns(cols), batched=True)
      >>> ds_with_length[:5]
      shape: (5, 5)
      ┌─────┬───────────────────────────────────┬───────────────────────────────────┬───────────────────────┬────────┐
      │ idx ┆ title                             ┆ content                           ┆ labels                ┆ length │
      │ --- ┆ ---                               ┆ ---                               ┆ ---                   ┆ ---    │
      │ i64 ┆ str                               ┆ str                               ┆ str                   ┆ u32    │
      ╞═════╪═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════╪════════╡
      │ 0   ┆ The Joyful Adventure of Bulbasau… ┆ Bulbasaur embarked on a sunny qu… ┆ joyful_adventure      ┆ 180    │
      │ 1   ┆ Pikachu's Quest for Peace         ┆ Pikachu, with his cheeky persona… ┆ peaceful_narrative    ┆ 138    │
      │ 2   ┆ The Tender Tale of Squirtle       ┆ Squirtle took everyone on a memo… ┆ gentle_adventure      ┆ 135    │
      │ 3   ┆ Charizard's Heartwarming Tale     ┆ Charizard found joy in helping o… ┆ heartwarming_story    ┆ 112    │
      │ 4   ┆ Jolteon's Sparkling Journey       ┆ Jolteon, with his zest for life,… ┆ celebratory_narrative ┆ 111    │
      └─────┴───────────────────────────────────┴───────────────────────────────────┴───────────────────────┴────────┘
  • Support NumPy 2
Cache Changes
  • Use huggingface_hub cache by @​lhoestq in https://github.com/huggingface/datasets/pull/7105
    • use the huggingface_hub cache for files downloaded from HF, by default at ~/.cache/huggingface/hub
    • cached datasets (Arrow files) will still be reloaded from the datasets cache, by default at ~/.cache/huggingface/datasets
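The two default locations above can be resolved with a small helper. This is a sketch under assumptions: it treats the environment variables `HF_HOME`, `HF_HUB_CACHE`, and `HF_DATASETS_CACHE` as the relevant overrides and ignores any other lookup the libraries may perform, so it should be checked against the installed versions. `resolve_cache_dirs` is a hypothetical helper, not a library function.

```python
import os
from pathlib import Path


def resolve_cache_dirs(env=None):
    """Return (hub_cache, datasets_cache) using the defaults quoted above.

    Simplified: only HF_HOME, HF_HUB_CACHE and HF_DATASETS_CACHE are consulted.
    """
    env = os.environ if env is None else env
    hf_home = Path(env.get("HF_HOME", "~/.cache/huggingface")).expanduser()
    hub_cache = Path(env.get("HF_HUB_CACHE", hf_home / "hub")).expanduser()
    datasets_cache = Path(env.get("HF_DATASETS_CACHE", hf_home / "datasets")).expanduser()
    return hub_cache, datasets_cache


# Example with an explicit (hypothetical) HF_HOME override.
hub_cache, datasets_cache = resolve_cache_dirs({"HF_HOME": "/data/hf"})
print(hub_cache)
print(datasets_cache)
```

After upgrading, downloaded source files and prepared Arrow files may therefore live in different directories, which matters when sizing or cleaning up disk space.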
Breaking changes
General improvements and bug fixes
New Contributors

Full Changelog: huggingface/datasets@2.21.0...3.0.0

v2.21.0

Compare Source

Features
What's Changed
New Contributors

Full Changelog: huggingface/datasets@2.20.0...2.21.0

v2.20.0

Compare Source

Important
Datasets features
  • [Resumable IterableDataset] Add IterableDataset state_dict by @​lhoestq in https://github.com/huggingface/datasets/pull/6658
    • checkpoint and resume an iterable dataset (e.g. when streaming):

      >>> from datasets import Dataset
      >>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
      >>> for idx, example in enumerate(iterable_dataset):
      ...     print(example)
      ...     if idx == 2:
      ...         state_dict = iterable_dataset.state_dict()
      ...         print("checkpoint")
      ...         break
      >>> iterable_dataset.load_state_dict(state_dict)
      >>> print("restart from checkpoint")
      >>> for example in iterable_dataset:
      ...     print(example)

      Returns:

      {'a': 0}
      {'a': 1}
      {'a': 2}
      checkpoint
      restart from checkpoint
      {'a': 3}
      {'a': 4}
      {'a': 5}
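Conceptually, checkpoint and resume amounts to recording how far iteration has progressed and restoring that position later. The toy sketch below shows the pattern in plain Python; it is unrelated to the actual `IterableDataset` implementation, and `ResumableRange` is a hypothetical class.

```python
class ResumableRange:
    """Toy iterable that can checkpoint and restore its position."""

    def __init__(self, stop: int):
        self.stop = stop
        self._next_index = 0  # index of the next item to yield

    def __iter__(self):
        while self._next_index < self.stop:
            value = self._next_index
            self._next_index += 1
            yield {"a": value}

    def state_dict(self) -> dict:
        """Snapshot the current position."""
        return {"next_index": self._next_index}

    def load_state_dict(self, state: dict) -> None:
        """Restore a position captured by state_dict()."""
        self._next_index = state["next_index"]


source = ResumableRange(6)
seen = []
for idx, example in enumerate(source):
    seen.append(example)
    if idx == 2:
        state = source.state_dict()  # checkpoint after the third example
        break

resumed = ResumableRange(6)
resumed.load_state_dict(state)
rest = list(resumed)  # continues from the checkpointed position
print(seen, rest)
```

The state is a plain dictionary, so it can be stored alongside a training checkpoint and loaded into a fresh object after a restart.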
      
General improvements and bug fixes
New Contributors

Full Changelog: huggingface/datasets@2.19.0...2.20.0

v2.19.2

Compare Source

Bug fixes

Full Changelog: huggingface/datasets@2.19.1...2.19.2

v2.19.1

Compare Source

Bug fixes

Full Changelog: huggingface/datasets@2.19.0...2.19.1

v2.19.0

Compare Source

Dataset Features
General improvements and bug fixes

Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about these updates again.


  • If you want to rebase/retry this PR, check this box

@mend-for-github-com mend-for-github-com bot changed the title chore(deps): update dependency datasets to v3 [NEUTRAL] Update dependency datasets to v3 Nov 6, 2024
@mend-for-github-com mend-for-github-com bot force-pushed the whitesource-remediate/datasets-3.x branch from 4a42841 to cfab78f Compare December 30, 2024 18:43