load_dataset('bigcode/the-stack-dedup', streaming=True) very slow! #5846

@tbenthompson

Describe the bug

Running

import datasets
ds = datasets.load_dataset('bigcode/the-stack-dedup', streaming=True)

takes about 2.5 minutes!

I would expect this to be near instantaneous. With other datasets, the runtime is one or two seconds.
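
For anyone trying to reproduce the numbers above, here is a minimal sketch of how the call can be timed. `time_call` is a hypothetical helper written for this report, not part of `datasets`; it measures wall-clock time for a single call only.

```python
import time
from typing import Any, Callable, Tuple


def time_call(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Usage against the call from this report (needs `datasets` and network access):
#   import datasets
#   ds, seconds = time_call(
#       lambda: datasets.load_dataset("bigcode/the-stack-dedup", streaming=True)
#   )
#   print(f"load_dataset took {seconds:.1f} s")
```

Importing `datasets` before the timed call keeps import time out of the measurement, so the figure reflects only `load_dataset` itself.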

Environment info

  • datasets version: 2.11.0
  • Platform: macOS-13.3.1-arm64-arm-64bit
  • Python version: 3.10.10
  • Huggingface_hub version: 0.13.4
  • PyArrow version: 11.0.0
  • Pandas version: 2.0.0
