Dataset librispeech_asr fails to load #4179

@albertz

Describe the bug

The dataset librispeech_asr (standard Librispeech) fails to load.

Steps to reproduce the bug

import datasets

datasets.load_dataset("librispeech_asr")

Expected results

It should download and prepare the whole dataset (all subsets).

The doc says the dataset has two configurations (clean and other).
However, the dataset doc also says that not specifying a split should just load the whole dataset, which is what I want.

Also, for this specific dataset, loading everything is the standard the community uses. Publications that report results on Librispeech generally train on the whole train dataset.
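Until loading without a configuration name works, a possible workaround is to request each configuration explicitly and keep the results side by side. The sketch below assumes the two configuration names `clean` and `other` that appear as keys of `_DL_URLS` in the traceback; `load_all_configs` and its `load_fn` parameter are hypothetical names introduced here for illustration, not part of the library.

```python
from typing import Any, Callable, Dict, Optional, Sequence

# Configuration names taken from the _DL_URLS keys shown in the traceback.
CONFIG_NAMES = ("clean", "other")


def load_all_configs(
    path: str = "librispeech_asr",
    config_names: Sequence[str] = CONFIG_NAMES,
    load_fn: Optional[Callable[..., Any]] = None,
) -> Dict[str, Any]:
    """Load every named configuration and return the results keyed by name.

    Hypothetical helper: `load_fn` defaults to `datasets.load_dataset`, but
    any callable taking (path, config_name) can be passed instead.
    """
    if load_fn is None:
        # Imported lazily so the helper can be exercised without the library.
        from datasets import load_dataset

        load_fn = load_dataset
    return {name: load_fn(path, name) for name in config_names}
```

Calling `load_all_configs()` with the defaults would download and prepare both configurations, which is a large download. Whether and how to merge the resulting train splits is left open here.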

Actual results

...
  File "/home/az/.cache/huggingface/modules/datasets_modules/datasets/librispeech_asr/1f4602f6b5fed8d3ab3e3382783173f2e12d9877e98775e34d7780881175096c/librispeech_asr.py", line 119, in LibrispeechASR._split_generators
    line: archive_path = dl_manager.download(_DL_URLS[self.config.name])
    locals:
      archive_path = <not found>
      dl_manager = <local> <datasets.utils.download_manager.DownloadManager object at 0x7fc07b426160>
      dl_manager.download = <local> <bound method DownloadManager.download of <datasets.utils.download_manager.DownloadManager object at 0x7fc07b426160>>
      _DL_URLS = <global> {'clean': {'dev': 'http://www.openslr.org/resources/12/dev-clean.tar.gz', 'test': 'http://www.openslr.org/resources/12/test-clean.tar.gz', 'train.100': 'http://www.openslr.org/resources/12/train-clean-100.tar.gz', 'train.360': 'http://www.openslr.org/resources/12/train-clean-360.tar.gz'}, 'other'...
      self = <local> <datasets_modules.datasets.librispeech_asr.1f4602f6b5fed8d3ab3e3382783173f2e12d9877e98775e34d7780881175096c.librispeech_asr.LibrispeechASR object at 0x7fc12a633310>
      self.config = <local> BuilderConfig(name='default', version=0.0.0, data_dir='/home/az/i6/setups/2022-03-20--sis/work/i6_core/datasets/huggingface/DownloadAndPrepareHuggingFaceDatasetJob.TV6Nwm6dFReF/output/data_dir', data_files=None, description=None)
      self.config.name = <local> 'default', len = 7
KeyError: 'default'

Environment info

  • datasets version: 2.1.0
  • Platform: Linux-5.4.0-107-generic-x86_64-with-glibc2.31
  • Python version: 3.9.9
  • PyArrow version: 6.0.1
  • Pandas version: 1.4.2
