
DownloadManager with register checksum is much slower #2901

@adsnaider


Short description
When using the DownloadManager to download many small files (1M+ images), the download is reasonably fast if register_checksums is disabled. With register_checksums enabled, however, it is painfully slow: the difference is multiple orders of magnitude. I'm doing this with a non-Beam dataset.
I'm unsure whether this is related to how the downloads are parallelized. The documentation says that if dl_manager receives a data structure to download, it will parallelize the downloads. Does parallelization not work when register_checksums is enabled?
If that is the case, it would be nice to at least update the documentation to clarify this.
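
To illustrate why a loss of parallelization alone could explain a large slowdown with many small files, here is a toy simulation. It is only a sketch: `fake_download`, `run_serial`, `run_parallel`, and `timed` are hypothetical names, the network is replaced by a short sleep, and nothing here reflects how TFDS actually schedules downloads or registers checksums.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url, latency=0.01):
    """Hypothetical stand-in for one small download: sleeps instead of using the network."""
    time.sleep(latency)
    return url


def run_serial(urls):
    """Fetch the URLs one after another."""
    return [fake_download(u) for u in urls]


def run_parallel(urls, max_workers=16):
    """Fetch the URLs concurrently with a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_download, urls))


def timed(fn, urls):
    """Return (elapsed_seconds, result) for fn(urls)."""
    start = time.perf_counter()
    result = fn(urls)
    return time.perf_counter() - start, result


if __name__ == "__main__":
    urls = [f"file_{i}" for i in range(64)]
    serial_s, _ = timed(run_serial, urls)
    parallel_s, _ = timed(run_parallel, urls)
    print(f"serial: {serial_s:.2f}s, parallel: {parallel_s:.2f}s")
```

With per-file latency dominating, the serial time grows linearly with the number of files, while the parallel time is divided by roughly the number of workers. Whether this is what happens when register_checksums is enabled is exactly the open question.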

Environment information

  • Operating System: Ubuntu 20.04

  • Python version: 3.8.5

  • tensorflow-datasets version: 4.1.0

  • tensorflow version: 2.3.1

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)? Yes

Reproduction instructions

I'm using dl_manager to download files from S3. To reproduce the issue, compare the speed of downloading many small files from S3, once with register_checksums enabled and once with it disabled. In my case the dataset is upwards of 70GB, but I don't believe it needs to be that large: a couple of GB will probably be enough.
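
The comparison could be scripted roughly as follows. This is a minimal sketch, not the exact code I ran: `compare_download_speed` is a hypothetical helper, and the two callables are assumed to wrap a `dl_manager.download(urls)` call on a DownloadManager configured without and with register_checksums respectively.

```python
import time


def compare_download_speed(download_without, download_with, urls):
    """Time two download callables over the same URLs.

    download_without / download_with are assumed to wrap dl_manager.download
    for a manager with register_checksums disabled / enabled.
    """
    timings = {}
    runs = (
        ("register_checksums=False", download_without),
        ("register_checksums=True", download_with),
    )
    for label, download_fn in runs:
        start = time.perf_counter()
        download_fn(urls)
        timings[label] = time.perf_counter() - start

    # Ratio of the two wall-clock times; guard against a zero denominator.
    baseline = max(timings["register_checksums=False"], 1e-9)
    timings["slowdown"] = timings["register_checksums=True"] / baseline
    return timings
```

Each run should start from an empty download directory, since files cached by the first run would otherwise make the second run look faster than it is.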

Expected behavior
I expected the download speed not to change so drastically because of checksum registration.

    Labels

    enhancement (New feature or request)