Support for mirroring datasets for use without an internet connection #3125

Open
@orf

Description

Is your feature request related to a problem? Please describe.

For compliance and security reasons, training and experimentation may be done in environments without internet access. While datasets provides a local cache, it is not clear or documented how to use an internal mirror or proxy for many distributed clients.

An internal mirror also helps protect against transient network issues, improves performance, and reduces network load on external services when using stateless machines that start with an empty cache.

Describe the solution you'd like

I'd like to be able to export an environment variable like TFDS_MIRROR=s3://some-bucket/some-prefix/ and have tfds fetch all files from that location, making no requests to the public internet. A separate process can populate the mirror from a trusted environment with internet access. A simple key structure, in which the sha256 hash of each URL maps to the downloaded data, would be sufficient:

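import hashlib
import os

def download(url):
    # download_from and NotFound are placeholders for whatever client reads the mirror.
    # Strip the trailing slash so the key is not joined with a double slash.
    mirror_prefix = os.environ['TFDS_MIRROR'].rstrip('/')
    # sha256 operates on bytes, so encode the URL and take the hex digest as the key.
    url_hash = hashlib.sha256(url.encode('utf-8')).hexdigest()
    try:
        return download_from(f'{mirror_prefix}/{url_hash}')
    except NotFound:
        print(f'URL {url} not found in mirror')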
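
To make the key structure concrete, here is a minimal sketch of both halves (populating the mirror and reading from it), assuming a plain local directory stands in for the S3 bucket. None of these names exist in tfds: mirror_key, populate_mirror, fetch_from_mirror and MirrorMiss are hypothetical, and a real implementation would swap the file operations for calls to an object store client.

```python
import hashlib
import os
import tempfile
from pathlib import Path


class MirrorMiss(KeyError):
    """Raised when a URL has no entry in the mirror (hypothetical name)."""


def mirror_key(url: str) -> str:
    # The key is the hex SHA-256 digest of the URL, as proposed above.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def populate_mirror(mirror_dir: Path, url: str, payload: bytes) -> Path:
    # Runs in the trusted environment; payload is the already-downloaded file.
    mirror_dir.mkdir(parents=True, exist_ok=True)
    target = mirror_dir / mirror_key(url)
    tmp = target.with_suffix(".tmp")
    tmp.write_bytes(payload)
    # Rename into place so readers never observe a partially written entry.
    os.replace(tmp, target)
    return target


def fetch_from_mirror(mirror_dir: Path, url: str) -> bytes:
    # Runs in the offline environment; never falls back to the network.
    try:
        return (mirror_dir / mirror_key(url)).read_bytes()
    except FileNotFoundError:
        raise MirrorMiss(url) from None


if __name__ == "__main__":
    mirror = Path(tempfile.mkdtemp()) / "mirror"
    url = "https://example.com/data.tar.gz"  # illustrative URL only
    populate_mirror(mirror, url, b"example payload")
    print(fetch_from_mirror(mirror, url))
```

Raising on a miss rather than printing means a missing entry fails loudly instead of returning None, which seems preferable when the machine has no network to fall back on.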

Describe alternatives you've considered

You could mount a shared distributed filesystem on $TFDS_CACHE_DIR, but tfds will still try to make network requests. Other machines might also write to this location at runtime, and performance would not be optimal.

Labels

enhancement: New feature or request