Description
Is your feature request related to a problem? Please describe.
For compliance and security reasons, training and experimentation may be done in environments without internet access. While datasets provides a local cache, it is neither clear nor documented how to use an internal mirror or proxy for many distributed clients.
An internal mirror also helps protect against transient network issues, improves performance, and reduces network load on external services when using stateless machines that start with an empty cache.
Describe the solution you'd like
I'd like to be able to export an environment variable like TFDS_MIRROR=s3://some-bucket/some-prefix/ and have tfds use it to fetch all files, making no other network requests. A separate process can populate the mirror from a trusted environment with internet access. A simple key structure that maps the sha256 hash of each URL to its data would be sufficient:
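import hashlib
import os

def download(url):
    # download_from and NotFound are placeholders for the storage client's API.
    mirror_prefix = os.environ['TFDS_MIRROR'].rstrip('/')
    # The mirror key is the hex sha256 digest of the original URL.
    url_hash = hashlib.sha256(url.encode('utf-8')).hexdigest()
    try:
        return download_from(f'{mirror_prefix}/{url_hash}')
    except NotFound:
        print(f'URL {url} not reachable')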
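To make the populate-then-read flow concrete, here is a minimal sketch of both halves. It is only an illustration under stated assumptions: it uses a local directory in place of an S3 bucket, the names `mirror_key`, `populate_mirror`, and `read_from_mirror` are hypothetical and not part of tfds, and the `fetch` callable stands in for whatever HTTP client the trusted environment would use.

```python
import hashlib
from pathlib import Path
from typing import Callable, Iterable


def mirror_key(url: str) -> str:
    """Return the hex SHA-256 digest of the URL, used as the mirror key."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def populate_mirror(urls: Iterable[str], mirror_dir: Path,
                    fetch: Callable[[str], bytes]) -> None:
    """Trusted side: store each URL's bytes under its hashed key.

    `fetch` is a hypothetical callable that downloads a URL and returns bytes.
    """
    mirror_dir.mkdir(parents=True, exist_ok=True)
    for url in urls:
        (mirror_dir / mirror_key(url)).write_bytes(fetch(url))


def read_from_mirror(url: str, mirror_dir: Path) -> bytes:
    """Isolated side: resolve the URL from the mirror only, never the network."""
    path = mirror_dir / mirror_key(url)
    if not path.is_file():
        raise FileNotFoundError(f"URL {url} is not in the mirror at {mirror_dir}")
    return path.read_bytes()
```

Because the key depends only on the URL, the populating process and the clients need no shared index; they only have to agree on the hashing scheme. A missing entry raises an error here rather than falling back to the network, which matches the goal of making no external requests.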
Describe alternatives you've considered
You could mount a shared distributed filesystem on $TFDS_CACHE_DIR, but tfds will still try to make network requests. Other machines might write to this location at runtime, and performance would not be optimal.