Description
🐛 Bug
When using a StreamingDataset stored in Wasabi S3-compatible storage, storage_options that work with the S3Client don't work with s5cmd, and I don't think it's possible to pass the required arguments to s5cmd.
from litdata import StreamingDataset
from botocore import UNSIGNED
from botocore.client import Config
storage_options = {
    "endpoint_url": "https://s3.wasabisys.com",
    "config": Config(signature_version=UNSIGNED),
}
# succeeds with S3Client, fails with s5cmd, can't figure out a workaround...
dataset = StreamingDataset("s3://visionlab-litdata/mnist/val",
                           storage_options=storage_options)
To Reproduce
The above code succeeds in an environment without s5cmd installed, or when I edit litdata/streaming/downloader.py to set self._s5cmd_available to False:
nano +72 /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/streaming/downloader.py
pasting in:
self._s5cmd_available = False
Without that edit, s5cmd is used and throws an error, and as far as I can tell, there's no way to pass the required arguments through the StreamingDataset storage_options or system environment variables.
For example, this command works:
s5cmd --no-sign-request --endpoint-url https://s3.wasabisys.com cp s3://visionlab-litdata/mnist/val/index.json index.json
But s5cmd does not appear to use environment variables for either --no-sign-request or --endpoint-url, so for instance the following doesn't work:
export AWS_ENDPOINT_URL=https://s3.wasabisys.com
s5cmd --no-sign-request cp s3://visionlab-litdata/mnist/val/index.json index.json
The AWS CLI, by contrast, does use that environment variable, so the following succeeds:
export AWS_ENDPOINT_URL=https://s3.wasabisys.com
aws s3 --no-sign-request cp s3://visionlab-litdata/mnist/val/index.json /tmp/index.json
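As a stopgap that avoids editing the installed package, it may be possible to make s5cmd invisible to the current Python process, so that litdata falls back to the S3Client just as it does in an environment without s5cmd. The sketch below assumes that litdata decides whether s5cmd is available by trying to run it from PATH; that is an assumption about litdata internals, and `hide_s5cmd_from_path` is a hypothetical helper, not part of litdata.

```python
import os
import shutil


def hide_s5cmd_from_path() -> None:
    """Drop every PATH entry that provides an `s5cmd` executable.

    Only the current process (and children started afterwards) is affected;
    the shell that launched Python keeps its original PATH.
    """
    entries = os.environ.get("PATH", "").split(os.pathsep)
    # shutil.which(..., path=d) searches only directory d
    kept = [d for d in entries if shutil.which("s5cmd", path=d) is None]
    os.environ["PATH"] = os.pathsep.join(kept)
```

This would have to be called before the StreamingDataset is constructed. Any other tool living in the same directory as s5cmd also disappears from PATH for this process, so it is a workaround rather than a fix.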
I think the command in the S3Downloader class might need to be constructed differently to handle storage_options, hopefully in a way that's uniform across S3Client and s5cmd? Or perhaps there needs to be a separate s5cmd_options argument? This is the current code:
if self._s5cmd_available:
    env = None
    if self._storage_options:
        env = os.environ.copy()
        env.update(self._storage_options)
    proc = subprocess.Popen(
        f"s5cmd cp {remote_filepath} {local_filepath}",
        shell=True,
        stdout=subprocess.PIPE,
        env=env,
    )
    proc.wait()
Hopefully there's a workaround for passing the required args to s5cmd that I missed.
I suppose the minimal fix would be to allow the option to force using the S3Client, but that seems less desirable.
Otherwise the construction of the s5cmd command might need to be modified, something like the following
from botocore import UNSIGNED

if self._s5cmd_available:
    env = os.environ.copy()
    extra_args = ""
    if self._storage_options:
        unsigned = self._storage_options.get("unsigned")
        config = self._storage_options.get("config")
        # config may be missing, so default to None instead of raising AttributeError
        signature_version = getattr(config, "signature_version", None)
        endpoint_url = self._storage_options.get("endpoint_url")
        if unsigned or signature_version == UNSIGNED:
            extra_args += "--no-sign-request "
        if endpoint_url:
            extra_args += f"--endpoint-url {endpoint_url} "
    # s5cmd expects these global flags before the `cp` subcommand
    cmd = f"s5cmd {extra_args.strip()} cp {remote_filepath} {local_filepath}"
    proc = subprocess.Popen(
        cmd,
        shell=True,
        stdout=subprocess.PIPE,
        env=env,
    )
    proc.wait()
But this might be a headache to maintain if other input arguments that I'm not anticipating (bucket region?) need to be taken into account.
It might just be simpler to add an s5cmd_options argument to StreamingDataset, which in this case would have been s5cmd_options = "--no-sign-request --endpoint-url https://s3.wasabisys.com"
I guess the downside of this option is that, as a user, you have to know which backend is in use (S3Client or s5cmd), and you might want to support both.
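To make the s5cmd_options idea concrete, here is a minimal sketch of how such a string could be spliced into the command. Both `s5cmd_options` and `build_s5cmd_cp` are hypothetical names and not part of litdata today. The command is built as an argument list rather than a shell string, so paths and option values containing spaces or shell metacharacters are passed through unchanged.

```python
import shlex
import subprocess
from typing import List


def build_s5cmd_cp(remote_filepath: str, local_filepath: str,
                   s5cmd_options: str = "") -> List[str]:
    """Return the argv for `s5cmd [global options] cp <remote> <local>`."""
    # shlex.split honours shell-style quoting inside the options string
    return ["s5cmd", *shlex.split(s5cmd_options), "cp", remote_filepath, local_filepath]


def download(remote_filepath: str, local_filepath: str, s5cmd_options: str = "") -> None:
    """Run the copy without a shell; raises CalledProcessError on failure."""
    cmd = build_s5cmd_cp(remote_filepath, local_filepath, s5cmd_options)
    subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
```

With the options string from above, `build_s5cmd_cp` yields the same invocation as the manual s5cmd command that works against Wasabi.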
Expected behavior
I would expect a given set of storage_options to work for both supported backends (S3Client and s5cmd), or for there to be configuration options that pass the required args to s5cmd. For example, it would be great if something like this worked with either S3Client or s5cmd:
from litdata import StreamingDataset
storage_options = {
    "endpoint_url": "https://s3.wasabisys.com",
    "unsigned": True,
}
dataset = StreamingDataset("s3://visionlab-litdata/mnist/val", storage_options=storage_options)
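One way this could work is for the downloader to translate the same dictionary once per backend. The sketch below is only an illustration under stated assumptions: the `unsigned` key and the helper names `s5cmd_flags` and `boto3_client_kwargs` are hypothetical, and only the two settings discussed in this issue are handled.

```python
from typing import Any, Dict, List


def s5cmd_flags(storage_options: Dict[str, Any]) -> List[str]:
    """Global s5cmd flags implied by the shared options."""
    flags: List[str] = []
    if storage_options.get("unsigned"):
        flags.append("--no-sign-request")
    endpoint_url = storage_options.get("endpoint_url")
    if endpoint_url:
        flags += ["--endpoint-url", endpoint_url]
    return flags


def boto3_client_kwargs(storage_options: Dict[str, Any]) -> Dict[str, Any]:
    """Keyword arguments for boto3.client("s3", ...) implied by the same options."""
    # "unsigned" is not a boto3 argument, so it is removed and translated below
    kwargs = {k: v for k, v in storage_options.items() if k != "unsigned"}
    if storage_options.get("unsigned"):
        # Imported lazily so the s5cmd path does not require botocore
        from botocore import UNSIGNED
        from botocore.client import Config

        unsigned_cfg = Config(signature_version=UNSIGNED)
        existing = kwargs.get("config")
        # Keep any user-supplied Config settings and add the unsigned signature
        kwargs["config"] = existing.merge(unsigned_cfg) if existing is not None else unsigned_cfg
    return kwargs
```

The user then never needs to know which backend is active: the flags go in front of `cp` for s5cmd, and the keyword arguments go to the S3 client.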
Environment detail
- Litdata Version: latest (0.2.42)
- s5cmd Version: 0.2.0
- OS: Linux lightning.ai studio