Description
Describe the bug
The behaviour of ListingTableUrl with respect to paths containing percent characters is rather confusing, and I suspect not entirely intentional.
Consider a filesystem containing a file named bar%2Ffoo: there is actually no obvious way to address this file.
// A `%2F` in a file URL is decoded into a path separator
let url = ListingTableUrl::parse("file:///foo/bar%2Ffoo").unwrap();
assert_eq!(url.prefix.as_ref(), "foo/bar/foo");
// A `%252F` in a file URL is kept as-is rather than decoded to `%2F`
let url = ListingTableUrl::parse("file:///foo/a%252Fb.txt").unwrap();
assert_eq!(url.prefix.as_ref(), "foo/a%252Fb.txt");
// A filesystem path containing a literal `%2F` has its `%` escaped to `%25`
let dir = tempdir().unwrap();
let path = dir.path().join("bar%2Ffoo");
std::fs::File::create(&path).unwrap();
let url = ListingTableUrl::parse(path.to_str().unwrap()).unwrap();
assert!(url.prefix.as_ref().ends_with("bar%252Ffoo"), "{}", url.prefix);
To Reproduce
No response
Expected behavior
The "correct" behaviour is that a file URL should be URL-encoded. That is, according to the URL specification, the correct way to reference this path would be file:///foo/a%252Fb.txt. Similarly, the non-URL version should be foo/a%2Fb.txt.
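To make the encoding rule concrete, here is a minimal std-only sketch of percent-encoding and single-pass decoding of a path segment. The helper names `encode_segment`, `decode_once` and `hex_val` are hypothetical and written for illustration only; they are not part of DataFusion, `object_store` or the `url` crate, and the set of characters left unescaped is a conservative assumption.

```rust
/// Hypothetical helper: value of one ASCII hex digit, if it is one.
fn hex_val(b: u8) -> Option<u8> {
    (b as char).to_digit(16).map(|d| d as u8)
}

/// Hypothetical helper: percent-encode one path segment.
/// Only ASCII letters, digits and `-._~` are left as-is (an assumption
/// chosen for simplicity), so both `%` and `/` are always escaped.
fn encode_segment(segment: &str) -> String {
    let mut out = String::new();
    for b in segment.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Hypothetical helper: decode percent escapes exactly once.
/// Malformed or truncated escapes are copied through unchanged.
fn decode_once(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            if let (Some(hi), Some(lo)) = (hex_val(bytes[i + 1]), hex_val(bytes[i + 2])) {
                out.push(hi * 16 + lo);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    String::from_utf8_lossy(&out).into_owned()
}
```

Under these assumptions, `encode_segment("a%2Fb.txt")` yields `a%252Fb.txt` and `decode_once("a%252Fb.txt")` yields `a%2Fb.txt` again, so the file name round-trips. By contrast, `decode_once("bar%2Ffoo")` yields `bar/foo`, which is how a literal `%2F` in a file name ends up being read as a path separator.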
That being said, various tools instead interpret the URL path verbatim:
$ touch 'a%2Fb.txt'
$ aws --endpoint-url=http://localhost:4566 s3 cp 'a%2Fb.txt' s3://tustvold/
$ aws --endpoint-url=http://localhost:4566 s3 ls s3://tustvold/
2023-10-31 15:40:13 0 a%2Fb.txt
$ aws --endpoint-url=http://localhost:4566 s3 cp 's3://tustvold/a%2Fb.txt' foo.txt
$ gsutil cp a\%2Fb.txt gs://tustvold
$ gsutil cp gs://tustvold/a\%2Fb.txt test
I'm not entirely sure how to classify DataFusion's current behaviour other than confusing. I think we should probably strive to replicate tools like the aws-cli and gsutil.
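For illustration, the verbatim interpretation could look roughly like the following std-only sketch, which splits a URL-like string into scheme, authority and path without decoding anything. The function `split_verbatim` is hypothetical; it is not how the aws-cli, gsutil or DataFusion are implemented, and it deliberately ignores queries, fragments and other URL details.

```rust
/// Hypothetical helper: split `scheme://authority/path` and return the path
/// exactly as written, with no percent-decoding applied.
/// Returns `None` when the input has no `://` separator.
fn split_verbatim(url: &str) -> Option<(&str, &str, &str)> {
    let (scheme, rest) = url.split_once("://")?;
    // Everything up to the first `/` is the authority (e.g. the bucket);
    // the remainder is the object key or file path, taken verbatim.
    let (authority, path) = rest.split_once('/').unwrap_or((rest, ""));
    Some((scheme, authority, path))
}
```

With this sketch, `split_verbatim("s3://tustvold/a%2Fb.txt")` returns `Some(("s3", "tustvold", "a%2Fb.txt"))`, so the key matches the object name listed by the tools above, and `split_verbatim("file:///foo/bar%2Ffoo")` returns `Some(("file", "", "foo/bar%2Ffoo"))`.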
Additional context
The current behaviour was added in #3750