GCSFS reports directory as FileNotFoundError when it exists. Run 1 fails, run 2 succeeds. Caching? #632

Open
pascalwhoop opened this issue Jul 18, 2024 · 5 comments


pascalwhoop commented Jul 18, 2024

Hi,
We went down a rabbit hole trying to find this one.
apache/arrow#31339
Turns out Pandas can't read partitioned parquet files from a directory because of PyArrow using GCSFS.

However in this repo there seems to be no mention of this. Are you aware of any situation where the library is non-deterministic/has caching issues when listing a directory?

import gcsfs
PATH = "bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes"
fs = gcsfs.GCSFileSystem()
print(fs.info(PATH))
print(fs.info(PATH))
print(fs.info(PATH))

Returns:

{'kind': 'storage#object', 'id': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes//1721313663057121', 'selfLink': 'https://www.googleapis.com/storage/v1/b/bucket-dev-storage/o/kedro%2Fstaging%2Fdata%2F05_model_input%2Fdrugs_diseases_nodes%2F', 'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/bucket-dev-storage/o/kedro%2Fstaging%2Fdata%2F05_model_input%2Fdrugs_diseases_nodes%2F?generation=1721313663057121&alt=media', 'name': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes/', 'bucket': 'bucket-dev-storage', 'generation': '1721313663057121', 'metageneration': '1', 'contentType': 'application/octet-stream', 'storageClass': 'STANDARD', 'size': 0, 'md5Hash': '1B2M2Y8AsgTpgAmY7PhCfg==', 'crc32c': 'AAAAAA==', 'etag': 'COH5u4vpsIcDEAE=', 'timeCreated': '2024-07-18T14:41:03.059Z', 'updated': '2024-07-18T14:41:03.059Z', 'timeStorageClassUpdated': '2024-07-18T14:41:03.059Z', 'type': 'file'}
{'bucket': 'bucket-dev-storage', 'name': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes', 'size': 0, 'storageClass': 'DIRECTORY', 'type': 'directory'}
{'bucket': 'bucket-dev-storage', 'name': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes', 'size': 0, 'storageClass': 'DIRECTORY', 'type': 'directory'}

Note that the first call returns a different result from calls 2 and 3. What's up with that?

@martindurant
Member

You seem to have a key and a directory with the same name, which is unfortunate. While it is unclear which of these gcsfs should return with info(), I agree that it should be consistent.

For the original issue over in arrow, I can point out that the following works fine:

pd.read_parquet("bucket/partitioned.parq", filesystem=fs)

i.e., specifying the filesystem rather than providing a protocol prefix. (also, fastparquet has no problem with any of the possible forms!)
This is because every filesystem has its own internal convention for naming paths, and apparently arrow is not using something like fsspec.url_to_fs to find what the root path should be, or otherwise processing the path. Consider, for example, that "gcs" is the conventional prefix for gcsfs, but other frameworks (particularly the gcloud CLI) use "gs", so automatically re-adding the prefix isn't straightforward.
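
To illustrate the prefix point, here is a minimal sketch in plain Python, not gcsfs or arrow code. The helpers `split_protocol` and `to_bare_path` and the `GCS_ALIASES` set are hypothetical names made up for this example. It only shows that several URL spellings collapse to the same bare path, so there is no single obvious prefix to put back.

```python
# Hypothetical helpers for illustration only; not part of gcsfs, fsspec or arrow.
GCS_ALIASES = {"gs", "gcs"}  # assumption: both prefixes refer to the same store


def split_protocol(url):
    """Split 'proto://path' into (proto, path); proto is None for a bare path."""
    if "://" in url:
        proto, path = url.split("://", 1)
        return proto, path
    return None, url


def to_bare_path(url):
    """Drop a GCS-style prefix, leaving the 'bucket/key' form."""
    proto, path = split_protocol(url)
    if proto is not None and proto not in GCS_ALIASES:
        raise ValueError(f"not a GCS-style URL: {url!r}")
    return path


urls = [
    "gs://bucket/partitioned.parq",
    "gcs://bucket/partitioned.parq",
    "bucket/partitioned.parq",
]
# All three spellings map to one bare path, so the original prefix is lost.
bare = {to_bare_path(u) for u in urls}
print(bare)
```

Going from the bare path back to a URL would require picking one of the prefixes, which is presumably why passing the filesystem object explicitly sidesteps the problem.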

@martindurant
Member

(note that no one asked for my opinion in the upstream arrow thread)

@martindurant
Member

Also: I am not able to reproduce your behaviour with or without a placeholder directory. Can you try to make a full reproducer, please?

@pascalwhoop
Author

That's curious, when you say "key and directory with the same name" does that mean we wrote that dir as a key first?

For context we're using pyspark to write a dataset to this path. I can imagine it creates a placeholder there first, although running spark against object storage is a pretty common scenario.

@martindurant
Member

when you say "key and directory with the same name" does that mean we wrote that dir as a key first?

I can't say how it came to be, only that I suppose you have both a key called "bucket/path" and stuff with names like "bucket/path/..." which also implies the directory.
Now, as I say, I tried also making such a key, but did not see any problem doing info() on it afterward.
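
For readers following along, the "key and directory with the same name" situation can be sketched with a toy in-memory model. This is not how gcsfs is implemented, and the paths, sizes and function names (`info_by_object`, `info_by_listing`) are invented for the example. It only shows how a zero-byte placeholder object and a prefix listing can give two different answers for the same path, depending on which lookup is consulted.

```python
# Toy model of a flat object store: key name -> size in bytes.
# The keys below are made up for illustration.
objects = {
    "bucket/data/nodes/": 0,  # zero-byte placeholder object
    "bucket/data/nodes/part-0000.parquet": 1024,
}


def info_by_object(path):
    """Look for an object named exactly `path` or `path/`."""
    for key in (path, path + "/"):
        if key in objects:
            return {"name": key, "size": objects[key], "type": "file"}
    return None


def info_by_listing(path):
    """Infer a directory if any other key lives under the `path/` prefix."""
    prefix = path.rstrip("/") + "/"
    if any(k.startswith(prefix) and k != prefix for k in objects):
        return {"name": path.rstrip("/"), "size": 0, "type": "directory"}
    return None


path = "bucket/data/nodes"
first = info_by_object(path)    # sees the placeholder object
second = info_by_listing(path)  # sees the implied directory
print(first["type"], second["type"])
```

In this toy model the two lookups disagree on the type, which resembles the difference between the first and later `info()` results reported above. Whether that is what actually happens inside gcsfs is not established in this thread.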


2 participants