GCSFS reports directory as FileNotFoundError when it exists. Run 1 fails, run 2 succeeds. Caching? #632

pascalwhoop · 2024-07-18T15:23:06Z

Hi,
We went down a rabbit hole trying to find this one.
apache/arrow#31339
Turns out Pandas can't read partitioned parquet files from a directory because of PyArrow using GCSFS.

However, there seems to be no mention of this in this repo. Are you aware of any situation where the library is non-deterministic or has caching issues when listing a directory?

import gcsfs

PATH = "bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes"
fs = gcsfs.GCSFileSystem()
# Call info() three times on the same path; the results should be identical.
print(fs.info(PATH))
print(fs.info(PATH))
print(fs.info(PATH))

Returns:

{'kind': 'storage#object', 'id': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes//1721313663057121', 'selfLink': 'https://www.googleapis.com/storage/v1/b/bucket-dev-storage/o/kedro%2Fstaging%2Fdata%2F05_model_input%2Fdrugs_diseases_nodes%2F', 'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/bucket-dev-storage/o/kedro%2Fstaging%2Fdata%2F05_model_input%2Fdrugs_diseases_nodes%2F?generation=1721313663057121&alt=media', 'name': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes/', 'bucket': 'bucket-dev-storage', 'generation': '1721313663057121', 'metageneration': '1', 'contentType': 'application/octet-stream', 'storageClass': 'STANDARD', 'size': 0, 'md5Hash': '1B2M2Y8AsgTpgAmY7PhCfg==', 'crc32c': 'AAAAAA==', 'etag': 'COH5u4vpsIcDEAE=', 'timeCreated': '2024-07-18T14:41:03.059Z', 'updated': '2024-07-18T14:41:03.059Z', 'timeStorageClassUpdated': '2024-07-18T14:41:03.059Z', 'type': 'file'}
{'bucket': 'bucket-dev-storage', 'name': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes', 'size': 0, 'storageClass': 'DIRECTORY', 'type': 'directory'}
{'bucket': 'bucket-dev-storage', 'name': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes', 'size': 0, 'storageClass': 'DIRECTORY', 'type': 'directory'}

Note that the first call returns a different result from the second and third: the first reports a file, the other two a directory. What's up with that?


martindurant · 2024-07-18T15:36:36Z

You seem to have a key and a directory with the same name, which is unfortunate. While it is unclear which of these gcsfs should return with info(), I agree that it should be consistent.

For the original issue over in arrow, I can point out that the following works fine:

pd.read_parquet("bucket/partitioned.parq", filesystem=fs)

i.e., specifying the filesystem rather than providing a protocol prefix. (Also, fastparquet has no problem with any of the possible forms!)
This is because every filesystem has its own internal convention for naming paths, and apparently arrow is not using something like fsspec.url_to_fs to find what the root path should be, or otherwise processing the path. Consider, for example, that "gcs" is the conventional prefix for gcsfs, but other frameworks (particularly the gcloud CLI) use "gs", so automatically re-adding the prefix isn't straightforward.
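To illustrate the prefix point, here is a minimal standard-library sketch. The helper `split_protocol` is hypothetical and is not how fsspec or arrow process paths; it only shows that once the prefix is stripped, "gcs://" and "gs://" URLs collapse to the same bare path, so the bare path alone does not say which prefix to put back.

```python
from typing import Optional, Tuple


def split_protocol(url: str) -> Tuple[Optional[str], str]:
    """Split 'proto://rest' into (proto, rest).

    Hypothetical helper for illustration only; returns (None, url)
    when the input carries no protocol prefix.
    """
    proto, sep, rest = url.partition("://")
    if not sep:
        return None, url
    return proto, rest


# Two spellings of the same location, plus the bare form a filesystem
# object might use internally.
gcs_proto, gcs_path = split_protocol("gcs://bucket/partitioned.parq")
gs_proto, gs_path = split_protocol("gs://bucket/partitioned.parq")
bare_proto, bare_path = split_protocol("bucket/partitioned.parq")

# All three reduce to the same bare path, but the prefixes differ,
# so there is no single "right" prefix to restore from the path alone.
print(gcs_proto, gs_proto, bare_proto)
print(gcs_path == gs_path == bare_path)
```

Passing the filesystem object explicitly sidesteps this, since the caller no longer has to guess a prefix.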

martindurant · 2024-07-18T15:37:01Z

(Note that no one asked for my opinion in the upstream arrow thread.)

martindurant · 2024-07-18T15:44:03Z

Also: I am not able to reproduce your behaviour with or without a placeholder directory. Can you try to make a full reproducer, please?

pascalwhoop · 2024-07-18T18:11:19Z

That's curious. When you say "key and directory with the same name", does that mean we wrote that dir as a key first?

For context, we're using PySpark to write a dataset to this path. I can imagine it creates a placeholder there first, although running Spark against object storage is a pretty common scenario.

martindurant · 2024-07-18T18:38:25Z

> when you say "key and directory with the same name" does that mean we wrote that dir as a key first?

I can't say how it came to be, only that I suppose you have both a key called "bucket/path" and stuff with names like "bucket/path/..." which also implies the directory.
Now, as I say, I tried also making such a key, but did not see any problem doing info() on it afterward.
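The ambiguity can be modelled with a small standard-library sketch. This is an assumption-laden toy, not gcsfs's implementation of info(): the function `describe` and the key names are hypothetical. It treats the bucket as a flat list of keys and reports every way a path could be interpreted. In the output posted above, the first result has a name ending in "/" and size 0, which would be consistent with the placeholder case, though that is only a guess.

```python
from typing import Dict, Iterable


def describe(keys: Iterable[str], path: str) -> Dict[str, bool]:
    """Report each way `path` could be interpreted in a flat key namespace.

    Hypothetical model for illustration; object stores have no real
    directories, only keys that happen to share a prefix.
    """
    path = path.rstrip("/")
    marker = path + "/"
    key_list = list(keys)
    return {
        # An object whose key is exactly `path`.
        "exact_key": path in key_list,
        # A zero-byte "directory marker" object whose key is `path/`.
        "placeholder_key": marker in key_list,
        # Any other key under `path/`, which implies a directory.
        "implied_directory": any(
            k.startswith(marker) and k != marker for k in key_list
        ),
    }


# Hypothetical bucket contents: a placeholder plus one data file.
keys = [
    "bucket/path/",
    "bucket/path/part-0.parquet",
]
result = describe(keys, "bucket/path")
print(result)
```

When more than one entry is true, a file-like answer and a directory-like answer are both defensible, which is why the choice needs to be made consistently.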
