Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Local object store accepts file:/// as base path, but LocalStore returns meta without the prefix. #1923

Closed
Igosuki opened this issue Mar 4, 2022 · 2 comments
Labels
bug Something isn't working

Comments

@Igosuki
Copy link
Contributor

Igosuki commented Mar 4, 2022

Describe the bug
One can register a table with the file scheme file://, this in turns allows listing table to list files and find partitions.
Unfortunately, LocalStore returns a FileMetaStream where the SizedFile path has the prefix stripped. This could be fine except `datafusion::datasource::listing::helpers::parse_partitions_for_path``` calls strip_prefix on the file_path with the original path used to register the table, which contains the scheme.

There are two ways to fix this, either strip the scheme off the path in the registered table as well (would probably be best to let the ObjectStore implementation do that), or enhance FileMeta and use a URI instead of just a path.

To Reproduce
Steps to reproduce the behavior:

/tmp/listing_table/part1=value1/ and /tmp/listing_table/part1=value2/
should contain one parquet file each

let mut ctx = ExecutionContext::new();
        let listing_options = ListingOptions {
            file_extension: "parquet".to_string(),
            format: Arc::new(ParquetFormat::default()),
            table_partition_cols: vec!["part1"],
            collect_stat: true,
            target_partitions: 8,
        };
        ctx.register_listing_table(
            "my_table",
            "file:///tmp/listing_table",
            listing_options,
            None,
        )
        .await?;

       let df = ctx.sql("select count(*) from my_table").await?;
       let rb = df.collect().await?;
       eprintln!("rb = {:?}", rb);

Expected behavior
The above should count the lines in the files properly, with the current behavior it'll return 0.

Additional context
I'm trying to be consistent on my project and so I use schemes for both local and remote files. Finding this debug required a lot of debugging.

@Igosuki Igosuki added the bug Something isn't working label Mar 4, 2022
@houqp
Copy link
Member

houqp commented Mar 13, 2022

We ran into the same issue in delta-rs. I think the ideal solution would be to normalize table path within the objecstore implementation when it's being created.

The issue with using URI in fileMeta is all the object stores' list calls do return full URIs, so we will have to perform a lot of string creations in the heap to construct the URIs, this could be expensive when we need to deal with millions of files.

@tustvold
Copy link
Contributor

I believe this was implemented by a combination of #2578 and #2677. These standardised the interpretation of the file URIs and paths respectively

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants