Cannot query parquet files generated by Apache Spark from datafusion-cli #1648

@andygrove

Describe the bug

I have a data set created by Apache Spark and I tried to query it from the DataFusion CLI. It failed, saying that a parquet file was corrupt.

❯ CREATE EXTERNAL TABLE store_sales STORED AS PARQUET LOCATION 'store_sales.dat';
0 rows in set. Query took 0.002 seconds.
❯ select count(*) from store_sales;
Parquet reader thread terminated due to error: ParquetError(General("Invalid Parquet file. Corrupt footer"))

I added some debug logging and found that it was actually trying to read the following file, which is not a Parquet file.

store_sales.dat/.part-00005-5142b177-bacb-499d-b14f-12de4b94d9d9-c000.snappy.parquet.crc

To Reproduce
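Create a non-Parquet file with a non-Parquet extension (for example, a .crc checksum file like the one Spark writes) and put it in a directory along with some valid Parquet files. Then register that directory as an external table in datafusion-cli and query it.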

Expected behavior
DataFusion should only try to read files with the .parquet file extension.
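
As a rough illustration of the expected behavior, the sketch below lists a directory and keeps only the entries whose extension is exactly `parquet`. It is not DataFusion's actual listing code; the helper names `is_parquet_file` and `list_parquet_files` are hypothetical, and the sketch uses only the Rust standard library.

```rust
use std::env;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: true only when the final extension is exactly `parquet`.
/// A Spark checksum file such as `.part-….snappy.parquet.crc` has the
/// extension `crc`, so it is rejected.
fn is_parquet_file(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map_or(false, |ext| ext == "parquet")
}

/// Hypothetical helper: collect the Parquet files directly inside `dir`,
/// ignoring subdirectories and files with any other extension.
fn list_parquet_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_file() && is_parquet_file(&path) {
            files.push(path);
        }
    }
    files.sort(); // deterministic order for display
    Ok(files)
}

fn main() -> io::Result<()> {
    // Directory to scan; defaults to the current directory.
    let dir = env::args().nth(1).unwrap_or_else(|| ".".to_string());
    for path in list_parquet_files(Path::new(&dir))? {
        println!("{}", path.display());
    }
    Ok(())
}
```

Running this against the `store_sales.dat` directory would print the `part-*.snappy.parquet` files and skip the `.crc` checksum files that triggered the "Corrupt footer" error.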

Additional context
None

Labels

bug, good first issue, help wanted