-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Description
Describe the bug
I came upon a bug while querying my custom Parquet dataset, which causes DataFusion to produce incoherent and incorrect results.
I tested my dataset in various ways, all of which produced the desired results:
- reading parquet files using python pandas, then merging and filtering the data there
- encoding into CSV, and reading the data with DataFusion
- creating an SQLite database using the provided CSV files, and using the same queries there
To Reproduce
Steps to reproduce the behavior:
- Download all the code and data I used for testing:
Inside there are the Parquet files and CSVs with exactly the same data (also, there's an sqlite database created from the provided CSV files).
- Use the instructions included in
README.mdto reproduce the issue:
The query, that fails when querying Parquet files with datafusion-cli:
-- 1. Distinct stop names
SELECT DISTINCT stop_name FROM stop INNER JOIN trip ON tid = trip_tid WHERE line = '176' ORDER BY stop_name NULLS LAST;Change only in where from line to trip_line produces the desired results.
Expected behavior
Should produce these 27 rows:
Bartnicza
Bazyliańska
Bolesławicka
Brzezińska
Budowlana
Choszczówka
Chłodnia
Daniszewska
Fabryka Pomp
Insurekcji
Marcelin
Marywilska-Las
Ołówkowa
PKP Płudy
PKP Żerań
Parowozowa
Pelcowizna
Polnych Kwiatów
Raciborska
Rembielińska
Sadkowska
Smugowa
Starego Dębu
Zyndrama z Maszkowic
os.Marywilska
Śpiewaków
None
Query 1 from README.md (mentioned above) produces this incorrect set of 33 rows:
+----------------------+
| stop_name |
+----------------------+
| Bartnicza |
| Bazyliańska |
| Bolesławicka |
| Brzezińska |
| Budowlana |
| Choszczówka |
| Chłodnia |
| Cygańska |
| Czołgistów |
| Daniszewska |
| Fabryka Pomp |
| Insurekcji |
| Majerankowa |
| Marcelin |
| Marywilska-Las |
| Ołówkowa |
| PKP Falenica |
| PKP Płudy |
| PKP Żerań |
| Parowozowa |
| Pelcowizna |
| Polnych Kwiatów |
| Raciborska |
| Rembielińska |
| Rokosowska |
| Sadkowska |
| Smugowa |
| Starego Dębu |
| Zbójna Góra |
| Zyndrama z Maszkowic |
| os.Marywilska |
| Śpiewaków |
| |
+----------------------+
Additional context
Datafusion version:
$ datafusion-cli --version
DataFusion 5.1.0My guess
Since the Parquet files have encoded NULLs, and reading the CSV files with datafusion-cli gets rid of those, my best bet is on the usage of NULLs and some weir behavior when joining.