Skip to content

Incorrect results in datafusion #1441

@franeklubi

Description

@franeklubi

Describe the bug
I came upon a bug while querying my custom Parquet dataset, which causes DataFusion to produce incoherent and incorrect results.

I tested my dataset in various ways, all of which produced the desired results:

  • reading parquet files using python pandas, then merging and filtering the data there
  • encoding into CSV, and reading the data with DataFusion
  • creating an SQLite database using the provided CSV files, and using the same queries there

To Reproduce
Steps to reproduce the behavior:

  1. Download all the code and data I used for testing:

issue_data.zip

Inside there are the Parquet files and CSVs with exactly the same data (also, there's an sqlite database created from the provided CSV files).

  1. Use the instructions included in README.md to reproduce the issue:

The query, that fails when querying Parquet files with datafusion-cli:

-- 1. Distinct stop names
SELECT DISTINCT stop_name FROM stop INNER JOIN trip ON tid = trip_tid WHERE line = '176' ORDER BY stop_name NULLS LAST;

Change only in where from line to trip_line produces the desired results.

Expected behavior
Should produce these 27 rows:

Bartnicza
Bazyliańska
Bolesławicka
Brzezińska
Budowlana
Choszczówka
Chłodnia
Daniszewska
Fabryka Pomp
Insurekcji
Marcelin
Marywilska-Las
Ołówkowa
PKP Płudy
PKP Żerań
Parowozowa
Pelcowizna
Polnych Kwiatów
Raciborska
Rembielińska
Sadkowska
Smugowa
Starego Dębu
Zyndrama z Maszkowic
os.Marywilska
Śpiewaków
None

Query 1 from README.md (mentioned above) produces this incorrect set of 33 rows:

+----------------------+
| stop_name            |
+----------------------+
| Bartnicza            |
| Bazyliańska          |
| Bolesławicka         |
| Brzezińska           |
| Budowlana            |
| Choszczówka          |
| Chłodnia             |
| Cygańska             |
| Czołgistów           |
| Daniszewska          |
| Fabryka Pomp         |
| Insurekcji           |
| Majerankowa          |
| Marcelin             |
| Marywilska-Las       |
| Ołówkowa             |
| PKP Falenica         |
| PKP Płudy            |
| PKP Żerań            |
| Parowozowa           |
| Pelcowizna           |
| Polnych Kwiatów      |
| Raciborska           |
| Rembielińska         |
| Rokosowska           |
| Sadkowska            |
| Smugowa              |
| Starego Dębu         |
| Zbójna Góra          |
| Zyndrama z Maszkowic |
| os.Marywilska        |
| Śpiewaków            |
|                      |
+----------------------+

Additional context
Datafusion version:

$ datafusion-cli --version
DataFusion 5.1.0

My guess
Since the Parquet files have encoded NULLs, and reading the CSV files with datafusion-cli gets rid of those, my best bet is on the usage of NULLs and some weir behavior when joining.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions