-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Closed
Labels
bugSomething isn't workingSomething isn't working
Description
Describe the bug
I'm running into a weird issue with a datafusion query on a parquet file where if I select with a condition testing certain values, I get no results back.
To Reproduce
For example:
❯ create external table t0 stored as parquet location 'a7546b6b206d882e928a1325f8cbcce4.parquet';
0 rows in set. Query took 0.019 seconds.
❯ select distinct direction from t0;
+-----------+
| direction |
+-----------+
| Merged |
| Two Way |
| Outgoing |
| Incoming |
+-----------+
4 rows in set. Query took 0.015 seconds.
❯ select * from t0 where "direction" = 'Outgoing';
+--------------------------+-----------+-------+
| ul_node_id | direction | adt |
+--------------------------+-----------+-------+
| XRLDMlrMwNT4jrCE2hzvbA== | Outgoing | 222 |
| uuzYwO69bCHsouZHOeG8Wg== | Outgoing | 178 |
...snip...
| YlaylK0fQqkfvKbIONzkJQ== | Outgoing | 115 |
| AUsoU3ojLVrmgRJhuypwMA== | Outgoing | 649 |
+--------------------------+-----------+-------+
100 rows in set. Query took 0.013 seconds.
❯ select * from t0 where "direction" = 'Merged';
+--------------------------+-----------+------+
| ul_node_id | direction | adt |
+--------------------------+-----------+------+
| YxRwadNzW7uJht9xp4g46Q== | Merged | 5605 |
+--------------------------+-----------+------+
1 row in set. Query took 0.010 seconds.
❯ select * from t0 where "direction" = 'Incoming';
0 rows in set. Query took 0.007 seconds.
❯ select * from t0 where "direction" = 'Two Way';
0 rows in set. Query took 0.006 seconds.
I've checked to see if there's trailing whitespace in the values "Incoming" and "Two Way in the table and there isn't. pandas seems to be able to do similar queries just fine:
>>> df[df['direction'] == "Two Way"]
ul_node_id direction adt
1 XRLDMlrMwNT4jrCE2hzvbA== Two Way 420
5 uuzYwO69bCHsouZHOeG8Wg== Two Way 337
8 crHf6lzksh5sVFw8/Sf5Ew== Two Way 324
11 OSS+4pM008wnvNH/c19uHg== Two Way 4674
12 NF20BBZ0OjDvIbcGxQuhwQ== Two Way 23116
.. ... ... ...
288 kvQm8I+WxR9C1Il1XxyvbQ== Two Way 204
291 RCYo+XLiUXnvL7tb8pml3w== Two Way 5777
294 YTsSJkf0qT7tvaosC0gRhA== Two Way 4409
297 YlaylK0fQqkfvKbIONzkJQ== Two Way 336
298 AUsoU3ojLVrmgRJhuypwMA== Two Way 1049
[101 rows x 3 columns]
The parquet file above is attached.
a7546b6b206d882e928a1325f8cbcce4.parquet.zip
Expected behavior
The query should return values.
Additional context
Andy Grove demonstrated that disabling Parquet pruning causes the query to return the correct values: https://the-asf.slack.com/archives/C01QUFS30TD/p1700671664714069
$ export DATAFUSION_EXECUTION_PARQUET_PRUNING=false
❯ select count(*) from "test.parquet" where direction = 'Incoming';
+----------+
| COUNT(*) |
+----------+
| 99 |
+----------+
1 row in set. Query took 0.024 seconds.
$ export DATAFUSION_EXECUTION_PARQUET_PRUNING=true
❯ select count(*) from "test.parquet" where direction = 'Incoming';
+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+
1 row in set. Query took 0.023 seconds.
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working