Skip to content

Possible bug in Parquet pruning code? #8309

@maxburke

Description

@maxburke

Describe the bug

I'm running into a weird issue with a datafusion query on a parquet file where if I select with a condition testing certain values, I get no results back.

To Reproduce

For example:

❯ create external table t0 stored as parquet location 'a7546b6b206d882e928a1325f8cbcce4.parquet';
0 rows in set. Query took 0.019 seconds.
❯ select distinct direction from t0;
+-----------+
| direction |
+-----------+
| Merged    |
| Two Way   |
| Outgoing  |
| Incoming  |
+-----------+
4 rows in set. Query took 0.015 seconds.
❯ select * from t0 where "direction" = 'Outgoing';
+--------------------------+-----------+-------+
| ul_node_id               | direction | adt   |
+--------------------------+-----------+-------+
| XRLDMlrMwNT4jrCE2hzvbA== | Outgoing  | 222   |
| uuzYwO69bCHsouZHOeG8Wg== | Outgoing  | 178   |
...snip...
| YlaylK0fQqkfvKbIONzkJQ== | Outgoing  | 115   |
| AUsoU3ojLVrmgRJhuypwMA== | Outgoing  | 649   |
+--------------------------+-----------+-------+
100 rows in set. Query took 0.013 seconds.
❯ select * from t0 where "direction" = 'Merged';
+--------------------------+-----------+------+
| ul_node_id               | direction | adt  |
+--------------------------+-----------+------+
| YxRwadNzW7uJht9xp4g46Q== | Merged    | 5605 |
+--------------------------+-----------+------+
1 row in set. Query took 0.010 seconds.
❯ select * from t0 where "direction" = 'Incoming';
0 rows in set. Query took 0.007 seconds.
❯ select * from t0 where "direction" = 'Two Way';
0 rows in set. Query took 0.006 seconds.

I've checked to see if there's trailing whitespace in the values "Incoming" and "Two Way in the table and there isn't. pandas seems to be able to do similar queries just fine:

>>> df[df['direction'] == "Two Way"]
                   ul_node_id direction    adt
1    XRLDMlrMwNT4jrCE2hzvbA==   Two Way    420
5    uuzYwO69bCHsouZHOeG8Wg==   Two Way    337
8    crHf6lzksh5sVFw8/Sf5Ew==   Two Way    324
11   OSS+4pM008wnvNH/c19uHg==   Two Way   4674
12   NF20BBZ0OjDvIbcGxQuhwQ==   Two Way  23116
..                        ...       ...    ...
288  kvQm8I+WxR9C1Il1XxyvbQ==   Two Way    204
291  RCYo+XLiUXnvL7tb8pml3w==   Two Way   5777
294  YTsSJkf0qT7tvaosC0gRhA==   Two Way   4409
297  YlaylK0fQqkfvKbIONzkJQ==   Two Way    336
298  AUsoU3ojLVrmgRJhuypwMA==   Two Way   1049

[101 rows x 3 columns]

The parquet file above is attached.
a7546b6b206d882e928a1325f8cbcce4.parquet.zip

Expected behavior

The query should return values.

Additional context

Andy Grove demonstrated that disabling Parquet pruning causes the query to return the correct values: https://the-asf.slack.com/archives/C01QUFS30TD/p1700671664714069

$ export DATAFUSION_EXECUTION_PARQUET_PRUNING=false

❯ select count(*) from "test.parquet" where direction = 'Incoming';
+----------+
| COUNT(*) |
+----------+
| 99       |
+----------+
1 row in set. Query took 0.024 seconds.

$ export DATAFUSION_EXECUTION_PARQUET_PRUNING=true

❯ select count(*) from "test.parquet" where direction = 'Incoming';
+----------+
| COUNT(*) |
+----------+
| 0        |
+----------+
1 row in set. Query took 0.023 seconds.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions