Error with partitioned dataset: Cannot infer common argument type for comparison operation Binary != Utf8 #7039

alamb · 2023-07-20T11:14:35Z

Describe the bug

When running the following query (from ClickBench) on the partitioned dataset (100 parquet files)

SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits_partitioned WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;

I get the following error:

Error during planning: Cannot infer common argument type for comparison operation Binary != Utf8

To Reproduce

Get the data using bench.sh (after #7005 is merged)

bench.sh data clickbench_1
bench.sh data clickbench_multi

CREATE EXTERNAL TABLE hits_partitioned
STORED AS PARQUET
LOCATION 'hits_partitioned';

SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits_partitioned WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;

Expected behavior

The query works fine with the single file dataset. I expect the same error

-- Single file parquet
CREATE EXTERNAL TABLE hits_single
STORED AS PARQUET
LOCATION 'hits.parquet';

-- Single file works great
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits_single WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;

+------------------------------+---------+
| hits_single.MobilePhoneModel | u |
+------------------------------+---------+
| iPad | 1090347 |
| iPhone | 45758 |
| A500 | 16046 |
| N8-00 | 5565 |
| iPho | 3300 |
| ONE TOUCH 6030A | 2759 |
| GT-P7300B | 1907 |
| 3110000 | 1871 |
| GT-I9500 | 1598 |
| eagle75 | 1492 |
+------------------------------+---------+

Additional context

I found this while working on some benchmark results for #6988

The text was updated successfully, but these errors were encountered:

jonahgao · 2023-07-24T14:10:18Z

The schemas of these two datasets in Parquet format are different.

$ parquet-schema hits.parquet | grep MobilePhoneModel
  REQUIRED BYTE_ARRAY MobilePhoneModel (STRING);

$ parquet-schema hits_partitioned/hits_0.parquet | grep MobilePhoneModel
  OPTIONAL BYTE_ARRAY MobilePhoneModel;

The single file dataset has specified the logical type as String for MobilePhoneModel, but the partitioned one does not have that specification.

In the partitioned dataset, the type inference for the MobilePhoneModel is Binary.

I think we may need to support type coercion from Utf8 to Binary here.

alamb · 2023-07-24T19:23:41Z

Thank you for the investigation @jonahgao -- think coercion for binary --> UTF8 would make sense. I will try and find some time to work on this if no one beats me to it.

🤔 Or maybe I will file a ticket and see if someone else wants to work on it.

jonahgao · 2023-07-25T06:13:05Z

#7080 should make the query execution successful.
But the MobilePhoneModel displayed in binary format.

+-----------------------------------+---------+
| hits_partitioned.MobilePhoneModel | u       |
+-----------------------------------+---------+
| 69506164                          | 1090347 |
| 6950686f6e65                      | 45758   |
| 41353030                          | 16046   |
| 4e382d3030                        | 5565    |
| 6950686f                          | 3300    |
| 4f4e4520544f554348203630333041    | 2759    |
| 47542d503733303042                | 1907    |
| 33313130303030                    | 1871    |
| 47542d4939353030                  | 1598    |
| 6561676c653735                    | 1492    |
+-----------------------------------+---------+

It would be better if we could manually specify the data type of the columns, which is not currently supported for Parquet.
I think that we can do this safely when we are certain that the binary data is valid UTF-8.

create external table hits_partitioned(
    "UserID" bigint,
    "MobilePhoneModel" text
)
STORED AS PARQUET
LOCATION 'hits_partitioned';

https://github.com/apache/arrow-datafusion/blob/1a0542acbc01e5243471ae0fc3586c2f1f40013b/datafusion/sql/src/statement.rs#L636-L640

alamb · 2023-07-25T12:28:58Z

I think that we can do this safely when we are certain that the binary data is valid UTF-8.

I agree -- I think it would be good to cast the column into UTF8 (erroring if it contains non UTF8 data)

For this particular query we could add an explicit cast like ::varchar perhaps like the following

SELECT 
  "MobilePhoneModel"::varchar, 
  COUNT(DISTINCT "UserID") AS u 
FROM hits_partitioned 
WHERE "MobilePhoneModel" <> '' 
GROUP BY "MobilePhoneModel" 
ORDER BY u DESC LIMIT 10;

jonahgao · 2023-07-25T14:35:18Z

For this particular query we could add an explicit cast like ::varchar perhaps like the following

This must be the best way 👍 .

alamb added the bug Something isn't working label Jul 20, 2023

alamb changed the title ~~ClickBench error with partitioned dataset: Cannot infer common argument type for comparison operation Binary != Utf8~~ Error with partitioned dataset: Cannot infer common argument type for comparison operation Binary != Utf8 Jul 24, 2023

jonahgao mentioned this issue Jul 25, 2023

Add Utf8->Binary type coercion for comparison #7080

Merged

alamb assigned jonahgao Jul 25, 2023

alamb closed this as completed in #7080 Jul 25, 2023

alamb mentioned this issue Jul 30, 2023

Update Clickbench results with DataFusion 28.0.0 #7108

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error with partitioned dataset: Cannot infer common argument type for comparison operation Binary != Utf8 #7039

Error with partitioned dataset: Cannot infer common argument type for comparison operation Binary != Utf8 #7039

alamb commented Jul 20, 2023

jonahgao commented Jul 24, 2023 •

edited

Loading

alamb commented Jul 24, 2023

jonahgao commented Jul 25, 2023 •

edited

Loading

alamb commented Jul 25, 2023

jonahgao commented Jul 25, 2023

Error with partitioned dataset: Cannot infer common argument type for comparison operation Binary != Utf8 #7039

Error with partitioned dataset: Cannot infer common argument type for comparison operation Binary != Utf8 #7039

Comments

alamb commented Jul 20, 2023

Describe the bug

To Reproduce

Expected behavior

Additional context

jonahgao commented Jul 24, 2023 • edited Loading

alamb commented Jul 24, 2023

jonahgao commented Jul 25, 2023 • edited Loading

alamb commented Jul 25, 2023

jonahgao commented Jul 25, 2023

jonahgao commented Jul 24, 2023 •

edited

Loading

jonahgao commented Jul 25, 2023 •

edited

Loading