Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error with partitioned dataset: Cannot infer common argument type for comparison operation Binary != Utf8 #7039

Closed
alamb opened this issue Jul 20, 2023 · 5 comments · Fixed by #7080
Assignees
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Jul 20, 2023

Describe the bug

When running the following query (from ClickBench) on the partitioned dataset (100 parquet files)

SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits_partitioned WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;

I get the following error:

Error during planning: Cannot infer common argument type for comparison operation Binary != Utf8

To Reproduce

Get the data using bench.sh (after #7005 is merged)

bench.sh data clickbench_1
bench.sh data clickbench_multi
CREATE EXTERNAL TABLE hits_partitioned
STORED AS PARQUET
LOCATION 'hits_partitioned';

SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits_partitioned WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;

Expected behavior

The query works fine with the single file dataset. I expect the same error

-- Single file parquet
CREATE EXTERNAL TABLE hits_single
STORED AS PARQUET
LOCATION 'hits.parquet';

-- Single file works great
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits_single WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;

+------------------------------+---------+
| hits_single.MobilePhoneModel | u |
+------------------------------+---------+
| iPad | 1090347 |
| iPhone | 45758 |
| A500 | 16046 |
| N8-00 | 5565 |
| iPho | 3300 |
| ONE TOUCH 6030A | 2759 |
| GT-P7300B | 1907 |
| 3110000 | 1871 |
| GT-I9500 | 1598 |
| eagle75 | 1492 |
+------------------------------+---------+


Additional context

I found this while working on some benchmark results for #6988

@alamb alamb added the bug Something isn't working label Jul 20, 2023
@alamb alamb changed the title ClickBench error with partitioned dataset: Cannot infer common argument type for comparison operation Binary != Utf8 Error with partitioned dataset: Cannot infer common argument type for comparison operation Binary != Utf8 Jul 24, 2023
@jonahgao
Copy link
Member

jonahgao commented Jul 24, 2023

The schemas of these two datasets in Parquet format are different.

$ parquet-schema hits.parquet | grep MobilePhoneModel
  REQUIRED BYTE_ARRAY MobilePhoneModel (STRING);

$ parquet-schema hits_partitioned/hits_0.parquet | grep MobilePhoneModel
  OPTIONAL BYTE_ARRAY MobilePhoneModel;

The single file dataset has specified the logical type as String for MobilePhoneModel, but the partitioned one does not have that specification.

In the partitioned dataset, the type inference for the MobilePhoneModel is Binary.

I think we may need to support type coercion from Utf8 to Binary here.

@alamb
Copy link
Contributor Author

alamb commented Jul 24, 2023

Thank you for the investigation @jonahgao -- think coercion for binary --> UTF8 would make sense. I will try and find some time to work on this if no one beats me to it.

🤔 Or maybe I will file a ticket and see if someone else wants to work on it.

@jonahgao
Copy link
Member

jonahgao commented Jul 25, 2023

#7080 should make the query execution successful.
But the MobilePhoneModel displayed in binary format.

+-----------------------------------+---------+
| hits_partitioned.MobilePhoneModel | u       |
+-----------------------------------+---------+
| 69506164                          | 1090347 |
| 6950686f6e65                      | 45758   |
| 41353030                          | 16046   |
| 4e382d3030                        | 5565    |
| 6950686f                          | 3300    |
| 4f4e4520544f554348203630333041    | 2759    |
| 47542d503733303042                | 1907    |
| 33313130303030                    | 1871    |
| 47542d4939353030                  | 1598    |
| 6561676c653735                    | 1492    |
+-----------------------------------+---------+

It would be better if we could manually specify the data type of the columns, which is not currently supported for Parquet.
I think that we can do this safely when we are certain that the binary data is valid UTF-8.

create external table hits_partitioned(
    "UserID" bigint,
    "MobilePhoneModel" text
)
STORED AS PARQUET
LOCATION 'hits_partitioned';

https://github.com/apache/arrow-datafusion/blob/1a0542acbc01e5243471ae0fc3586c2f1f40013b/datafusion/sql/src/statement.rs#L636-L640

@alamb
Copy link
Contributor Author

alamb commented Jul 25, 2023

I think that we can do this safely when we are certain that the binary data is valid UTF-8.

I agree -- I think it would be good to cast the column into UTF8 (erroring if it contains non UTF8 data)

For this particular query we could add an explicit cast like ::varchar perhaps like the following

SELECT 
  "MobilePhoneModel"::varchar, 
  COUNT(DISTINCT "UserID") AS u 
FROM hits_partitioned 
WHERE "MobilePhoneModel" <> '' 
GROUP BY "MobilePhoneModel" 
ORDER BY u DESC LIMIT 10;

@jonahgao
Copy link
Member

For this particular query we could add an explicit cast like ::varchar perhaps like the following

This must be the best way 👍 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants