-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Internal error: The "character_length" function can only accept strings #7344
Comments
It looks simple to fix, I will take it. |
Marking as a good first issue as there is a reproducer and I think the fix should be relatively straightforward |
FWIW #7355 (review) will also make this a non internal error (though @parkma99 's real fix would be great) |
I'm wondering if an implicit cast is really the right solution here. Maybe a better solution would be to declare the schema / override the type in the Some other databases also have |
Declaring the datatype of
But I think it might also be necessary to make the LENGTH function support binaries.
|
I personally find the notion of a function called character_length operating on binary data somewhat perverse, binary data by definition doesn't have a notion of a character. Perhaps we could have a character length function that operates on strings, and a separate length kernel that returns the number of bytes - i.e. https://docs.rs/arrow-string/latest/arrow_string/length/fn.length.html ? A brief google search suggests this notion is normally called OCTET_LENGTH. This would be consistent with postgres
I would second this, it feels like the clickbench setup may not be quite correct
The docs would suggest that what is happening here is it is coercing the arguments to a string, I'll try to confirm this |
I have just checked the documentation and found that the length function in PostgresSQL has two different forms:
It seems that there will be no type coercion happening. I think we might need three different types of length functions.
The following are the execution results of PostgreSQL for reference only.
|
Yes, I think the simplest fix is to rewrite the query as SELECT "CounterID", AVG(length(URL::TEXT)) AS l, COUNT(*) AS c FROM hits WHERE "URL" <> '' GROUP BY "CounterID" HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25; We actually need |
Yes, this is why I think automatically coercing |
Describe the bug
On running the query below on the Clickbench multi file dataset,
we get this error,
To Reproduce
Download the data using,
A
hits_multi
directory with the parquet files will be created.Execute the above queries,
Expected behavior
The queries should run successfully without erroring.
Additional context
Datafusion 29.0.0
The text was updated successfully, but these errors were encountered: