Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Internal error: The "character_length" function can only accept strings #7344

Closed
JayjeetAtGithub opened this issue Aug 21, 2023 · 10 comments · Fixed by #7840
Closed

Internal error: The "character_length" function can only accept strings #7344

JayjeetAtGithub opened this issue Aug 21, 2023 · 10 comments · Fixed by #7840
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@JayjeetAtGithub
Copy link
Contributor

JayjeetAtGithub commented Aug 21, 2023

Describe the bug

On running the query below on the Clickbench multi file dataset,

SELECT "CounterID", AVG(length("URL")) AS l, COUNT(*) AS c FROM hits WHERE "URL" <> '' GROUP BY "CounterID" HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;

we get this error,

Internal error: The "character_length" function can only accept strings.. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

To Reproduce

Download the data using,

 ./benchmarks/bench.sh data clickbench_partitioned

A hits_multi directory with the parquet files will be created.

Execute the above queries,

datafusion-cli -c "CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'hits_multi';" "{query}"

Expected behavior

The queries should run successfully without erroring.

Additional context

Datafusion 29.0.0

@JayjeetAtGithub JayjeetAtGithub added the bug Something isn't working label Aug 21, 2023
@parkma99
Copy link
Contributor

It looks simple to fix, I will take it.

@alamb
Copy link
Contributor

alamb commented Aug 21, 2023

This looks similar to #7039 which @jonahgao fixed by adding a coercion from binary --> UTF8 for comparison. I think we could do something similar here.

@alamb
Copy link
Contributor

alamb commented Aug 21, 2023

Marking as a good first issue as there is a reproducer and I think the fix should be relatively straightforward

@alamb
Copy link
Contributor

alamb commented Aug 21, 2023

FWIW #7355 (review) will also make this a non internal error (though @parkma99 's real fix would be great)

@jhorstmann
Copy link
Contributor

I'm wondering if an implicit cast is really the right solution here. Maybe a better solution would be to declare the schema / override the type in the CREATE TABLE statement.

Some other databases also have CHAR_LENGTH return the number of bytes on binary inputs: snowflake LENGTH function, Databricks CHARACTER_LENGTH

@jonahgao
Copy link
Member

Declaring the datatype of URL as UTF8 is a quick fix.
Some DBMSs do this in Clickbench.


But I think it might also be necessary to make the LENGTH function support binaries.
PostgreSQL also supports it.

psql=> create table foo(c bytea);
CREATE TABLE
psql=> insert into foo values('bar');
INSERT 0 1
psql=> select length(c) from foo;
 length
--------
      3
(1 row)

@tustvold
Copy link
Contributor

tustvold commented Aug 23, 2023

I personally find the notion of a function called character_length operating on binary data somewhat perverse, binary data by definition doesn't have a notion of a character.

Perhaps we could have a character length function that operates on strings, and a separate length kernel that returns the number of bytes - i.e. https://docs.rs/arrow-string/latest/arrow_string/length/fn.length.html ? A brief google search suggests this notion is normally called OCTET_LENGTH.

This would be consistent with postgres

I'm wondering if an implicit cast is really the right solution here.

I would second this, it feels like the clickbench setup may not be quite correct

PostgreSQL also supports it.

The docs would suggest that what is happening here is it is coercing the arguments to a string, I'll try to confirm this

@jonahgao
Copy link
Member

The docs would suggest that what is happening here is it is coercing the arguments to a string, I'll try to confirm this

I have just checked the documentation and found that the length function in PostgresSQL has two different forms:

length(bytea) → integer
    Returns the number of bytes in the binary string.
    length('\x1234567890'::bytea) → 5

length(bytes bytea, encoding name) → integer
   Returns the number of characters in the binary string, assuming that it is text in the given encoding.
   length('jose'::bytea, 'UTF8') → 4

It seems that there will be no type coercion happening.


I think we might need three different types of length functions.

  • character_length only operates on text strings.
  • octet_length can be applied to both text strings and binaries, and it returns the length of the underlying byte array.
  • length can be applied to both text strings and binaries.
    • For text strings, it has the same effect as character_length.
    • For binaries, it has the same effect as octet_length.

The following are the execution results of PostgreSQL for reference only.

postgres=#  create table text_table(c text);
CREATE TABLE
postgres=#  insert into text_table values('abc'), ('efg你好');
INSERT 0 2
postgres=# select length(c), octet_length(c), character_length(c) from text_table;
 length | octet_length | character_length
--------+--------------+------------------
      3 |            3 |                3
      5 |            9 |                5
(2 rows)

postgres=#
postgres=#
postgres=# create table binary_table(c bytea);
CREATE TABLE
postgres=#
postgres=# insert into binary_table values('abc'), ('\x00010203'::bytea);
INSERT 0 2
postgres=# select length(c), octet_length(c) from binary_table;
 length | octet_length
--------+--------------
      3 |            3
      4 |            4
(2 rows)

postgres=# select character_length(c) from binary_table;
ERROR:  function character_length(bytea) does not exist
LINE 1: select character_length(c) from binary_table;
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

@jonahgao
Copy link
Member

jonahgao commented Aug 23, 2023

I'm wondering if an implicit cast is really the right solution here. Maybe a better solution would be to declare the schema / override the type in the CREATE TABLE statement.

Yes, I think the simplest fix is to rewrite the query as

SELECT "CounterID", AVG(length(URL::TEXT)) AS l, COUNT(*) AS c FROM hits WHERE "URL" <> '' GROUP BY "CounterID" HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;

We actually need character_length of text string here, so that the results should be consistent with others.
Calculating the length of a binary type may yield different results.

@alamb
Copy link
Contributor

alamb commented Aug 23, 2023

I personally find the notion of a function called character_length operating on binary data somewhat perverse, binary data by definition doesn't have a notion of a character.

Yes, this is why I think automatically coercing BinaryArray --> StringArray makes the most sense as during that process the binary data will be determined to be a string and then characters make sense

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers
Projects
None yet
6 participants