Change mapping of SQL VARCHAR from Utf8 to Utf8View #15096

Open
@alamb

Is your feature request related to a problem or challenge?

DataFusion uses Arrow types internally. Thus, when planning SQL queries, there is a mapping from SQL types to Arrow types. The current mapping for character types is shown in the docs: https://datafusion.apache.org/user-guide/sql/data_types.html#character-types

| SQL DataType | Arrow DataType |
|--------------|----------------|
| CHAR         | Utf8           |
| VARCHAR      | Utf8           |
| TEXT         | Utf8           |
| STRING       | Utf8           |

So when you run something like create table foo(x varchar); the x column is Utf8:

DataFusion CLI v46.0.0
> create table foo(x varchar);

0 row(s) fetched.
Elapsed 0.019 seconds.

> describe foo;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| x           | Utf8      | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.008 seconds.

When reading Parquet files, however, a different type, Utf8View, is used because it is faster in most cases.

This can be seen in this example:

DataFusion CLI v46.0.0
> describe 'hits.parquet';
+-----------------------+-----------+-------------+
| column_name           | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID               | Int64     | NO          |
| JavaEnable            | Int16     | NO          |
| Title                 | Utf8View  | NO          |
...
+-----------------------+-----------+-------------+
105 row(s) fetched.
Elapsed 0.032 seconds.

Thus there is a discrepancy when creating external tables with an explicit schema: a column declared as VARCHAR will use Utf8 rather than Utf8View.
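To make the discrepancy concrete, here is a minimal Python sketch that models it. It is not DataFusion code: the dictionary simply copies the character-type mapping from the table above, and `find_type_mismatches` is a hypothetical helper invented for illustration. The Parquet side is represented by a plain dict holding the type reported by `describe 'hits.parquet'`.

```python
# Current SQL -> Arrow mapping for character types, copied from the table above.
# This is a model of the behavior described in this issue, not DataFusion's API.
SQL_TO_ARROW = {"CHAR": "Utf8", "VARCHAR": "Utf8", "TEXT": "Utf8", "STRING": "Utf8"}


def find_type_mismatches(declared_sql_types, file_arrow_types):
    """Return (column, planned_type, file_type) for columns whose types disagree.

    Hypothetical helper: `declared_sql_types` is the schema written in
    CREATE EXTERNAL TABLE, `file_arrow_types` is what the file reader reports.
    """
    mismatches = []
    for column, sql_type in declared_sql_types.items():
        planned = SQL_TO_ARROW[sql_type.upper()]
        actual = file_arrow_types.get(column)
        if actual is not None and actual != planned:
            mismatches.append((column, planned, actual))
    return mismatches


# Schema declared in SQL versus the type seen when reading the Parquet file.
declared = {"Title": "VARCHAR"}
from_parquet = {"Title": "Utf8View"}

print(find_type_mismatches(declared, from_parquet))
# [('Title', 'Utf8', 'Utf8View')]
```

The declared VARCHAR column is planned as Utf8 while the file reader yields Utf8View, which is the mismatch described above.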

I believe this is the root cause of the issue @zhuqi-lucas filed:

Describe the solution you'd like

I think we should consider changing the default SQL mapping of VARCHAR from Utf8 to Utf8View.
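As a rough illustration of the proposed change, the sketch below models the mapping as a lookup with a switch. The function `arrow_type_for` and its `varchar_as_view` parameter are hypothetical names made up for this example; they are not DataFusion APIs or configuration options, and only VARCHAR is changed because that is the only type this proposal names.

```python
# Current mapping for character types, as documented in the user guide.
CURRENT_MAPPING = {"CHAR": "Utf8", "VARCHAR": "Utf8", "TEXT": "Utf8", "STRING": "Utf8"}


def arrow_type_for(sql_type, varchar_as_view=False):
    """Return the Arrow type name for a SQL character type.

    `varchar_as_view` is a hypothetical switch standing in for the proposed
    new default; with it enabled, VARCHAR maps to Utf8View.
    """
    name = sql_type.upper()
    if varchar_as_view and name == "VARCHAR":
        return "Utf8View"
    return CURRENT_MAPPING[name]


print(arrow_type_for("VARCHAR"))                        # Utf8 (today's behavior)
print(arrow_type_for("VARCHAR", varchar_as_view=True))  # Utf8View (proposed)
```

With the switch on, a column declared as VARCHAR would get the same type that the Parquet reader already produces, which removes the mismatch for that type.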

Describe alternatives you've considered

There are a few subtasks required before we can merge it:

Additional context

You can see some of the history related to using string view / Utf8View here:
