Description
Is your feature request related to a problem or challenge?
DataFusion uses Arrow types internally. Thus when planning SQL queries there is a mapping from SQL types to Arrow Types. The current mapping for character types is shown in the docs https://datafusion.apache.org/user-guide/sql/data_types.html#character-types
SQL DataType | Arrow DataType |
---|---|
CHAR |
Utf8 |
VARCHAR |
Utf8 |
TEXT |
Utf8 |
STRING |
Utf8 |
So this means that when you do something like create table foo(x varchar);
the x
column is Utf8
DataFusion CLI v46.0.0
> create table foo(x varchar);
0 row(s) fetched.
Elapsed 0.019 seconds.
> describe foo;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| x | Utf8 | YES |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.008 seconds.
When reading parquet files however, a different type, Utf8View
is used as it is faster in most cases.
This can be seen in this example:
DataFusion CLI v46.0.0
> describe 'hits.parquet';
+-----------------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID | Int64 | NO |
| JavaEnable | Int16 | NO |
| Title | Utf8View | NO |
...
+-----------------------+-----------+-------------+
105 row(s) fetched.
Elapsed 0.032 seconds.
Thus there is a discrepancy when creating external tables with a schema (VARCHAR
) as that will use Utf8 rather than UTF8View
I believe this is the root cause of the issue @zhuqi-lucas filed:
Describe the solution you'd like
I think we should consider changing the default SQL mapping from VARCHAR
--> Utf8View
Describe alternatives you've considered
- @zhuqi-lucas has a PR that does this: BUG: schema_force_view_type configuration not working for CREATE EXTERNAL TABLE #14922
There are a few subtasks required before we can merge it:
- feat: Support Utf8View in JSON reader arrow-rs#7244
- Support datatype cast for insert api same as insert into sql #15015
- Update the SQL documentation to refer to the changed mapping
- Support approx_distinct for Utf8View
- feat: topk functionality for aggregates should support utf8view and largeutf8 #15152
- Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving merging on a single
Utf8View
) #15403
Additional context
You can see some of the history related to using string view / Utf8View here: