Skip to content

Change default SQL mapping for VARCAHR from Utf8 to Utf8View #16142

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
May 30, 2025

Conversation

zhuqi-lucas
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas commented May 22, 2025

Which issue does this PR close?

Rationale for this change

Setting Varchar default mapping to utf8view, and change testing result.

What changes are included in this PR?

Setting Varchar default mapping to utf8view, and change testing result.

Are these changes tested?

Yes, new testing/slt testing result generated.

Are there any user-facing changes?

No, the internal datatype will change to utf8view.

@zhuqi-lucas zhuqi-lucas marked this pull request as draft May 22, 2025 06:33
@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate labels May 22, 2025
@zhuqi-lucas
Copy link
Contributor Author

Interesting, sqllogic testing successfully for my local mac but CI failed...

@zhuqi-lucas zhuqi-lucas force-pushed the varchar_default_ut8view branch from 6ee857a to 0780263 Compare May 22, 2025 08:17
@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review May 26, 2025 12:14
@zhuqi-lucas zhuqi-lucas changed the title POC Varchar default mapping to utf8view feat: varchar default mapping to utf8view May 26, 2025
@zhuqi-lucas
Copy link
Contributor Author

Resolved all problems now including the slt testing.

I think the PR is ready for benchmark progress to see if switching to Utf8View by default slows some queries down. cc @alamb Thanks a lot.

@alamb
Copy link
Contributor

alamb commented May 27, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing varchar_default_ut8view (5d9c537) to f3aed4a diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented May 27, 2025

Resolved all problems now including the slt testing.

I think the PR is ready for benchmark progress to see if switching to Utf8View by default slows some queries down. cc @alamb Thanks a lot.

THanks @zhuqi-lucas -- I have started it up

@alamb
Copy link
Contributor

alamb commented May 27, 2025

🤖: Benchmark completed

Details

Comparing HEAD and varchar_default_ut8view
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ varchar_default_ut8view ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  1898.26ms │               1810.38ms │     no change │
│ QQuery 1     │   658.97ms │                670.82ms │     no change │
│ QQuery 2     │  1406.85ms │               1396.48ms │     no change │
│ QQuery 3     │   709.94ms │                713.90ms │     no change │
│ QQuery 4     │  1429.02ms │               1398.58ms │     no change │
│ QQuery 5     │ 15171.65ms │              15018.76ms │     no change │
│ QQuery 6     │  2057.97ms │               1992.46ms │     no change │
│ QQuery 7     │  2336.53ms │               2061.27ms │ +1.13x faster │
│ QQuery 8     │   839.18ms │                837.10ms │     no change │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 26508.38ms │
│ Total Time (varchar_default_ut8view)   │ 25899.76ms │
│ Average Time (HEAD)                    │  2945.38ms │
│ Average Time (varchar_default_ut8view) │  2877.75ms │
│ Queries Faster                         │          1 │
│ Queries Slower                         │          0 │
│ Queries with No Change                 │          8 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ varchar_default_ut8view ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    15.20ms │                 15.35ms │     no change │
│ QQuery 1     │    33.35ms │                 33.29ms │     no change │
│ QQuery 2     │    83.06ms │                 79.39ms │     no change │
│ QQuery 3     │    96.03ms │                 95.15ms │     no change │
│ QQuery 4     │   603.39ms │                590.48ms │     no change │
│ QQuery 5     │   851.82ms │                855.46ms │     no change │
│ QQuery 6     │    23.71ms │                 22.55ms │     no change │
│ QQuery 7     │    36.83ms │                 37.07ms │     no change │
│ QQuery 8     │   933.66ms │                916.79ms │     no change │
│ QQuery 9     │  1192.79ms │               1193.28ms │     no change │
│ QQuery 10    │   260.86ms │                262.38ms │     no change │
│ QQuery 11    │   299.90ms │                293.48ms │     no change │
│ QQuery 12    │   906.00ms │                898.14ms │     no change │
│ QQuery 13    │  1344.05ms │               1220.93ms │ +1.10x faster │
│ QQuery 14    │   842.98ms │                823.52ms │     no change │
│ QQuery 15    │   842.44ms │                835.95ms │     no change │
│ QQuery 16    │  1725.19ms │               1718.98ms │     no change │
│ QQuery 17    │  1598.00ms │               1614.71ms │     no change │
│ QQuery 18    │  3092.64ms │               3093.01ms │     no change │
│ QQuery 19    │    81.09ms │                 81.79ms │     no change │
│ QQuery 20    │  1111.88ms │               1115.69ms │     no change │
│ QQuery 21    │  1298.42ms │               1296.34ms │     no change │
│ QQuery 22    │  2165.31ms │               2174.91ms │     no change │
│ QQuery 23    │  8034.33ms │               7921.79ms │     no change │
│ QQuery 24    │   458.46ms │                457.26ms │     no change │
│ QQuery 25    │   391.76ms │                372.64ms │     no change │
│ QQuery 26    │   528.56ms │                519.61ms │     no change │
│ QQuery 27    │  1595.08ms │               1586.78ms │     no change │
│ QQuery 28    │ 12566.00ms │              12405.74ms │     no change │
│ QQuery 29    │   535.24ms │                532.68ms │     no change │
│ QQuery 30    │   796.06ms │                807.39ms │     no change │
│ QQuery 31    │   863.55ms │                857.96ms │     no change │
│ QQuery 32    │  2633.13ms │               2621.41ms │     no change │
│ QQuery 33    │  3300.20ms │               3347.80ms │     no change │
│ QQuery 34    │  3429.42ms │               3335.38ms │     no change │
│ QQuery 35    │  1338.57ms │               1340.32ms │     no change │
│ QQuery 36    │   123.48ms │                118.34ms │     no change │
│ QQuery 37    │    55.36ms │                 56.67ms │     no change │
│ QQuery 38    │   119.30ms │                117.49ms │     no change │
│ QQuery 39    │   191.24ms │                187.82ms │     no change │
│ QQuery 40    │    46.61ms │                 47.57ms │     no change │
│ QQuery 41    │    45.34ms │                 44.25ms │     no change │
│ QQuery 42    │    39.10ms │                 38.06ms │     no change │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 56529.39ms │
│ Total Time (varchar_default_ut8view)   │ 55985.60ms │
│ Average Time (HEAD)                    │  1314.64ms │
│ Average Time (varchar_default_ut8view) │  1301.99ms │
│ Queries Faster                         │          1 │
│ Queries Slower                         │          0 │
│ Queries with No Change                 │         42 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃     HEAD ┃ varchar_default_ut8view ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │ 122.52ms │                121.90ms │ no change │
│ QQuery 2     │  22.09ms │                 22.25ms │ no change │
│ QQuery 3     │  33.67ms │                 34.12ms │ no change │
│ QQuery 4     │  19.78ms │                 19.82ms │ no change │
│ QQuery 5     │  50.90ms │                 52.52ms │ no change │
│ QQuery 6     │  11.94ms │                 11.89ms │ no change │
│ QQuery 7     │  94.09ms │                 93.62ms │ no change │
│ QQuery 8     │  25.99ms │                 25.51ms │ no change │
│ QQuery 9     │  60.34ms │                 59.78ms │ no change │
│ QQuery 10    │  56.02ms │                 56.36ms │ no change │
│ QQuery 11    │  11.44ms │                 11.54ms │ no change │
│ QQuery 12    │  41.27ms │                 41.51ms │ no change │
│ QQuery 13    │  27.81ms │                 27.49ms │ no change │
│ QQuery 14    │   9.53ms │                  9.54ms │ no change │
│ QQuery 15    │  23.84ms │                 24.62ms │ no change │
│ QQuery 16    │  21.35ms │                 22.00ms │ no change │
│ QQuery 17    │  95.29ms │                 95.13ms │ no change │
│ QQuery 18    │ 207.37ms │                212.10ms │ no change │
│ QQuery 19    │  26.88ms │                 26.29ms │ no change │
│ QQuery 20    │  38.11ms │                 37.17ms │ no change │
│ QQuery 21    │ 164.65ms │                161.44ms │ no change │
│ QQuery 22    │  15.92ms │                 16.45ms │ no change │
└──────────────┴──────────┴─────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 1180.80ms │
│ Total Time (varchar_default_ut8view)   │ 1183.07ms │
│ Average Time (HEAD)                    │   53.67ms │
│ Average Time (varchar_default_ut8view) │   53.78ms │
│ Queries Faster                         │         0 │
│ Queries Slower                         │         0 │
│ Queries with No Change                 │        22 │
└────────────────────────────────────────┴───────────┘

@zhuqi-lucas
Copy link
Contributor Author

Thank you @alamb , it seems no regression from the benchmark result!

@zhuqi-lucas zhuqi-lucas force-pushed the varchar_default_ut8view branch from 5d9c537 to e54e432 Compare May 27, 2025 10:54
@github-actions github-actions bot added the functions Changes to functions implementation label May 27, 2025
@alamb alamb added the api change Changes the API exposed to users of the crate label May 27, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @zhuqi-lucas -- this is great

FYI @jayzhan211 , @timsaucer @xudong963 and @comphead

I have a few suggestions on how to simplify the examples which would be nice to get in prior to merge but I don't think it is needed

// Assuming the input columns are
// column[0]: customer_id / UTF8
// column[0]: customer_id / UTF8 or UTF8View
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the example would be simpler if it just used Utf8View. Could we do that and avoid adding the logic to handle both types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! Thank you @alamb , addressed in latest PR.

@@ -15,6 +15,10 @@
# specific language governing permissions and limitations
# under the License.

# Currently, the avro not support Utf8View type, so we disable the map_varchar_to_utf8view
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

parquet-testing Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There appears to be an update of parquet-testing with unrelated content, but I think it is ok to update it too

@@ -182,7 +182,11 @@ async fn write_out(ctx: &SessionContext) -> std::result::Result<(), DataFusionEr
let mut df = ctx.sql("values ('a'), ('b'), ('c')").await.unwrap();

// Ensure the column names and types match the target table
df = df.with_column_renamed("column1", "tablecol1").unwrap();
df = df
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better if we could change the example to maybe just use a StringViewArray in the first place (rather than have to cast it here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! Thank you @alamb , addressed in latest PR.

@alamb alamb changed the title feat: varchar default mapping to utf8view Change default SQL mapping for VARCAHR from Utf8 to Utf8View May 27, 2025
@zhuqi-lucas
Copy link
Contributor Author

Thank you @zhuqi-lucas -- this is great

FYI @jayzhan211 , @timsaucer @xudong963 and @comphead

I have a few suggestions on how to simplify the examples which would be nice to get in prior to merge but I don't think it is needed

Thank you @alamb for review, i also addressed the suggestions about example in latest PR.

@alamb
Copy link
Contributor

alamb commented May 28, 2025

I'll plan to merge this tomorrow unless I hear anything different

@zhuqi-lucas zhuqi-lucas force-pushed the varchar_default_ut8view branch from e7243d4 to 9195b6a Compare May 29, 2025 10:18
@xudong963
Copy link
Member

I'm reviewing

Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhuqi-lucas
Copy link
Contributor Author

Thank you @xudong963 for review!

@xudong963
Copy link
Member

Let's go, thank you all

@xudong963 xudong963 merged commit 21248fb into apache:main May 30, 2025
28 checks passed
@alamb
Copy link
Contributor

alamb commented May 30, 2025

Thanks again @zhuqi-lucas and @xudong963

I have also made a PR to add a mention of this change to the upgrade guide:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate common Related to common crate core Core DataFusion crate documentation Improvements or additions to documentation functions Changes to functions implementation sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change mapping of SQL VARCHAR from Utf8 to Utf8View
3 participants