[HUDI-7069] Optimize metaclient construction and include table config… #10048
Conversation
Force-pushed from 3b9c187 to 4d6cab1.
  @Override
  void run() {
    HoodieClusteringJob.Config clusteringConfig = new HoodieClusteringJob.Config();
    clusteringConfig.basePath = basePath;
    clusteringConfig.parallelism = parallelism;
    clusteringConfig.runningMode = clusteringMode;
-   new HoodieClusteringJob(jsc, clusteringConfig, props).cluster(retry);
+   new HoodieClusteringJob(jsc, clusteringConfig, props, metaClient).cluster(retry);
Be cautious that the timeline should be refreshed each time for compaction when the metaClient is reused.
That's right. In the existing code, the metaclient is reloaded during clustering, but a new metaclient is created during compaction. A better implementation would have consistent behavior for compaction as well: the metaclient should be reloaded at the point of creation.
private int doScheduleAndCluster(JavaSparkContext jsc) throws Exception {
  LOG.info("Step 1: Do schedule");
  metaClient = HoodieTableMetaClient.reload(metaClient);
  ...
}

private int doCompact(JavaSparkContext jsc) throws Exception {
  ...
  if (StringUtils.isNullOrEmpty(cfg.compactionInstantTime)) {
    HoodieTableMetaClient metaClient = UtilHelpers.createMetaClient(jsc, cfg.basePath, true);
    ...
  }
}
I will make modifications here.
In my opinion, the main overhead of instantiating the meta client is the loading and decoding of the hoodie instants, which includes file listing and deciphering of the commit metadata. Refreshing the hoodie instants for each round of table service scheduling is necessary because the latest metadata is required. So I don't think we gain much by reusing the instance.
If there are some configuration inconsistencies, we can just fix them.
… in write config for multi-table services.
Force-pushed from f29ec5f to c168e2b.
@danny0405: You are right. The original intention of passing the meta client was not performance optimization, but rather to obtain the table config when unifying the table config and write config early on. However, during coding it turned out that this code constructed the meta client redundantly in some cases, so that part was also modified, simplifying the code and bringing a small performance benefit.
@@ -166,6 +168,9 @@ public static TableServicePipeline buildTableServicePipeline(JavaSparkContext jsc,
                                                               HoodieMultiTableServicesMain.Config cfg,
                                                               TypedProperties props) {
    TableServicePipeline pipeline = new TableServicePipeline();
+   HoodieTableMetaClient metaClient = UtilHelpers.createMetaClient(jsc, basePath, true);
+   // Add the table config to the write config.
+   props.putAll(metaClient.getTableConfig().getProps());
Should we use the table config to overwrite configs already set in the write config by the user? Not sure which one has higher priority here. @danny0405 What do you think?
This can indeed lead to a priority issue here. A simple solution is to use addNecessaryTableConfigToWriteConfig to add only the necessary parameters.
Are you referring to the per-table options? We should always consider write config options with higher priority.
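The priority question above can be modeled with plain java.util.Properties: layering the table config first and then the user's write config lets write-config options win on conflicts while table-only options still flow through. This is an illustrative sketch of the merge semantics under discussion, not Hudi's actual logic; mergeWithWritePriority is a hypothetical helper name.

```java
import java.util.Properties;

public class ConfigMergeSketch {

    // Hypothetical helper: table config as the base layer, user write
    // config layered on top so user-set options always take precedence.
    static Properties mergeWithWritePriority(Properties writeProps, Properties tableProps) {
        Properties merged = new Properties();
        merged.putAll(tableProps);   // table-only options become available
        merged.putAll(writeProps);   // user write config overrides on conflict
        return merged;
    }

    public static void main(String[] args) {
        Properties tableProps = new Properties();
        tableProps.setProperty("hoodie.table.name", "test_tb");
        tableProps.setProperty("hoodie.datasource.write.recordkey.field", "id");

        Properties writeProps = new Properties();
        // Explicit user choice that must not be overwritten by table config.
        writeProps.setProperty("hoodie.datasource.write.recordkey.field", "uuid");

        Properties merged = mergeWithWritePriority(writeProps, tableProps);
        System.out.println(merged.getProperty("hoodie.table.name"));                       // test_tb
        System.out.println(merged.getProperty("hoodie.datasource.write.recordkey.field")); // uuid
    }
}
```

The same layering could be restricted to a whitelist of "necessary" table options to implement the addNecessaryTableConfigToWriteConfig idea.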
… in write config for multi-table services.
In the current implementation of multi-table services, the clustering task and the compaction task each build a metaclient repeatedly for every table, causing additional overhead. To reduce this overhead, we extract the construction of the metaclient, construct it only once per table, and pass it as a parameter to the corresponding task.
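The once-per-table reuse can be sketched with a stand-in for the meta client whose constructor models the expensive part (file listing, decoding instants). All names here are illustrative, not Hudi APIs:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ReuseSketch {

    static final AtomicInteger constructions = new AtomicInteger();

    // Stand-in for HoodieTableMetaClient: construction is the costly step,
    // so we count how many times it happens.
    static class MetaClient {
        final String basePath;
        MetaClient(String basePath) {
            this.basePath = basePath;
            constructions.incrementAndGet();
        }
    }

    static void runClustering(MetaClient mc) { /* uses mc.basePath */ }
    static void runCompaction(MetaClient mc)  { /* uses mc.basePath */ }

    public static void main(String[] args) {
        // Before: each task built its own meta client for the table.
        // After: construct once per table and pass it to both tasks.
        MetaClient mc = new MetaClient("/tmp/tbl");
        runClustering(mc);
        runCompaction(mc);
        System.out.println(constructions.get()); // prints 1 instead of 2
    }
}
```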
At the same time, when running multi-table services, the write config lacks some information from the table config, such as the table name. This leads to empty strings when retrieving the table name in certain situations. For example, when configuring the prefix for metrics, the table name is used as the prefix if none is specified. Without the table config, however, the prefix ends up empty and the metrics of different tables cannot be differentiated. By adding the table config to the write config beforehand, all configuration information is available in the subsequent write path.
Taking a metrics example, it would be displayed as follows before the fix:
".replacecommit.totalFilesInsert":"2"
After the fix, it would be displayed as follows:
"test_tb.replacecommit.totalFilesInsert":"2"
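The before/after metric names can be reproduced with a tiny model of the prefix fallback described above (illustrative only; metricName is a hypothetical stand-in for Hudi's metrics naming):

```java
public class MetricPrefixSketch {

    // If no explicit metrics prefix is configured, fall back to the table
    // name. When the table config is missing, tableName is "" and the
    // resulting metric name is left with a bare leading dot.
    static String metricName(String prefix, String tableName, String metric) {
        String effective = (prefix == null || prefix.isEmpty()) ? tableName : prefix;
        return effective + "." + metric;
    }

    public static void main(String[] args) {
        // Before the fix: table config absent, so the table name is empty.
        System.out.println(metricName("", "", "replacecommit.totalFilesInsert"));
        // prints .replacecommit.totalFilesInsert

        // After the fix: table name available from the table config.
        System.out.println(metricName("", "test_tb", "replacecommit.totalFilesInsert"));
        // prints test_tb.replacecommit.totalFilesInsert
    }
}
```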
Additionally, we made a small modification by removing the redundant construction of the metaclient in the HoodieClusteringJob constructor.
Change Logs
We now construct the metaclient only once per table and add the table config to the write config so that all necessary information is available. Furthermore, the unnecessary construction of the metaclient in the HoodieClusteringJob constructor has been removed.
Impact
None
Risk level (write none, low medium or high below)
None
Documentation Update
None
Contributor's checklist