tispark 3.x version read speed is slower than tispark 2.5 #2573

Closed
wfxxh opened this issue Oct 20, 2022 · 2 comments · Fixed by #2578

Comments

@wfxxh
Contributor

wfxxh commented Oct 20, 2022

TiDB version: 5.4.2
Spark version: 3.0 to 3.2
TiSpark version: 3.0 to 3.1

I used TiSpark 2.5 before. After upgrading to TiSpark 3.x, I found that reading from TiKV is slower than with TiSpark 2.5.

table info :

image

spark conf :

image

tispark 2.5:

image

tispark 3.x:

image

@wfxxh
Contributor Author

wfxxh commented Oct 31, 2022

I have found the reason. In v3.x, nothing calls the 'StatisticsManager.loadStatisticsInfo' method, so the 'statisticsMap' in StatisticsManager is never filled. As a result, inside the method 'TiStrategy.filterToDAGRequest', the call val tblStatistics: TableStatistics = StatisticsManager.getTableStatistics(source.table.getId) returns null, so 'TiKVScanAnalyzer.buildIndexScan' cannot return a correct value.

CREATE TABLE `perio_art_project` (
  `record_id` int(11) DEFAULT NULL,
  `article_id` varchar(255) DEFAULT NULL,
  `project_seq` int(11) DEFAULT NULL,
  `project_id` varchar(255) DEFAULT NULL,
  `project_name` longtext DEFAULT NULL,
  `batch_id` int(11) DEFAULT NULL,
  `primary_partition` int(4) GENERATED ALWAYS AS ((crc32(`article_id`)) % 9999) STORED NOT NULL,
  `last_modify_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `spark_update_time` datetime DEFAULT NULL,
  KEY `article_id` (`article_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

sparkSession
  .sql(
    """select * from tidb_catalog.qk_chi.perio_art_project
      |""".stripMargin)
  .groupBy("project_name")
  .count()
  .explain()

== 2.5 Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[project_name#7], functions=[count(1)])
   +- Exchange hashpartitioning(project_name#7, 8192), true, [id=#14]
      +- HashAggregate(keys=[project_name#7], functions=[partial_count(1)])
         +- TiKV CoprocessorRDD{[table: perio_art_project] TableScan, Columns: project_name@VARCHAR(4294967295), KeyRange: [([t\200\000\000\000\000\000\003\017_r\000\000\000\000\000\000\000\000], [t\200\000\000\000\000\000\003\017_s\000\000\000\000\000\000\000\000])], startTs: 437048830927044635} EstimatedCount:18072492

== 3.x Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[project_name#7], functions=[specialsum(count(1)#34L, LongType, 0)])
   +- Exchange hashpartitioning(project_name#7, 8192), true, [id=#13]
      +- HashAggregate(keys=[project_name#7], functions=[partial_specialsum(count(1)#34L, LongType, 0)])
         +- TiSpark RegionTaskExec{downgradeThreshold=1000000000,downgradeFilter=[]
            +- TiKV FetchHandleRDD{[table: perio_art_project] IndexLookUp, Columns: project_name@VARCHAR(4294967295): { {IndexRangeScan(Index:article_id(article_id)): { RangeFilter: [], Range: [([t\200\000\000\000\000\000\003\017_i\200\000\000\000\000\000\000\001\000], [t\200\000\000\000\000\000\003\017_i\200\000\000\000\000\000\000\001\372])] }}; {TableRowIDScan, Aggregates: Count(1), First(project_name@VARCHAR(4294967295)), Group By: [project_name@VARCHAR(4294967295) ASC]} }, startTs: 437048801724203020}

Until this is fixed, users can set spark.tispark.plan.allow_index_read=false to avoid the index read.
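
A minimal sketch of applying that setting when the session is built, for anyone who wants to try the workaround. The object name and app name are made up for illustration, and the rest of the TiSpark setup (extensions, PD addresses, the tidb_catalog definition) is assumed to come from spark-defaults.conf as in the original environment. I have not verified at which point TiSpark reads this key, so setting it before the session is created is the conservative choice.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object; only the config key and the query come from this issue.
object DisableIndexReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("tispark-allow-index-read-off") // illustrative name
      // Workaround from this issue: do not let the planner pick an index read.
      .config("spark.tispark.plan.allow_index_read", "false")
      // Assumption: all other TiSpark settings are supplied externally.
      .getOrCreate()

    // Same query as above; print the physical plan to compare with the 2.5 plan.
    spark
      .sql("select * from tidb_catalog.qk_chi.perio_art_project")
      .groupBy("project_name")
      .count()
      .explain()

    spark.stop()
  }
}
```

The same key can also be passed on the command line with spark-submit --conf spark.tispark.plan.allow_index_read=false. If the setting takes effect, the plan should no longer contain the IndexLookUp node shown in the 3.x plan.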

@xuanyu66
Collaborator

xuanyu66 commented Nov 3, 2022

So sounds like it's a bug?
