Commit 7222f05 (1 parent: 7e62c3c)

Improve stats sample strategy.

18 files changed: +827 −201 lines

docs/en/docs/query-acceleration/statistics.md (8 additions, 3 deletions)

@@ -54,8 +54,7 @@ Syntax:
 ```SQL
 ANALYZE < TABLE | DATABASE table_name | db_name >
     [ (column_name [, ...]) ]
-    [ [ WITH SYNC ] [ WITH SAMPLE PERCENT | ROWS ] [ WITH SQL ] ]
-    [ PROPERTIES ("key" = "value", ...) ];
+    [ [ WITH SYNC ] [ WITH SAMPLE PERCENT | ROWS ] ];
 ```
 
 Where:
@@ -64,7 +63,6 @@ Where:
 - `column_name`: The specified target column. It must be an existing column in `table_name`. You can specify multiple column names separated by commas.
 - `sync`: Collect statistics synchronously. Returns after collection. If not specified, it executes asynchronously and returns a JOB ID.
 - `sample percent | rows`: Collect statistics with sampling. You can specify a sampling percentage or a number of sampling rows.
-- `sql`: Execute SQL to collect statistics for partitioned columns in external tables. By default, partitioned column information is collected from metadata, which is efficient but may not be accurate in terms of row count and data size. Users can specify using SQL to collect accurate partitioned column information.
 
 Here are some examples:
@@ -90,6 +88,13 @@ The collection jobs for statistics themselves consume a certain amount of system
 
 If you are concerned about automatic collection jobs interfering with your business, you can specify a time frame for the automatic collection jobs to run during low business loads by setting the `full_auto_analyze_start_time` and `full_auto_analyze_end_time` parameters according to your needs. You can also completely disable this feature by setting the `enable_full_auto_analyze` parameter to `false`.
 
+External catalogs do not take part in automatic collection by default, because they often contain massive amounts of historical data and analyzing them automatically could consume too many resources. You can turn automatic collection on or off for an external catalog by setting the catalog's properties:
+
+```sql
+ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='true');  -- turn on
+ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='false'); -- turn off
+```
+
 <br />

docs/zh-CN/docs/query-acceleration/statistics.md (8 additions, 3 deletions; Chinese content translated)

@@ -56,8 +56,7 @@ Doris supports manually triggering statistics collection by submitting an ANALYZE statement
 ```SQL
 ANALYZE < TABLE | DATABASE table_name | db_name >
     [ (column_name [, ...]) ]
-    [ [ WITH SYNC ] [ WITH SAMPLE PERCENT | ROWS ] [ WITH SQL ] ]
-    [ PROPERTIES ("key" = "value", ...) ];
+    [ [ WITH SYNC ] [ WITH SAMPLE PERCENT | ROWS ] ];
 ```
 
 Where:
@@ -66,7 +65,6 @@ ANALYZE < TABLE | DATABASE table_name | db_name >
 - column_name: The specified target column. It must be a column that exists in `table_name`; separate multiple column names with commas.
 - sync: Collect statistics synchronously and return when collection finishes. If not specified, the job runs asynchronously and returns a JOB ID.
 - sample percent | rows: Collect statistics by sampling. You can specify a sampling percentage or a number of sampled rows.
-- sql: Execute SQL to collect statistics on partition columns of external tables. By default, partition column information is collected from metadata, which is efficient but may be inaccurate for row count and data size. Users can specify SQL-based collection to obtain accurate partition column information.
 
 
 Here are some examples:
@@ -93,6 +91,13 @@ ANALYZE TABLE lineitem WITH SAMPLE ROWS 100000;
 
 If you are worried that automatic collection jobs may interfere with your business, you can set the `full_auto_analyze_start_time` and `full_auto_analyze_end_time` parameters so that automatic collection runs in a time window with low business load. You can also disable this feature entirely by setting `enable_full_auto_analyze` to `false`.
 
+External catalogs do not take part in automatic collection by default, because they often contain massive amounts of historical data and analyzing them automatically could consume too many resources. You can turn automatic collection on or off for an external catalog by setting the catalog's properties:
+
+```sql
+ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='true');  -- turn on
+ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='false'); -- turn off
+```
+
 <br />
 
 ## 2. Job management

fe/fe-core/src/main/java/org/apache/doris/catalog/OlapTable.java (14 additions, 0 deletions)

@@ -90,6 +90,7 @@
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.List;
+import java.util.Locale;
 import java.util.Map;
 import java.util.Objects;
 import java.util.Optional;
@@ -2371,4 +2372,17 @@ public void analyze(String dbName) {
             }
         }
     }
+
+    @Override
+    public boolean isDistributionColumn(String columnName) {
+        Set<String> distributeColumns = getDistributionColumnNames()
+                .stream().map(String::toLowerCase).collect(Collectors.toSet());
+        return distributeColumns.contains(columnName.toLowerCase(Locale.ROOT));
+    }
+
+    @Override
+    public boolean isPartitionColumn(String columnName) {
+        return getPartitionInfo().getPartitionColumns().stream()
+                .anyMatch(c -> c.getName().equalsIgnoreCase(columnName));
+    }
 }

fe/fe-core/src/main/java/org/apache/doris/catalog/TableIf.java (8 additions, 0 deletions)

@@ -254,5 +254,13 @@ default long getDataSize(boolean singleReplica) {
         // TODO: Each tableIf should impl it by itself.
         return 0;
     }
+
+    default boolean isDistributionColumn(String columnName) {
+        return false;
+    }
+
+    default boolean isPartitionColumn(String columnName) {
+        return false;
+    }
 }

fe/fe-core/src/main/java/org/apache/doris/catalog/external/HMSExternalTable.java (6 additions, 0 deletions)

@@ -699,6 +699,12 @@ public long getDataSize(boolean singleReplica) {
         }
         return total;
     }
+
+    @Override
+    public boolean isDistributionColumn(String columnName) {
+        return getRemoteTable().getSd().getBucketCols().stream().map(String::toLowerCase)
+                .collect(Collectors.toSet()).contains(columnName.toLowerCase(Locale.ROOT));
+    }
 }
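The new `isDistributionColumn` overrides above match column names case-insensitively and lowercase with `Locale.ROOT` so the comparison does not depend on the JVM's default locale (the classic Turkish dotless-i pitfall). A minimal standalone sketch of that pattern (class and method names here are hypothetical, and unlike the patch it normalizes both sides with `Locale.ROOT`):

```java
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: case-insensitive membership test for bucket/distribution
// column names, normalized with Locale.ROOT for locale-independent results.
public class ColumnMatchSketch {
    public static boolean isDistributionColumn(Set<String> bucketCols, String columnName) {
        Set<String> lowered = bucketCols.stream()
                .map(s -> s.toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
        return lowered.contains(columnName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> bucketCols = Set.of("L_ORDERKEY", "l_partkey");
        System.out.println(isDistributionColumn(bucketCols, "l_OrderKey")); // prints true
        System.out.println(isDistributionColumn(bucketCols, "l_comment"));  // prints false
    }
}
```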

fe/fe-core/src/main/java/org/apache/doris/statistics/AnalysisJob.java (13 additions, 0 deletions)

@@ -76,6 +76,16 @@ public synchronized void appendBuf(BaseAnalysisTask task, List<ColStatsData> sta
         queryingTask.remove(task);
         buf.addAll(statsData);
         queryFinished.add(task);
+        markOneTaskDone();
+    }
+
+    public synchronized void rowCountDone(BaseAnalysisTask task) {
+        queryingTask.remove(task);
+        queryFinished.add(task);
+        markOneTaskDone();
+    }
+
+    protected void markOneTaskDone() {
         queryFinishedTaskCount += 1;
         if (queryFinishedTaskCount == totalTaskCount) {
             writeBuf();
@@ -183,6 +193,9 @@ public void deregisterJob() {
     protected void syncLoadStats() {
         long tblId = jobInfo.tblId;
         for (BaseAnalysisTask task : queryFinished) {
+            if (task.info.externalTableLevelTask) {
+                continue;
+            }
             String colName = task.col.getName();
             if (!Env.getCurrentEnv().getStatisticsCache().syncLoadColStats(tblId, -1, colName)) {
                 analysisManager.removeColStatsStatus(tblId, colName);
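The refactor above funnels both completion paths (`appendBuf` for column-stats tasks, the new `rowCountDone` for row-count-only tasks) through a single `markOneTaskDone()`, so the "all tasks finished, flush the buffer" check lives in one place. A minimal sketch of that pattern, with hypothetical class and field names and a boolean flag standing in for `writeBuf()`:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a job that flushes once all of its tasks report done,
// regardless of which completion path each task takes.
public class JobSketch {
    private final int totalTaskCount;
    private int queryFinishedTaskCount = 0;
    private final List<String> buf = new ArrayList<>();
    private boolean flushed = false;

    public JobSketch(int totalTaskCount) {
        this.totalTaskCount = totalTaskCount;
    }

    // A column-stats task finished: buffer its result, then count it as done.
    public synchronized void appendBuf(String statsRow) {
        buf.add(statsRow);
        markOneTaskDone();
    }

    // A row-count-only task finished: nothing to buffer, but it still counts.
    public synchronized void rowCountDone() {
        markOneTaskDone();
    }

    private void markOneTaskDone() {
        queryFinishedTaskCount += 1;
        if (queryFinishedTaskCount == totalTaskCount) {
            flushed = true; // stands in for writeBuf()
        }
    }

    public synchronized boolean isFlushed() {
        return flushed;
    }
}
```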

fe/fe-core/src/main/java/org/apache/doris/statistics/AnalysisManager.java (1 addition, 1 deletion)

@@ -333,11 +333,11 @@ protected AnalysisInfo buildAndAssignJob(AnalyzeTblStmt stmt) throws DdlExceptio
         boolean isSync = stmt.isSync();
         Map<Long, BaseAnalysisTask> analysisTaskInfos = new HashMap<>();
         createTaskForEachColumns(jobInfo, analysisTaskInfos, isSync);
-        constructJob(jobInfo, analysisTaskInfos.values());
         if (!jobInfo.partitionOnly && stmt.isAllColumns()
                 && StatisticsUtil.isExternalTable(jobInfo.catalogId, jobInfo.dbId, jobInfo.tblId)) {
             createTableLevelTaskForExternalTable(jobInfo, analysisTaskInfos, isSync);
         }
+        constructJob(jobInfo, analysisTaskInfos.values());
         if (isSync) {
             syncExecute(analysisTaskInfos.values());
             updateTableStats(jobInfo);

fe/fe-core/src/main/java/org/apache/doris/statistics/BaseAnalysisTask.java (107 additions, 51 deletions)

@@ -36,53 +36,88 @@
 import org.apache.logging.log4j.LogManager;
 import org.apache.logging.log4j.Logger;
 
+import java.text.MessageFormat;
 import java.util.Collections;
 import java.util.concurrent.TimeUnit;
 
 public abstract class BaseAnalysisTask {
 
     public static final Logger LOG = LogManager.getLogger(BaseAnalysisTask.class);
 
-    protected static final String NDV_MULTIPLY_THRESHOLD = "0.3";
-
-    protected static final String NDV_SAMPLE_TEMPLATE = "case when NDV(`${colName}`)/count('${colName}') < "
-            + NDV_MULTIPLY_THRESHOLD
-            + " then NDV(`${colName}`) "
-            + "else NDV(`${colName}`) * ${scaleFactor} end AS ndv, "
-            ;
+    public static final long LIMIT_SIZE = 1024 * 1024 * 1024; // 1GB
+    public static final double LIMIT_FACTOR = 1.2;
 
     protected static final String COLLECT_COL_STATISTICS =
-            "SELECT CONCAT(${tblId}, '-', ${idxId}, '-', '${colId}') AS id, "
-            + " ${catalogId} AS catalog_id, "
-            + " ${dbId} AS db_id, "
-            + " ${tblId} AS tbl_id, "
-            + " ${idxId} AS idx_id, "
-            + " '${colId}' AS col_id, "
-            + " NULL AS part_id, "
-            + " COUNT(1) AS row_count, "
-            + " NDV(`${colName}`) AS ndv, "
-            + " COUNT(1) - COUNT(${colName}) AS null_count, "
-            + " CAST(MIN(${colName}) AS STRING) AS min, "
-            + " CAST(MAX(${colName}) AS STRING) AS max, "
-            + " ${dataSizeFunction} AS data_size, "
-            + " NOW() AS update_time "
-            + " FROM `${dbName}`.`${tblName}`";
-
-    protected static final String ANALYZE_PARTITION_COLUMN_TEMPLATE =
-            " SELECT "
-            + "CONCAT(${tblId}, '-', ${idxId}, '-', '${colId}') AS id, "
-            + "${catalogId} AS catalog_id, "
-            + "${dbId} AS db_id, "
-            + "${tblId} AS tbl_id, "
-            + "${idxId} AS idx_id, "
-            + "'${colId}' AS col_id, "
-            + "NULL AS part_id, "
-            + "${row_count} AS row_count, "
-            + "${ndv} AS ndv, "
-            + "${null_count} AS null_count, "
-            + "'${min}' AS min, "
-            + "'${max}' AS max, "
-            + "${data_size} AS data_size, "
+            "SELECT CONCAT(${tblId}, '-', ${idxId}, '-', '${colId}') AS `id`, "
+            + " ${catalogId} AS `catalog_id`, "
+            + " ${dbId} AS `db_id`, "
+            + " ${tblId} AS `tbl_id`, "
+            + " ${idxId} AS `idx_id`, "
+            + " '${colId}' AS `col_id`, "
+            + " NULL AS `part_id`, "
+            + " COUNT(1) AS `row_count`, "
+            + " NDV(`${colName}`) AS `ndv`, "
+            + " COUNT(1) - COUNT(${colName}) AS `null_count`, "
+            + " CAST(MIN(${colName}) AS STRING) AS `min`, "
+            + " CAST(MAX(${colName}) AS STRING) AS `max`, "
+            + " ${dataSizeFunction} AS `data_size`, "
+            + " NOW() AS `update_time` "
+            + " FROM `${catalogName}`.`${dbName}`.`${tblName}`";
+
+    protected static final String LINEAR_ANALYZE_TEMPLATE = " SELECT "
+            + "CONCAT(${tblId}, '-', ${idxId}, '-', '${colId}') AS `id`, "
+            + "${catalogId} AS `catalog_id`, "
+            + "${dbId} AS `db_id`, "
+            + "${tblId} AS `tbl_id`, "
+            + "${idxId} AS `idx_id`, "
+            + "'${colId}' AS `col_id`, "
+            + "NULL AS `part_id`, "
+            + "ROUND(COUNT(1) * ${scaleFactor}) AS `row_count`, "
+            + "ROUND(NDV(`${colName}`) * ${scaleFactor}) as `ndv`, "
+            + "ROUND(SUM(CASE WHEN `${colName}` IS NULL THEN 1 ELSE 0 END) * ${scaleFactor}) AS `null_count`, "
+            + "${min} AS `min`, "
+            + "${max} AS `max`, "
+            + "${dataSizeFunction} * ${scaleFactor} AS `data_size`, "
+            + "NOW() "
+            + "FROM `${catalogName}`.`${dbName}`.`${tblName}` ${sampleHints} ${limit}";
+
+    protected static final String DUJ1_ANALYZE_TEMPLATE = "SELECT "
+            + "CONCAT('${tblId}', '-', '${idxId}', '-', '${colId}') AS `id`, "
+            + "${catalogId} AS `catalog_id`, "
+            + "${dbId} AS `db_id`, "
+            + "${tblId} AS `tbl_id`, "
+            + "${idxId} AS `idx_id`, "
+            + "'${colId}' AS `col_id`, "
+            + "NULL AS `part_id`, "
+            + "${rowCount} AS `row_count`, "
+            + "${ndvFunction} as `ndv`, "
+            + "IFNULL(SUM(IF(`t1`.`column_key` IS NULL, `t1`.count, 0)), 0) * ${scaleFactor} as `null_count`, "
+            + "'${min}' AS `min`, "
+            + "'${max}' AS `max`, "
+            + "${dataSizeFunction} * ${scaleFactor} AS `data_size`, "
+            + "NOW() "
+            + "FROM ( "
+            + " SELECT t0.`${colName}` as column_key, COUNT(1) as `count` "
+            + " FROM "
+            + " (SELECT `${colName}` FROM `${catalogName}`.`${dbName}`.`${tblName}` "
+            + " ${sampleHints} ${limit}) as `t0` "
+            + " GROUP BY `t0`.`${colName}` "
+            + ") as `t1` ";
+
+    protected static final String ANALYZE_PARTITION_COLUMN_TEMPLATE = " SELECT "
+            + "CONCAT(${tblId}, '-', ${idxId}, '-', '${colId}') AS `id`, "
+            + "${catalogId} AS `catalog_id`, "
+            + "${dbId} AS `db_id`, "
+            + "${tblId} AS `tbl_id`, "
+            + "${idxId} AS `idx_id`, "
+            + "'${colId}' AS `col_id`, "
+            + "NULL AS `part_id`, "
+            + "${row_count} AS `row_count`, "
+            + "${ndv} AS `ndv`, "
+            + "${null_count} AS `null_count`, "
+            + "'${min}' AS `min`, "
+            + "'${max}' AS `max`, "
+            + "${data_size} AS `data_size`, "
             + "NOW() ";
 
     protected AnalysisInfo info;
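The new `LINEAR_ANALYZE_TEMPLATE` extrapolates sampled aggregates back to full-table estimates by multiplying them by `${scaleFactor}`. A tiny sketch, with hypothetical numbers, of how such a factor would be derived and applied (`scaleFactor = totalRows / sampledRows`):

```java
// Hypothetical sketch of linear scale-up of a sampled aggregate.
public class ScaleSketch {
    public static long scale(long sampledValue, long sampledRows, long totalRows) {
        double scaleFactor = (double) totalRows / sampledRows;
        return Math.round(sampledValue * scaleFactor);
    }

    public static void main(String[] args) {
        // e.g. 37 nulls seen in a 10,000-row sample of a 1,000,000-row table
        // -> scaleFactor 100 -> estimated 3,700 nulls overall.
        System.out.println(scale(37, 10_000, 1_000_000));
    }
}
```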
@@ -199,29 +234,51 @@ public long getJobId() {
         return info.jobId;
     }
 
-    // TODO : time cost is intolerable when column is string type, return 0 directly for now.
-    protected String getDataSizeFunction(Column column) {
-        if (column.getType().isStringType()) {
-            return "SUM(LENGTH(`${colName}`))";
+    protected String getDataSizeFunction(Column column, boolean useDuj1) {
+        if (useDuj1) {
+            if (column.getType().isStringType()) {
+                return "SUM(LENGTH(`column_key`) * count)";
+            } else {
+                return "SUM(t1.count) * " + column.getType().getSlotSize();
+            }
+        } else {
+            if (column.getType().isStringType()) {
+                return "SUM(LENGTH(`${colName}`))";
+            } else {
+                return "COUNT(1) * " + column.getType().getSlotSize();
+            }
         }
-        return "COUNT(1) * " + column.getType().getSlotSize();
     }
 
-    // Min value is not accurate while sample, so set it to NULL to avoid optimizer generate bad plan.
     protected String getMinFunction() {
         if (tableSample == null) {
-            return "CAST(MIN(`${colName}`) as ${type}) ";
+            return "to_base64(CAST(MIN(`${colName}`) as ${type})) ";
         } else {
-            return "NULL ";
+            // Min value is not accurate while sample, so set it to NULL to avoid optimizer generate bad plan.
+            return "NULL";
         }
     }
 
+    protected String getNdvFunction(String totalRows) {
+        String sampleRows = "SUM(t1.count)";
+        String onceCount = "SUM(IF(t1.count = 1, 1, 0))";
+        String countDistinct = "COUNT(1)";
+        // DUJ1 estimator: n*d / (n - f1 + f1*n/N)
+        // f1 is the count of element that appears only once in the sample.
+        // (https://github.com/postgres/postgres/blob/master/src/backend/commands/analyze.c)
+        // (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.8637&rep=rep1&type=pdf)
+        // sample_row * count_distinct / ( sample_row - once_count + once_count * sample_row / total_row)
+        String fn = MessageFormat.format("{0} * {1} / ({0} - {2} + {2} * {0} / {3})", sampleRows,
+                countDistinct, onceCount, totalRows);
+        return fn;
+    }
+
     // Max value is not accurate while sample, so set it to NULL to avoid optimizer generate bad plan.
     protected String getMaxFunction() {
         if (tableSample == null) {
-            return "CAST(MAX(`${colName}`) as ${type}) ";
+            return "to_base64(CAST(MAX(`${colName}`) as ${type})) ";
         } else {
-            return "NULL ";
+            return "NULL";
         }
     }
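The `getNdvFunction` added above emits the DUJ1 duplicates-adjusted NDV estimate as a SQL expression: `ndv ≈ n*d / (n - f1 + f1*n/N)`, where `n` is the sampled row count, `d` the distinct values in the sample, `f1` the values seen exactly once in the sample, and `N` the total row count. A numeric sketch of the same formula, with hypothetical inputs, plus the same `MessageFormat` construction for comparison:

```java
import java.text.MessageFormat;

// Hypothetical sketch of the DUJ1 NDV estimator that getNdvFunction() renders as SQL.
public class Duj1Sketch {
    // n = sampled rows, d = distinct values in sample, f1 = singletons in sample,
    // totalRows = N. Returns the estimated number of distinct values in the table.
    public static double estimateNdv(double n, double d, double f1, double totalRows) {
        return n * d / (n - f1 + f1 * n / totalRows);
    }

    // The SQL shape the Java code builds with MessageFormat, for comparison.
    public static String ndvSql(String totalRows) {
        String sampleRows = "SUM(t1.count)";
        String onceCount = "SUM(IF(t1.count = 1, 1, 0))";
        String countDistinct = "COUNT(1)";
        return MessageFormat.format("{0} * {1} / ({0} - {2} + {2} * {0} / {3})",
                sampleRows, countDistinct, onceCount, totalRows);
    }

    public static void main(String[] args) {
        // 1,000 sampled rows, 100 distinct, 50 singletons, 10,000 total rows:
        // 1000*100 / (1000 - 50 + 50*1000/10000) = 100000 / 955 ≈ 104.7
        System.out.println(estimateNdv(1000, 100, 50, 10000));
        System.out.println(ndvSql("${totalRows}"));
    }
}
```

Note how the estimate grows with the singleton count `f1`: many singletons in the sample suggest many unseen distinct values in the unsampled rows.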

@@ -254,12 +311,11 @@ public void setJob(AnalysisJob job) {
         this.job = job;
     }
 
-    protected void runQuery(String sql) {
+    protected void runQuery(String sql, boolean needEncode) {
         long startTime = System.currentTimeMillis();
         try (AutoCloseConnectContext a = StatisticsUtil.buildConnectContext()) {
             stmtExecutor = new StmtExecutor(a.connectContext, sql);
-            stmtExecutor.executeInternalQuery();
-            ColStatsData colStatsData = new ColStatsData(stmtExecutor.executeInternalQuery().get(0));
+            ColStatsData colStatsData = new ColStatsData(stmtExecutor.executeInternalQuery().get(0), needEncode);
             job.appendBuf(this, Collections.singletonList(colStatsData));
         } finally {
             LOG.debug("End cost time in secs: " + (System.currentTimeMillis() - startTime) / 1000);