HIVE-28755: Statistics Management Task#6199
HIVE-28755: Statistics Management Task#6199DanielZhu58 wants to merge 0 commit intoapache:masterfrom
Conversation
0bdbda0 to
3db2848
Compare
...etastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java
Outdated
Show resolved
Hide resolved
...tore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java
Outdated
Show resolved
Hide resolved
...etastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java
Outdated
Show resolved
Hide resolved
...etastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java
Outdated
Show resolved
Hide resolved
...etastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java
Outdated
Show resolved
Hide resolved
...etastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java
Outdated
Show resolved
Hide resolved
...etastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java
Outdated
Show resolved
Hide resolved
...etastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java
Outdated
Show resolved
Hide resolved
...etastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java
Outdated
Show resolved
Hide resolved
...etastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java
Outdated
Show resolved
Hide resolved
|
| @Override | ||
| public void setConf(Configuration configuration) { | ||
| // we modify conf in setupConf(), so we make a copy | ||
| conf = new Configuration(configuration); |
There was a problem hiding this comment.
why create a new conf object? Why not directly assign it? IMO, it is a costly operation to create conf object
There was a problem hiding this comment.
Acknowledged. Changed.
| // delete all column stats | ||
| @Override | ||
| public void run() { | ||
| if (LOG.isDebugEnabled()) { |
There was a problem hiding this comment.
nit: you don't need this if condition, log.debug() will print to debug logs if debug logging is enabled.
| PersistenceManager pm = ((ObjectStore) ms).getPersistenceManager(); | ||
| Query q = pm.newQuery(MTableColumnStatistics.class); | ||
| q.setFilter(filter); | ||
| q.declareParameters(paramStr); |
There was a problem hiding this comment.
can you also set result fields before querying like this: https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1933? That way we can avoid fetching heavy weight 'Mtable' object and other useless stats in the MTableColumnStatistics class.
There was a problem hiding this comment.
Acknowledged. Changed.
| String tblName = stat.getTable().getTableName(); | ||
| Map<String, String> tblParams = stat.getTable().getParameters(); | ||
| if (tblParams != null && tblParams.getOrDefault(STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY, null) != null) { | ||
| LOG.info("Skipping table {}.{} due to exclude property.", dbName, tblName); |
There was a problem hiding this comment.
We can make this message more intuitive.
nit: "Skipping auto deletion of stats for table {}.{} due to {} property being set on the table.", dbName, tblName
| final StatisticsManagementTask statsTask = new StatisticsManagementTask(); | ||
| final FileSystem fs; | ||
| try { | ||
| fs = FileSystem.get(client.getHadoopConf()); |
There was a problem hiding this comment.
nit: unused variable
| } | ||
|
|
||
| Runnable preRun = () -> { | ||
| System.out.println("Preparing for benchmark..."); |
There was a problem hiding this comment.
nit: replace this with log.info or log.debug
| try { | ||
| fs = FileSystem.get(client.getHadoopConf()); | ||
| client.getHadoopConf().set("hive.metastore.uris", client.getServerURI().toString()); | ||
| client.getHadoopConf().set("metastore.statistics.management.database.pattern", dbName); |
There was a problem hiding this comment.
What is this config? we don't have it in our metastoreconf right?
There was a problem hiding this comment.
Yes, this config key is not in MetastoreConf .
It was intended as a DB filter for StatisticsManagementTask. We can remove it from the HMSBenchmarks.java
| Map<String, String> params = partition.getParameters(); | ||
| // to manually change the "lastAnalyzed" to an old time, ex. 400 days | ||
| params.put("lastAnalyzed", String.valueOf(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(400))); | ||
| client.alterPartition(dbName, tableName, partition); |
There was a problem hiding this comment.
we could improve this by calling alterPartitions() and changing all partitions in one go.
| } catch (Exception e) { | ||
| throw new RuntimeException(e); | ||
| } | ||
| } |
There was a problem hiding this comment.
I think we can also test a scenario where auto stats are skip by setting the table property and verify that stats exists even after running stats management task.
I think it would be better to test it out in a different test class instead of benchmark test
|
saihemanth-cloudera
left a comment
There was a problem hiding this comment.
Incorrect rebase. The latest PR inadvertently pulled in 32 commits into your patch. Please correct it.
7b99abd to
9e7f289
Compare
|
During the rebase and sync process, I accidentally closed this PR. |



What changes were proposed in this pull request?
To add a new StatisticsManagementTask.java to automatically delete the old stats.
Why are the changes needed?
To help reduce the old or stale stats.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Manual tests and unit tests.
For reviewers: What this PR does
This PR introduces a new “Statistics Management Task” in the Hive metastore which periodically auto-deletes stale column statistics, plus configuration knobs.
In MetastoreConf.java
Three new configuration variables are added:
STATISTICS_MANAGEMENT_TASK_FREQUENCY
Meaning: Controls how often the StatisticsManagementTask runs, for tables that have statistics.auto.deletion=true in their table properties.
STATISTICS_RETENTION_PERIOD
Meaning: The retention period for stats. If a table/partition’s stats are older than this, they become candidates for auto deletion.
STATISTICS_AUTO_DELETION
In StatisticsManagementTask.java
Defines a new StatisticsManagementTask implementing MetastoreTaskThread. Its purpose is to:
Fetch STATISTICS_RETENTION_PERIOD and STATISTICS_AUTO_DELETION from conf. If retention <= 0 or auto deletion is disabled, log and return.
Compute lastAnalyzedThreshold = (now - retentionMillis) / 1000 (in seconds).
Use HMSHandler.getMSForConf(conf) to get RawStore and a PersistenceManager, then query MTableColumnStatistics rows where lastAnalyzed < threshold.
In short, this class implements a background cleanup task that scans MTableColumnStatistics for stale entries and deletes them via the metastore client.
In BenchmarkTool.java
BenchmarkTool can now benchmark the new statistics management task for different numbers of tables.
In HMSBenchmarks.java Test
Constructs a dedicated database name and table prefix based on tableCount and BenchData.
Gets an HMSClient and instantiates a StatisticsManagementTask.
Configures the client Hadoop conf:
hive.metastore.uris = metastore URI
metastore.statistics.management.database.pattern = dbName (so the task focuses on this DB)
Sets the task’s conf and creates the database and tableCount tables:
Simulates old stats:
For each partition, sets lastAnalyzed to now - 400 days in the partition parameters and alters the partition.
Post-run assertion:
Re-scans all partitions; if any partition parameters still contain lastAnalyzed, it throws an AssertionError("Partition stats not deleted for table: " + tableName).
In other words, this is an end-to-end microbenchmark for the new StatisticsManagementTask that both measures performance and verifies that “old” partition stats are actually cleaned up.