docs: msq autocompaction (apache#16681)
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
Co-authored-by: Vishesh Garg <vishesh.garg@imply.io>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
4 people authored Oct 17, 2024
1 parent 0e6c388 commit d1b81f3
Showing 7 changed files with 237 additions and 78 deletions.
8 changes: 7 additions & 1 deletion docs/api-reference/automatic-compaction-api.md
@@ -27,7 +27,13 @@ import TabItem from '@theme/TabItem';
~ under the License.
-->

This topic describes the status and configuration API endpoints for [automatic compaction](../data-management/automatic-compaction.md) in Apache Druid. You can configure automatic compaction in the Druid web console or API.
This topic describes the status and configuration API endpoints for [automatic compaction using Coordinator duties](../data-management/automatic-compaction.md#auto-compaction-using-coordinator-duties) in Apache Druid. You can configure automatic compaction in the Druid web console or API.

:::info Experimental

Instead of the automatic compaction API, you can use the supervisor API to submit auto-compaction jobs using compaction supervisors. For more information, see [Auto-compaction using compaction supervisors](../data-management/automatic-compaction.md#auto-compaction-using-compaction-supervisors).

:::

In this topic, `http://ROUTER_IP:ROUTER_PORT` is a placeholder for your Router service address and port. Replace it with the information for your deployment. For example, use `http://localhost:8888` for quickstart deployments.
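
As a rough illustration of how the placeholder address combines with an endpoint path, the following Python sketch builds (but does not send) a status request. The endpoint path `/druid/coordinator/v1/compaction/status`, the `dataSource` query parameter, the datasource name `wikipedia`, and the helper name `compaction_status_request` are assumptions made for this example and are not part of this diff; check them against the API reference for your Druid version.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Quickstart Router address mentioned in the text; replace for your deployment.
ROUTER_URL = "http://localhost:8888"


def compaction_status_request(router_url, datasource=None):
    """Build a GET request for the auto-compaction status endpoint.

    The path and the `dataSource` query parameter are assumptions for
    this sketch. The request is only constructed here, never sent.
    """
    path = "/druid/coordinator/v1/compaction/status"
    query = "?" + urlencode({"dataSource": datasource}) if datasource else ""
    return Request(router_url.rstrip("/") + path + query, method="GET")


# "wikipedia" is a hypothetical datasource name.
req = compaction_status_request(ROUTER_URL, "wikipedia")
print(req.get_method(), req.full_url)
```

To actually call the endpoint, you could pass `req` to `urllib.request.urlopen` against a running cluster.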

3 changes: 2 additions & 1 deletion docs/configuration/index.md
@@ -1050,7 +1050,7 @@ The following table shows the supported configurations for auto-compaction.

|Property|Description|Required|
|--------|-----------|--------|
|type|The task type, this should always be `index_parallel`.|yes|
|type|The task type. If you're using Coordinator duties for auto-compaction, set it to `index_parallel`. If you're using compaction supervisors, set it to `autocompact`. |yes|
|`maxRowsInMemory`|Used to determine when intermediate persists to disk occur. You normally don't need to set this. However, if rows are short in terms of bytes, you may not want to store a million rows in memory, in which case you should set this value.|no (default = 1000000)|
|`maxBytesInMemory`|Used to determine when intermediate persists to disk occur. Normally this is computed internally and you don't need to set it. This value represents the number of bytes to aggregate in heap memory before persisting. It is based on a rough estimate of memory usage, not actual usage. The maximum heap memory usage for indexing is `maxBytesInMemory` * (2 + `maxPendingPersists`).|no (default = 1/6 of max JVM memory)|
|`splitHintSpec`|Used to give a hint to control the amount of data that each first phase task reads. This hint could be ignored depending on the implementation of the input source. See [Split hint spec](../ingestion/native-batch.md#split-hint-spec) for more details.|no (default = size-based split hint spec)|
@@ -1067,6 +1067,7 @@ The following table shows the supported configurations for auto-compaction.
|`taskStatusCheckPeriodMs`|Polling period in milliseconds to check running task statuses.|no (default = 1000)|
|`chatHandlerTimeout`|Timeout for reporting the pushed segments in worker tasks.|no (default = PT10S)|
|`chatHandlerNumRetries`|Retries for reporting the pushed segments in worker tasks.|no (default = 5)|
|`engine` | Engine for compaction. Can be either `native` or `msq`. `msq` uses the MSQ task engine and is only supported with [compaction supervisors](../data-management/automatic-compaction.md#auto-compaction-using-compaction-supervisors). | no (default = native)|

###### Automatic compaction granularitySpec

276 changes: 207 additions & 69 deletions docs/data-management/automatic-compaction.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/ingestion/concurrent-append-replace.md
@@ -34,7 +34,7 @@ If you want to append data to a datasource while compaction is running, you need

In the **Compaction config** for a datasource, enable **Use concurrent locks (experimental)**.

For details on accessing the compaction config in the UI, see [Enable automatic compaction with the web console](../data-management/automatic-compaction.md#web-console).
For details on accessing the compaction config in the UI, see [Enable automatic compaction with the web console](../data-management/automatic-compaction.md#manage-auto-compaction-using-the-web-console).

### Update the compaction settings with the API

12 changes: 6 additions & 6 deletions docs/ingestion/supervisor.md
@@ -23,22 +23,22 @@ sidebar_label: Supervisor
~ under the License.
-->

A supervisor manages streaming ingestion from external streaming sources into Apache Druid.
Supervisors oversee the state of indexing tasks to coordinate handoffs, manage failures, and ensure that the scalability and replication requirements are maintained.
Apache Druid uses supervisors to manage streaming ingestion from external streaming sources into Druid.
Supervisors oversee the state of indexing tasks to coordinate handoffs, manage failures, and ensure that the scalability and replication requirements are maintained. They can also be used to perform [automatic compaction](../data-management/automatic-compaction.md) after data has been ingested.

This topic uses the Apache Kafka term offset to refer to the identifier for records in a partition. If you are using Amazon Kinesis, the equivalent is sequence number.

## Supervisor spec

Druid uses a JSON specification, often referred to as the supervisor spec, to define streaming ingestion tasks.
The supervisor spec specifies how Druid should consume, process, and index streaming data.
Druid uses a JSON specification, often referred to as the supervisor spec, to define tasks used for streaming ingestion or auto-compaction.
The supervisor spec specifies how Druid should consume, process, and index data from an external stream or Druid itself.

The following table outlines the high-level configuration options for a supervisor spec:

|Property|Type|Description|Required|
|--------|----|-----------|--------|
|`type`|String|The supervisor type. One of `kafka`or `kinesis`.|Yes|
|`spec`|Object|The container object for the supervisor configuration.|Yes|
|`type`|String|The supervisor type. For streaming ingestion, this can be `kafka`, `kinesis`, or `rabbit`. For automatic compaction, set the type to `autocompact`.|Yes|
|`spec`|Object|The container object for the supervisor configuration. For automatic compaction, this is the same as the compaction configuration. |Yes|
|`spec.dataSchema`|Object|The schema for the indexing task to use during ingestion. See [`dataSchema`](../ingestion/ingestion-spec.md#dataschema) for more information.|Yes|
|`spec.ioConfig`|Object|The I/O configuration object to define the connection and I/O-related settings for the supervisor and indexing tasks.|Yes|
|`spec.tuningConfig`|Object|The tuning configuration object to define performance-related settings for the supervisor and indexing tasks.|No|
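
To make the auto-compaction case concrete, here is a minimal Python sketch that assembles a supervisor spec with `type` set to `autocompact` and a compaction configuration as `spec`, as the table describes. The helper name `build_compaction_supervisor`, the datasource name `wikipedia`, the `dataSource` and `granularitySpec` keys, and the placement of `engine` directly under `spec` are assumptions made for illustration; verify the exact layout against the automatic compaction documentation before submitting anything.

```python
import json


def build_compaction_supervisor(datasource, engine="native", segment_granularity="DAY"):
    """Assemble a hypothetical auto-compaction supervisor spec as a dict.

    Only `type: autocompact` and the `native`/`msq` engine values come
    from the text; the remaining keys are assumptions for this sketch.
    """
    if engine not in ("native", "msq"):
        raise ValueError(f"unsupported engine: {engine!r}")
    return {
        "type": "autocompact",
        "spec": {
            "dataSource": datasource,  # assumed key name
            "engine": engine,          # assumed to sit at the spec level
            "granularitySpec": {"segmentGranularity": segment_granularity},
        },
    }


# "wikipedia" is a hypothetical datasource name.
supervisor = build_compaction_supervisor("wikipedia", engine="msq")
print(json.dumps(supervisor, indent=2))
```

Building the spec as a plain dictionary keeps the sketch independent of any client library; the serialized JSON is what you would submit through the supervisor API.
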
13 changes: 13 additions & 0 deletions docs/multi-stage-query/known-issues.md
@@ -68,3 +68,16 @@ properties, and the `indexSpec` [`tuningConfig`](../ingestion/ingestion-spec.md#
- The maximum number of elements in a window cannot exceed a value of 100,000.
- To avoid `leafOperators` in MSQ engine, window functions have an extra scan stage after the window stage for cases
where native engine has a non-empty `leafOperator`.

## Automatic compaction

<!-- If you update this list, also update data-management/automatic-compaction.md -->

The following known issues and limitations affect automatic compaction with the MSQ task engine:

- The `metricSpec` field is only supported for certain aggregators. For more information, see [Supported aggregators](../data-management/automatic-compaction.md#supported-aggregators).
- Only dynamic and range-based partitioning are supported.
- Set `rollup` to `true` if and only if `metricSpec` is neither empty nor null.
- You can only partition on string dimensions. However, multi-valued string dimensions are not supported.
- The `maxTotalRows` config is not supported in `DynamicPartitionsSpec`. Use `maxRowsPerSegment` instead.
- Segments can only be sorted on `__time` as the first column.
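
Some of the limitations above can be checked before a configuration is submitted. The following Python sketch covers three of them: the partitioning type, the unsupported `maxTotalRows` setting, and the rollup rule. It is an illustration, not Druid's own validation logic. The function name, the partitioning type strings `dynamic` and `range`, the key `metricsSpec` (the list above writes `metricSpec`), and treating a missing `rollup` as `false` are all assumptions.

```python
def check_msq_compaction_config(config):
    """Return a list of problems found in a compaction config dict.

    Hypothetical helper: it checks only three of the limitations listed
    above, and the key names it reads are assumptions for this sketch.
    """
    problems = []
    partitions = (config.get("tuningConfig") or {}).get("partitionsSpec") or {}
    ptype = partitions.get("type")

    # Only dynamic and range-based partitioning are supported.
    if ptype is not None and ptype not in ("dynamic", "range"):
        problems.append(f"partitioning type {ptype!r} is not supported")

    # maxTotalRows is not supported with dynamic partitioning.
    if ptype == "dynamic" and partitions.get("maxTotalRows") is not None:
        problems.append("use maxRowsPerSegment instead of maxTotalRows")

    # rollup must be true if and only if the metrics list is non-empty.
    has_metrics = bool(config.get("metricsSpec"))
    rollup = bool((config.get("granularitySpec") or {}).get("rollup", False))
    if rollup != has_metrics:
        problems.append("rollup must be true exactly when metrics are defined")

    return problems


bad_config = {
    "tuningConfig": {"partitionsSpec": {"type": "dynamic", "maxTotalRows": 5000000}},
    "granularitySpec": {"rollup": True},
}
for problem in check_msq_compaction_config(bad_config):
    print(problem)
```

Returning a list of messages instead of raising on the first failure lets a caller report every violation at once.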
1 change: 1 addition & 0 deletions website/.spelling
@@ -412,6 +412,7 @@ maxNumSegments
max_map_count
memcached
mergeable
mergeability
metadata
metastores
millis
