Skip to content

Commit

Permalink
GITBOOK-1930: add Pinot storage model doc
Browse files Browse the repository at this point in the history
  • Loading branch information
kelseiv authored and gitbook-bot committed Mar 26, 2024
1 parent 9e644bc commit 0f5173d
Show file tree
Hide file tree
Showing 42 changed files with 191 additions and 129 deletions.
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,6 @@ For a conceptual overview that explains how Pinot works, check out the Concepts

To understand the distributed systems architecture that explains Pinot's operating model, take a look at our basic architecture section:

{% content-ref url="basics/concepts/architecture.md" %}
[architecture.md](basics/concepts/architecture.md)
{% content-ref url="basics/architecture.md" %}
[architecture.md](basics/architecture.md)
{% endcontent-ref %}
33 changes: 17 additions & 16 deletions SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,22 +5,23 @@
## Basics

* [Concepts](basics/concepts/README.md)
* [Architecture](basics/concepts/architecture.md)
* [Components](basics/concepts/components/README.md)
* [Cluster](basics/concepts/components/cluster/README.md)
* [Tenant](basics/concepts/components/cluster/tenant.md)
* [Server](basics/concepts/components/cluster/server.md)
* [Controller](basics/concepts/components/cluster/controller.md)
* [Broker](basics/concepts/components/cluster/broker.md)
* [Minion](basics/concepts/components/cluster/minion.md)
* [Table](basics/concepts/components/table/README.md)
* [Segment](basics/concepts/components/table/segment/README.md)
* [Deep Store](basics/concepts/components/table/segment/deep-store.md)
* [Segment threshold](basics/concepts/components/table/segment/segment-threshold.md)
* [Segment retention](basics/concepts/components/table/segment/segment-retention.md)
* [Schema](basics/concepts/components/table/schema.md)
* [Time boundary](basics/concepts/components/table/time-boundary.md)
* [Pinot Data Explorer](basics/concepts/components/exploring-pinot.md)
* [Pinot storage model](basics/concepts/pinot-storage-model.md)
* [Architecture](basics/architecture.md)
* [Components](basics/components/README.md)
* [Cluster](basics/components/cluster/README.md)
* [Tenant](basics/components/cluster/tenant.md)
* [Server](basics/components/cluster/server.md)
* [Controller](basics/components/cluster/controller.md)
* [Broker](basics/components/cluster/broker.md)
* [Minion](basics/components/cluster/minion.md)
* [Table](basics/components/table/README.md)
* [Segment](basics/components/table/segment/README.md)
* [Deep Store](basics/components/table/segment/deep-store.md)
* [Segment threshold](basics/concepts/segment-threshold.md)
* [Segment retention](basics/concepts/segment-retention.md)
* [Schema](basics/components/table/schema.md)
* [Time boundary](basics/concepts/time-boundary.md)
* [Pinot Data Explorer](basics/components/exploring-pinot.md)
* [Getting Started](basics/getting-started/README.md)
* [Running Pinot locally](basics/getting-started/running-pinot-locally.md)
* [Running Pinot in Docker](basics/getting-started/running-pinot-in-docker.md)
Expand Down
24 changes: 12 additions & 12 deletions basics/concepts/architecture.md → basics/architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ description: >-
# Architecture

{% hint style="info" %}
We recommend that you read [Basic Concepts](./) to better understand the terms used in this guide.
We recommend that you read [Basic Concepts](concepts/) to better understand the terms used in this guide.
{% endhint %}

Apache Pinot™ is a distributed OLAP database designed to serve real-time, user-facing use cases, which means handling large volumes of data and many concurrent queries with very low query latencies. Pinot supports the following requirements:
Expand All @@ -29,14 +29,14 @@ To accommodate large data volumes with stringent latency and concurrency require

## Core components

As described in [Apache Pinot™ Concepts](./), Pinot has four node types:
As described in [Apache Pinot™ Concepts](concepts/), Pinot has four node types:

* [Controller](components/cluster/controller.md)
* [Broker](components/cluster/broker.md)
* [Server](components/cluster/server.md)
* [Minion](components/cluster/minion.md)

![](../../.gitbook/assets/Pinot-architecture.svg)
![](../.gitbook/assets/Pinot-architecture.svg)

### Apache Helix and ZooKeeper

Expand Down Expand Up @@ -77,11 +77,11 @@ Helix uses ZooKeeper to maintain cluster state. ZooKeeper sends Helix spectators

Zookeeper, as a first-class citizen of a Pinot cluster, may use the well-known `ZNode` structure for operations and troubleshooting purposes. Be advised that this structure can change in future Pinot releases.

![Pinot's Zookeeper Browser UI](../../.gitbook/assets/.unused/Zookeeper\_UI.png)
![Pinot's Zookeeper Browser UI](../.gitbook/assets/.unused/Zookeeper\_UI.png)

### Controller

The Pinot [controller](components/cluster/controller.md) schedules and re-schedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, it schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (e.g., ingest of [real-time tables](../data-import/pinot-stream-ingestion/) and [offline tables](../data-import/batch-ingestion/)). It can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.
The Pinot [controller](components/cluster/controller.md) schedules and re-schedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, it schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (e.g., ingest of [real-time tables](data-import/pinot-stream-ingestion/) and [offline tables](data-import/batch-ingestion/)). It can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.

#### Fault tolerance

Expand All @@ -95,7 +95,7 @@ The controller provides a REST interface that allows read and write access to al

The [broker's](components/cluster/broker.md) responsibility is to route queries to the appropriate [server](components/cluster/server.md) instances, or in the case of multi-stage queries, to compute a complete query plan and distribute it to the servers required to execute it. The broker collects and merges the responses from all servers into a final result, then sends the result back to the requesting client. The broker exposes an HTTP endpoint that accepts SQL queries in JSON format and returns the response in JSON.

Each broker maintains a query routing table. The routing table maps segments to the servers that store them. (When replication is configured on a table, each segment is stored on more than one server.) The broker computes multiple routing tables depending on the configured [routing](../operators/operating-pinot/tuning/routing.md) strategy for a table. The default strategy is to balance the query load across all available servers.
Each broker maintains a query routing table. The routing table maps segments to the servers that store them. (When replication is configured on a table, each segment is stored on more than one server.) The broker computes multiple routing tables depending on the configured [routing](operators/operating-pinot/tuning/routing.md) strategy for a table. The default strategy is to balance the query load across all available servers.

{% hint style="info" %}
Advanced routing strategies are available, such as replica-aware routing, partition-based routing, and minimal server selection routing.
Expand All @@ -122,7 +122,7 @@ Advanced routing strategies are available, such as replica-aware routing, partit
Every query processed by a broker uses the single-stage engine or the [multi-stage engine](https://docs.pinot.apache.org/reference/multi-stage-engine). For single-stage queries, the broker does the following:

* Computes query routes based on the routing strategy defined in the [table](components/table/) configuration.
* Computes the list of segments to query on each [server](components/cluster/server.md). (See [routing](../operators/operating-pinot/tuning/routing.md) for further details on this process.)
* Computes the list of segments to query on each [server](components/cluster/server.md). (See [routing](operators/operating-pinot/tuning/routing.md) for further details on this process.)
* Sends the query to each of those servers for local execution against their segments.
* Receives the results from each server and merges them.
* Sends the query result to the client.
Expand Down Expand Up @@ -179,15 +179,15 @@ Offline servers host segments created by ingesting batch data. The controller wr

Because offline tables tend to have long retention periods, offline servers tend to scale based on the size of the data they store.

![](../../.gitbook/assets/OfflineServer.jpg)
![](../.gitbook/assets/OfflineServer.jpg)

#### Real-time servers

Real-time servers ingest data from streaming sources, like Apache Kafka®, Apache Pulsar®, or AWS Kinesis. Streaming data ends up in conventional segment files just like batch data, but is first accumulated in an in-memory data structure known as a consuming segment. Each message consumed from a streaming source is written immediately to the relevant consuming segment, and is available for query processing from the consuming segment immediately, since consuming segments participate in query processing as first-class citizens. Consuming segments get flushed to disk periodically based on a completion threshold, which can be calculated by row count, ingestion time, or segment size. A flushed segment on a real-time table is called a _completed_ segment, and is functionally equivalent to a segment created during offline ingest.

Real-time servers tend to be scaled based on the rate at which they ingest streaming data.

![](../../.gitbook/assets/real-time-flow.svg)
![](../.gitbook/assets/real-time-flow.svg)

### Minion

Expand All @@ -201,9 +201,9 @@ Pinot [tables](components/table/) exist in two varieties: offline (or batch) and

### Offline (batch) ingest

![](../../.gitbook/assets/OfflineServer.jpg)
![](../.gitbook/assets/OfflineServer.jpg)

Pinot ingests batch data using an [ingestion job](../data-import/batch-ingestion/), which follows a process like this:
Pinot ingests batch data using an [ingestion job](data-import/batch-ingestion/), which follows a process like this:

1. The job transforms a raw data source (such as a CSV file) into [segments](components/table/segment/). This is a potentially complex process resulting in a file that is typically several hundred megabytes in size.
2. The job then transfers the file to the cluster's [deep store](components/table/segment/deep-store.md) and notifies the [controller](components/cluster/controller.md) that a new segment exists.
Expand All @@ -225,4 +225,4 @@ Ingestion is established at the time a real-time table is created, and continues
7. The controller and the server create a new consuming segment to continue real-time ingestion.
8. The controller marks the newly committed segment as online. Brokers then discover the new segment through the Helix notification mechanism, allowing them to route queries to it in the usual fashion.

![](../../.gitbook/assets/real-time-flow.svg)
![](../.gitbook/assets/real-time-flow.svg)
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ Apache Pinot™ is a database designed to deliver highly concurrent, ultra-low-l

Learn about the major components and logical abstractions used in Pinot.

![](../../../.gitbook/assets/pinot-system-architecture.png)
![](../../.gitbook/assets/pinot-system-architecture.png)

#### Operator reference

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ description: >-

# Cluster

A Pinot [_cluster_](../../../components/cluster/components/cluster/) is a collection of the software processes and hardware resources required to ingest, store, and process data. For detail about Pinot cluster components, see [Physical architecture](./#physical-architecture).
A Pinot [_cluster_](components/cluster/) is a collection of the software processes and hardware resources required to ingest, store, and process data. For detail about Pinot cluster components, see [Physical architecture](./#physical-architecture).

A Pinot cluster consists of the following processes, which are typically deployed on separate hardware resources in production. In development, they can fit comfortably into Docker containers on a typical laptop:

Expand Down Expand Up @@ -56,13 +56,13 @@ Another way to visualize the cluster is a logical view, where:
* Tenants contain [tables](../table/)
* Tables contain [segments](../table/segment/)

![](../../../../.gitbook/assets/ClusterLogical.jpg)
![](../../../.gitbook/assets/ClusterLogical.jpg)

## Set up a Pinot cluster

Typically, there is only one cluster per environment/data center. There is no need to create multiple Pinot clusters because Pinot supports [tenants](tenant.md).

To set up a cluster, see one of the following guides:

* [Running Pinot in Docker](../../../getting-started/running-pinot-in-docker.md)
* [Running Pinot locally](../../../getting-started/running-pinot-locally.md)
* [Running Pinot in Docker](../../getting-started/running-pinot-in-docker.md)
* [Running Pinot locally](../../getting-started/running-pinot-locally.md)
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ Pinot brokers take query requests from client processes, scatter them to applica

A production Pinot cluster contains many brokers. In general, the more brokers, the more concurrent queries a cluster can process, and the lower latency it can deliver on queries.

![Broker interaction with other components](../../../../.gitbook/assets/broker-diagram.jpg)
![Broker interaction with other components](../../../.gitbook/assets/broker-diagram.jpg)

Pinot brokers are modeled as Helix **spectators**. They need to know the location of each segment of a table (and each replica of the segments) and route requests to the appropriate server that hosts the segments of the table being queried.

Expand All @@ -22,7 +22,7 @@ In the case of hybrid tables, the brokers ensure that the overlap between real-t

Let's take this example, we have real-time data for five days - March 23 to March 27, and offline data has been pushed until Mar 25, which is two days behind real-time. The brokers maintain this time boundary.

![](../../../../.gitbook/assets/broker-time-boundary-diagram.jpg)
![](../../../.gitbook/assets/broker-time-boundary-diagram.jpg)

Suppose, we get a query to this table : `select sum(metric) from table`. The broker will split the query into 2 queries based on this time boundary – one for offline and one for real-time. This query becomes `select sum(metric) from table_REALTIME where date >= Mar 25`\
and `select sum(metric) from table_OFFLINE where date < Mar 25`
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,9 @@ description: >-

# Controller

The Pinot controller schedules and reschedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, the Pinot controller schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (for example, ingest of [real-time tables](../../../../configuration-reference/table.md#real-time-table-config) and [offline tables](../../../../configuration-reference/table.md#offline-table)). The Pinot controller can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.
The Pinot controller schedules and reschedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, the Pinot controller schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (for example, ingest of [real-time tables](../../../configuration-reference/table.md#real-time-table-config) and [offline tables](../../../configuration-reference/table.md#offline-table)). The Pinot controller can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.

The controller exposes a [REST API endpoint](../../../../users/api/controller-api-reference.md) for cluster-wide administrative operations as well as a web-based query console to execute interactive SQL queries and perform simple administrative tasks.
The controller exposes a [REST API endpoint](../../../users/api/controller-api-reference.md) for cluster-wide administrative operations as well as a web-based query console to execute interactive SQL queries and perform simple administrative tasks.

The Pinot controller is responsible for the following:

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ bin/pinot-admin.sh StartMinion \

## Interfaces

![](../../../../.gitbook/assets/task-interface-diagram.png)
![](../../../.gitbook/assets/task-interface-diagram.png)

### Pinot task generator

Expand Down Expand Up @@ -233,11 +233,11 @@ When performing ingestion at scale remember that Pinot will list all of the file

### RealtimeToOfflineSegmentsTask

See [Pinot managed Offline flows](../../../../operators/operating-pinot/pinot-managed-offline-flows.md) for details.
See [Pinot managed Offline flows](../../../operators/operating-pinot/pinot-managed-offline-flows.md) for details.

### MergeRollupTask

See [Minion merge rollup task](../../../../operators/operating-pinot/minion-merge-rollup-task.md) for details.
See [Minion merge rollup task](../../../operators/operating-pinot/minion-merge-rollup-task.md) for details.

## Enable tasks

Expand Down Expand Up @@ -330,27 +330,27 @@ In the Pinot UI, there is **Minion Task Manager** tab under **Cluster Manager**

This one shows which types of Minion Task have been used. Essentially which task types have created their task queues in Helix.

<figure><img src="../../../../.gitbook/assets/minion-task-manager-overview.png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/minion-task-manager-overview.png" alt=""><figcaption></figcaption></figure>

Clicking into a task type, one can see the tables using that task. And a few buttons to stop the task queue, cleaning up ended tasks etc.

<figure><img src="../../../../.gitbook/assets/minion-task-manager-task-details.png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/minion-task-manager-task-details.png" alt=""><figcaption></figcaption></figure>

Then clicking into any table in this list, one can see how the task is configured for that table. And the task metadata if there is one in ZK. For example, MergeRollupTask tracks a watermark in ZK. If the task is cron scheduled, the current and next schedules are also shown in this page like below.

<figure><img src="../../../../.gitbook/assets/minion-rollup-task-scheduled.png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/minion-rollup-task-scheduled.png" alt=""><figcaption></figcaption></figure>

<figure><img src="../../../../.gitbook/assets/minion-rollup-task-completed.png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/minion-rollup-task-completed.png" alt=""><figcaption></figcaption></figure>

At the bottom of this page is a list of tasks generated for this table for this specific task type. Like here, one MergeRollup task has been generated and completed.

Clicking into a task from that list, we can see start/end time for it, and the subtasks generated for that task (as context, one minion task can have multiple subtasks to process data in parallel). In this example, it happened to have one sub-task here, and it shows when it starts and stops and which minion worker it's running.

<figure><img src="../../../../.gitbook/assets/minion-task-manager-subtask.png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/minion-task-manager-subtask.png" alt=""><figcaption></figcaption></figure>

Clicking into this subtask, one can see more details about it like the input task configs and error info if the task failed.

<figure><img src="../../../../.gitbook/assets/minion-task-manager-subtask-config.png" alt=""><figcaption></figcaption></figure>
<figure><img src="../../../.gitbook/assets/minion-task-manager-subtask-config.png" alt=""><figcaption></figcaption></figure>

## Task-related metrics

Expand Down
Loading

0 comments on commit 0f5173d

Please sign in to comment.