docs: concepts updates for v0.9 #1047

Merged
merged 14 commits into from
Jul 22, 2024
16 changes: 8 additions & 8 deletions docs/nightly/en/user-guide/concepts/architecture.md
@@ -7,19 +7,19 @@ level, there are three main components in GreptimeDB architecture: Datanode,
Frontend and Metasrv.

- [**Metasrv**](/contributor-guide/metasrv/overview.md) is the central command of
GreptimeDB cluster. In typical deployment, at least three nodes is required to
GreptimeDB cluster. In a typical cluster deployment, at least three nodes are required to
set up a reliable _Metasrv_ mini-cluster. _Metasrv_ manages database and table
information, including how data is spread across the cluster and where to route
requests. It also keeps monitoring the availability and performance of _Datanode_s,
to ensure its routing table is valid and up-to-date.
- [**Frontend**](/contributor-guide/frontend/overview.md) is a stateless
component that can scale to as many as needed. It accepts incoming request,
authenticates it, translates it from various protocols into GreptimeDB
cluster's internal one, and forwards to certain \_Datanode_s under guidance from Metasrv.
component that can scale out to as many instances as needed. It accepts incoming requests,
authenticates them, translates them from various protocols into GreptimeDB's
internal gRPC protocol, and forwards them to the appropriate _Datanode_s according to the table sharding information from Metasrv.
- [**Datanodes**](/contributor-guide/datanode/overview.md) hold regions of
tables in Greptime DB cluster. It accepts read and write request sent
from _Frontend_, and executes it against its data.
tables in the GreptimeDB cluster. Each Datanode accepts read and write requests sent
from _Frontend_, executes them against its data, and returns the results.

These three components will be combined as GreptimeDB standalone mode, for local or embedded development.
These three components can be combined into a single binary, which is GreptimeDB's standalone mode, for local or embedded development.

You can refer to [architecture](/contributor-guide/overview.md) in developer guide to learn more details about how components work together.
You can refer to [architecture](/contributor-guide/overview.md) in the contributor guide to learn more about how the components work together.
76 changes: 57 additions & 19 deletions docs/nightly/en/user-guide/concepts/data-model.md
@@ -3,28 +3,45 @@
## Model

GreptimeDB uses the time-series table to guide the organization, compression, and expiration management of data.
The data model mainly based on the table model in relational databases while considering the characteristics of time-series data.
The data model is mainly based on the table model in relational databases while considering the characteristics of metrics, logs and events data.

All data in GreptimeDB is organized into tables with names. Each data item in a table consists of three types of columns: `Tag`, `Timestamp`, and `Field`.

- Table names are often the same as the indicator names or metric names.
- Table names are often the same as the indicator names, log source names, or metric names.
- `Tag` columns store metadata that is commonly queried.
The values in `Tag` columns are labels attached to the collected indicators,
generally used to describe a particular characteristic of these indicators.
The values in `Tag` columns are labels attached to the collected sources,
generally used to describe a particular characteristic of these sources.
`Tag` columns are indexed, making queries on tags performant.
- `Timestamp` is the root of a time-series database.
- `Timestamp` is the root of a metrics, logs and events database.
It represents the date and time when the data was generated.
Timestamps are indexed, making queries on timestamps performant.
A table can only have one timestamp column.
A table can only have one timestamp column, which is called the time index.
- The other columns are `Field` columns.
Fields contain the data indicators that are collected.
These indicators are generally numerical values
but may also be other types of data, such as strings or geographic locations.
Fields are not indexed,
and queries on field values scan all data in the table.
This can be resource-intensive and unperformant.
Fields contain the data indicators or log contents that are collected.
These fields are generally numerical values or string values,
but may also be other types of data, such as geographic locations.
Fields are not indexed by default,
and queries on field values scan all data in the table, which can be resource-intensive and slow.
However, a string field can enable the [full-text index](/user-guide/logs/query-logs#full-text-index-for-accelerated-search) to speed up queries such as log searches.

Suppose we have a time-series table called `system_metrics` that monitors the resource usage of a standalone device. The data model for this table is as follows:
### Metric Table

Suppose we have a time-series table called `system_metrics` that monitors the resource usage of a standalone device:

```sql
CREATE TABLE IF NOT EXISTS system_metrics (
host STRING,
idc STRING,
cpu_util DOUBLE,
memory_util DOUBLE,
disk_util DOUBLE,
ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY(host, idc),
TIME INDEX(ts)
);
```

The data model for this table is as follows:

![time-series-table-model](/time-series-data-model.svg)

@@ -39,22 +56,43 @@ Those are very similar to the table model everyone is familiar with. The differe
- The `cpu_util`, `memory_util`, `disk_util`, and `load` columns in the `Field` columns represent
the CPU utilization, memory utilization, disk utilization, and load of the machine, respectively.
These columns contain the actual data and are not indexed, but they can be efficiently computed and evaluated, such as the latest value, maximum/minimum value, average, percentage, and so on. Please avoid using `Field` columns in query conditions,
which is highly resource-intensive and unperformant.
which is highly resource-intensive and underperformant.
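
As a rough illustration of this guidance, the sketch below filters `system_metrics` only on a tag column and the time index, and uses the field columns only in the aggregated output. The host value `host1` and the one-hour window are hypothetical, and the interval syntax is assumed to be available in your GreptimeDB version.

```sql
-- Filter on the indexed tag column (host) and the time index (ts).
-- Field columns are aggregated in the SELECT list, not used in WHERE.
SELECT
  host,
  avg(cpu_util) AS avg_cpu,
  max(memory_util) AS max_memory
FROM system_metrics
WHERE host = 'host1'
  AND ts >= now() - INTERVAL '1 hour'
GROUP BY host;
```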

### Log Table

Another example is creating a log table for access logs:

```sql
CREATE TABLE access_logs (
access_time TIMESTAMP TIME INDEX,
remote_addr STRING,
http_status STRING,
http_method STRING,
http_refer STRING,
user_agent STRING,
request STRING FULLTEXT,
PRIMARY KEY (remote_addr, http_status, http_method, http_refer, user_agent)
);
```

- The time index column is `access_time`.
- `remote_addr`, `http_status`, `http_method`, `http_refer` and `user_agent` are tags.
- `request` is a field with the full-text index enabled through the [`FULLTEXT` column option](/reference/sql/create#fulltext-column-option).
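
A hedged sketch of how such a table might be queried: narrow the scan with a tag and the time index first, then search `request` through the full-text index. It assumes the `MATCHES` function from GreptimeDB's log query guide is available in your version; the status value and the search term are hypothetical.

```sql
-- Tag and time-index filters narrow the scan;
-- MATCHES (assumed available) searches the FULLTEXT-indexed field.
SELECT access_time, remote_addr, request
FROM access_logs
WHERE http_status = '500'
  AND access_time >= now() - INTERVAL '1 day'
  AND MATCHES(request, 'checkout')
ORDER BY access_time DESC
LIMIT 20;
```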

To learn how to indicate `Tag`, `Timestamp`, and `Field` columns, please refer to [table management](../table-management.md#create-a-table) and [CREATE statement](/reference/sql/create.md).

Of course, you can place metrics and logs in a single table at any time, which is also a key capability provided by GreptimeDB.

## Design Considerations

GreptimeDB is designed on top of Table for the following reasons:

- The Table model has a broad group of users and it's easy to learn, that we just introduced the concept of time index to the time series.
- The Table model has a broad user base and is easy to learn; we simply introduce the concept of a time index for metrics, logs and events.
- Schema is meta-data to describe data characteristics, and it's more convenient for users to manage and maintain. By introducing the concept of schema version, we can better manage data compatibility.
- Schema brings enormous benefits for optimizing storage and computing with its information like types, lengths, etc., on which we could conduct targeted optimizations.
- When we have the Table model, it's natural for us to introduce SQL and use it to process association analysis and aggregation queries between various index tables, offsetting the learning and use costs for users.
- Use a multi-value model where a row of data can have multiple metric columns,
- With the Table model, it's natural for us to introduce SQL and use it for correlation analysis and aggregation queries across various tables, reducing the learning and usage costs for users.
- Use a multi-value model where a row of data can have multiple field columns,
instead of the single-value model adopted by OpenTSDB and Prometheus.
The multi-value model is used to model data sources, where a metric can have multiple values represented by fields.
The advantage of the multi-value model is that it can write multiple values to the database at once,
while the single-value model requires splitting the data into multiple records.
The advantage of the multi-value model is that it can write or read multiple values at once, reducing transfer traffic and simplifying queries. In contrast, the single-value model requires splitting the data into multiple records. Read the [blog](https://greptime.com/blogs/2024-05-09-prometheus) for more detailed benefits of the multi-value model.
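
The difference can be sketched with two inserts. The first writes one multi-value row into the `system_metrics` table from the metric table example. The second shows the same sample in a single-value layout, using a hypothetical table `system_metrics_single` that exists only for comparison and is not part of GreptimeDB. All values are made up, and `ts` is given as epoch milliseconds.

```sql
-- Multi-value model: one row carries all three fields.
INSERT INTO system_metrics (host, idc, cpu_util, memory_util, disk_util, ts)
VALUES ('host1', 'idc_a', 11.8, 10.3, 10.3, 1721642400000);

-- Single-value model (hypothetical comparison table):
-- the metric name becomes a tag and each row holds one value.
CREATE TABLE IF NOT EXISTS system_metrics_single (
  host STRING,
  idc STRING,
  metric STRING,
  val DOUBLE,
  ts TIMESTAMP TIME INDEX,
  PRIMARY KEY (host, idc, metric)
);

-- The same sample is split into one record per metric.
INSERT INTO system_metrics_single (host, idc, metric, val, ts)
VALUES
  ('host1', 'idc_a', 'cpu_util', 11.8, 1721642400000),
  ('host1', 'idc_a', 'memory_util', 10.3, 1721642400000),
  ('host1', 'idc_a', 'disk_util', 10.3, 1721642400000);
```

In the single-value layout, reading all three metrics for one timestamp means fetching three rows and pivoting them, whereas the multi-value row returns them together.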

GreptimeDB uses SQL to manage table schema. Please refer to [table management](../table-management.md) for more information. However, our definition of schema is not mandatory and leans towards a **schemaless** approach, similar to MongoDB. For more details, see [Automatic Schema Generation](../write-data/overview.md#automatic-schema-generation).
25 changes: 21 additions & 4 deletions docs/nightly/en/user-guide/concepts/key-concepts.md
@@ -5,14 +5,18 @@ these building blocks of GreptimeDB.

## Database

Similar to *database* in relational databases, time-series database is the minimal unit of
data container, within which data can be managed and computed.
Similar to *database* in relational databases, a database is the minimal unit of
data container, within which data can be managed and computed. Users can use the database to achieve data isolation, creating a tenant-like effect.

## Time-Series Table

GreptimeDB designed time-series table to be the basic unit of data storage.
It is similar to a table in a traditional relational database, but requires a timestamp column(We call it time index).
The table holds a set of data that shares a common schema.
It is similar to a table in a traditional relational database, but requires a timestamp column (we call it the **time index**).
The table holds a set of data that shares a common schema. It is a collection of rows and columns:

* Column: a vertical set of values in a table. GreptimeDB classifies columns as time index, tag, or field.
* Row: a horizontal set of values in a table.

It can be created using SQL `CREATE TABLE`, or inferred from the input data structure using the auto-schema feature.
In a distributed deployment, a table can be split into multiple partitions that sit on different datanodes.

@@ -32,3 +36,16 @@ flexibility when creating a table. Once the table is created, data of the same
column must share common data type.

Find all the supported data types in [Data Types](/reference/sql/data-types.md).

## Index

The `index` is a performance-tuning method that allows faster retrieval of records. GreptimeDB uses the [inverted index](/contributor-guide/datanode/data-persistence-indexing#inverted-index) to accelerate queries.

## View

The `view` is a virtual table that is derived from the result set of a SQL query. It contains rows and columns just like a real table, but it doesn’t store any data itself.
The data displayed in a view is retrieved dynamically from the underlying tables each time the view is queried.
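
A minimal sketch of defining and querying a view, assuming the standard `CREATE VIEW ... AS SELECT` form is supported in your GreptimeDB version. The view name `host_cpu` is hypothetical, and `system_metrics` is the table from the data model example.

```sql
-- The view stores only the query definition, not the rows.
CREATE VIEW host_cpu AS
SELECT host, ts, cpu_util
FROM system_metrics;

-- Each query against the view reads the underlying table at that moment.
SELECT * FROM host_cpu WHERE host = 'host1';
```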

## Flow

A `flow` in GreptimeDB refers to a [continuous aggregation](/user-guide/continuous-aggregation/overview) process that continuously updates and materializes aggregated data based on incoming data.
7 changes: 4 additions & 3 deletions docs/nightly/en/user-guide/concepts/overview.md
@@ -1,16 +1,17 @@
# Overview

- [Why GreptimeDB](./why-greptimedb.md): This document outlines the features and benefits of GreptimeDB, including its flexible architecture that allows for deployment in various environments, from embedded to cloud-native. GreptimeDB is also cost-effective, high-performance, and user-friendly.
- [Data Model](./data-model.md): This document describes the data model of GreptimeDB, including table schema, index columns, etc.
- [Why GreptimeDB](./why-greptimedb.md): This document outlines the features and benefits of GreptimeDB, including its unified design for metrics, logs and events, and its cloud-native, flexible architecture that allows for deployment in various environments, from embedded to cloud. GreptimeDB is also cost-effective, high-performance, and user-friendly.
- [Data Model](./data-model.md): This document describes the data model of GreptimeDB, including table schema, time index constraint, etc.
- [Architecture](./architecture.md): Get the cloud-native architecture of GreptimeDB.
- [Storage Location](./storage-location.md): This document describes the storage location of GreptimeDB, including local disk, HDFS, and cloud object storage such as S3, Azure Blob Storage, etc.
- [Key Concepts](./key-concepts.md): This document describes the key concepts of GreptimeDB, including table, time index, table region and data types.
- [Features that You Concern](./features-that-you-concern.md): Describes some features that may be concerned about a TSDB.
- [Features that You Concern](./features-that-you-concern.md): Describes features that users of a unified metrics, logs & events database may be concerned about.

## Read More

Get GreptimeDB roadmap and architecture design from blog posts:

- [This Time, for Real - GreptimeDB is Now Open Source](https://greptime.com/blogs/2022-11-15-this-time-for-real)
- [Unifying Logs and Metrics](https://greptime.com/blogs/2024-06-25-logs-and-metrics)
- [GreptimeDB Internal Design — Distributed, Cloud-native, and Enhanced Analytical Ability for Time Series](https://greptime.com/blogs/2022-12-08-GreptimeDB-internal-design)
- [GreptimeDB Storage Engine Design - Catering to Time Series Scenarios](https://greptime.com/blogs/2022-12-21-storage-engine-design)
15 changes: 12 additions & 3 deletions docs/nightly/en/user-guide/concepts/why-greptimedb.md
@@ -1,9 +1,18 @@
# Why GreptimeDB

GreptimeDB is a cloud-native, distributed and open source Time Series Database (TSDB), it's designed to process, store and analyze vast amounts of time-series data.
GreptimeDB is a cloud-native, distributed and open source time series database designed to process, store and analyze vast amounts of metrics, logs & events data (support for traces is planned).

It's highly efficient at handling hybrid processing workloads which involve both time-series and real-time analysis, while providing users with great experience.
To gain insight into the motivations that led to the development of GreptimeDB, we recommend reading our blog post titled ["This Time, for Real"](https://greptime.com/blogs/2022-11-15-this-time-for-real).
In this article, we delve into the reasons behind Greptime's high performance and some highlighted features.

To gain insight into the motivations that led to the development of GreptimeDB, we recommend reading our blog posts titled ["This Time, for Real"](https://greptime.com/blogs/2022-11-15-this-time-for-real) and ["Unifying Logs and Metrics"](https://greptime.com/blogs/2024-06-25-logs-and-metrics).

In these documents, we delve into the reasons behind Greptime's high performance and some highlighted features.

## Unified metrics, logs and events

Through the model design of [time series tables](./data-model), native support for SQL, and the hybrid workload support brought by the storage-computation separation architecture, GreptimeDB can handle metrics, logs, and events together. This enhances correlation analysis across different time series data and simplifies the architecture, deployment and APIs for users.

Read the [SQL example](/user-guide/overview#sql-query-example) for detailed info.

## Availability, Scalability, and Elasticity

6 changes: 3 additions & 3 deletions docs/nightly/zh/user-guide/concepts/architecture.md
@@ -6,8 +6,8 @@

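- [**Metasrv**](/contributor-guide/metasrv/overview.md) is the central command of the GreptimeDB cluster. In a typical deployment, at least three nodes are required to set up a reliable _Metasrv_ mini-cluster. _Metasrv_ manages database and table information, including how data is distributed across the cluster and where requests are forwarded. It also monitors the availability and performance of each `Datanode` to keep the routing table valid and up to date.

- [**Frontend**](/contributor-guide/frontend/overview.md) is a stateless component that can scale in and out as needed. It accepts and authenticates requests, translates various protocols into the GreptimeDB cluster's internal protocol, and forwards requests to the appropriate _Datanode_ based on information from _Metasrv_.
- [**Frontend**](/contributor-guide/frontend/overview.md) is a stateless component that can scale in and out as needed. It accepts and authenticates requests, translates various protocols into the GreptimeDB cluster's internal gRPC protocol, and forwards requests to the appropriate _Datanode_ based on the table sharding and routing information in _Metasrv_.

- [**Datanode**](/contributor-guide/datanode/overview.md) stores the `region` data of tables in the GreptimeDB cluster, and accepts and executes read and write requests sent from _Frontend_.
- [**Datanode**](/contributor-guide/datanode/overview.md) stores the `region` data of tables in the GreptimeDB cluster, accepts and executes read and write requests sent from _Frontend_, processes queries and writes, and returns the corresponding results.

Thanks to the flexible architecture design, the three components above can be packaged together to support a single-machine mode for local deployment, which we call standalone mode.
Thanks to the flexible architecture design, the three components above can either be deployed as a distributed cluster or be packaged together into a single binary to support a single-machine mode for local deployment, which we call standalone mode.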