chore: update logs data format (GreptimeTeam#1066)
Co-authored-by: Yiran <cuiyiran3@gmail.com>
shuiyisong and nicecui committed Jul 18, 2024
1 parent 621f102 commit ecabf89
Showing 12 changed files with 376 additions and 74 deletions.
26 changes: 13 additions & 13 deletions docs/nightly/en/user-guide/logs/pipeline-config.md
@@ -1,15 +1,15 @@
# Pipeline Configuration

Pipeline is a mechanism in GreptimeDB for transforming log data. It consists of a unique name and a set of configuration rules that define how log data is formatted, split, and transformed. Currently, we support JSON (`application/json`) and plain text (`text/plain`) formats as input for log data.
Pipeline is a mechanism in GreptimeDB for parsing and transforming log data. It consists of a unique name and a set of configuration rules that define how log data is formatted, split, and transformed. Currently, we support JSON (`application/json`) and plain text (`text/plain`) formats as input for log data.

These configurations are provided in YAML format, allowing the Pipeline to process data during the log writing process according to the defined rules and store the processed data in the database for subsequent structured queries.

## The overall structure
## Overall structure

Pipeline consists of two parts: Processors and Transform, both of which are in array format. A Pipeline configuration can contain multiple Processors and multiple Transforms. The data type described by Transform determines the table structure when storing log data in the database.

- Processors are used for preprocessing log data, such as parsing time fields and replacing fields.
- Transform is used for converting log data formats, such as converting string types to numeric types.
- Transform is used for converting data formats, such as converting string types to numeric types.

Here is an example of a simple configuration that includes Processors and Transform:
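A minimal sketch of such a configuration (the field names, formats, and types below are illustrative, not taken from the collapsed example) might look like this:

```yaml
processors:
  - date:
      fields:
        - time
      formats:
        - "%Y-%m-%d %H:%M:%S%.3f"
      ignore_missing: true

transform:
  - fields:
      - id
    type: int32
  - fields:
      - logger
    type: string
    index: tag
  - fields:
      - time
    type: time
    index: timestamp
```

Each `transform` entry maps one or more fields to a column type, and the `index` key marks tag and timestamp columns in the resulting table.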

@@ -40,15 +40,15 @@ The Processor is used for preprocessing log data, and its configuration is locat

We currently provide the following built-in Processors:

- `date`: Used to parse formatted time string fields, such as `2024-07-12T16:18:53.048`.
- `epoch`: Used to parse numeric timestamp fields, such as `1720772378893`.
- `dissect`: Used to split log data fields.
- `gsub`: Used to replace log data fields.
- `join`: Used to merge array-type fields in logs.
- `letter`: Used to convert log data fields to letters.
- `regex`: Used to perform regular expression matching on log data fields.
- `urlencoding`: Used to perform URL encoding/decoding on log data fields.
- `csv`: Used to parse CSV data fields in logs.
- `date`: parses formatted time string fields, such as `2024-07-12T16:18:53.048`.
- `epoch`: parses numeric timestamp fields, such as `1720772378893`.
- `dissect`: splits log data fields.
- `gsub`: replaces log data fields.
- `join`: merges array-type fields in logs.
- `letter`: converts log data fields to letters.
- `regex`: performs regular expression matching on log data fields.
- `urlencoding`: performs URL encoding/decoding on log data fields.
- `csv`: parses CSV data fields in logs.

### `date`

@@ -68,7 +68,7 @@ processors:
In the above example, the configuration of the `date` processor includes the following fields:

- `fields`: A list of time field names to be parsed.
- `formats`: Time format strings, supporting multiple format strings. Parsing is attempted in the order provided until successful.
- `formats`: Time format strings, supporting multiple format strings. Parsing is attempted in the order provided until successful. Refer to [the chrono strftime documentation](https://docs.rs/chrono/latest/chrono/format/strftime/index.html) for the formatting syntax.
- `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.
- `timezone`: Time zone. Use the time zone identifiers from the [tz_database](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) to specify the time zone. Defaults to `UTC`.
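
Putting these options together, a `date` processor entry might look like the following sketch (the field name, formats, and timezone are illustrative):

```yaml
processors:
  - date:
      fields:
        - time
      formats:
        # tried in the order given until one parses successfully
        - "%Y-%m-%dT%H:%M:%S%.3f"
        - "%Y-%m-%d %H:%M:%S%.3f"
      ignore_missing: true
      timezone: 'Asia/Shanghai'
```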

6 changes: 3 additions & 3 deletions docs/nightly/en/user-guide/logs/quick-start.md
@@ -101,7 +101,7 @@ curl -X "POST" "http://localhost:4000/v1/events/pipelines/nginx_pipeline" -F "fi

After successfully executing this command, a pipeline named `nginx_pipeline` will be created, and the result will be returned as:

```shell
```json
{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"}.
```

@@ -126,7 +126,7 @@ curl -X "POST" "http://localhost:4000/v1/events/logs?db=public&table=pipeline_lo

You will see the following output if the command is successful:

```shell
```json
{"output":[{"affectedrows":4}],"execution_time_ms":79}
```

@@ -182,7 +182,7 @@ Of course, if you need keyword searching within large text blocks, you must use

## Query logs

The `pipeline_logs` as the example to query logs.
We use the `pipeline_logs` table as an example to query logs.

### Query logs by tags

81 changes: 78 additions & 3 deletions docs/nightly/en/user-guide/logs/write-logs.md
@@ -14,7 +14,7 @@ curl -X "POST" "http://localhost:4000/v1/events/logs?db=<db-name>&table=<table-n
-d "$<log-items>"
```

## Query parameters
## Request parameters

This interface accepts the following parameters:

@@ -23,9 +23,84 @@ This interface accepts the following parameters:
- `pipeline_name`: The name of the [pipeline](./pipeline-config.md).
- `version`: The version of the pipeline. Optional; the latest version is used by default.
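
For example, a request that fills in these parameters could look like the following sketch (the database, table, and pipeline names are placeholders, and the one-line body is only for illustration):

```shell
curl -X "POST" \
  "http://localhost:4000/v1/events/logs?db=public&table=demo_logs&pipeline_name=demo_pipeline" \
  -H "Content-Type: application/json" \
  -d '[{"message":"hello world"}]'
```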

## Body data format
## `Content-Type` and body format

The request body supports NDJSON and JSON Array formats, where each JSON object represents a log entry.
GreptimeDB uses the `Content-Type` header to decide how to decode the payload body. Currently, the following two formats are supported:
- `application/json`: this includes normal JSON format and NDJSON format.
- `text/plain`: multiple log lines separated by line breaks.

### `application/json` format

Here is an example of a JSON format body payload:

```JSON
[
{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""},
{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""},
{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""},
{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""}
]
```

Note that the whole JSON is an array of log lines. Each JSON object represents one line to be processed by the Pipeline engine.

The name of the key in the JSON objects, which is `message` here, is used as the field name in Pipeline processors. For example:

```yaml
processors:
  - dissect:
      fields:
        # `message` is the key in JSON object
        - message
      patterns:
        - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"'
      ignore_missing: true

# rest of the file is ignored
```

We can also rewrite the payload in NDJSON format as follows:

```JSON
{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""}
{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""}
{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""}
{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""}
```

Note that the outer array is eliminated, and lines are separated by line breaks instead of `,`.

### `text/plain` format

Plain-text logs are widely used throughout the ecosystem. GreptimeDB also supports the `text/plain` format as log data input, making it possible to ingest logs directly from log producers.

The equivalent body payload of the previous example is as follows:

```plain
127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"
```

To send a log ingestion request to GreptimeDB, simply set the `Content-Type` header to `text/plain`, and you are good to go!
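
For instance, a sketch of such a request might be (the table and pipeline names are placeholders; the pipeline referenced here would need to operate on the `line` field, as described below):

```shell
curl -X "POST" \
  "http://localhost:4000/v1/events/logs?db=public&table=pipeline_logs&pipeline_name=access_log_pipeline" \
  -H "Content-Type: text/plain" \
  -d '127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0"'
```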

Please note that, unlike the JSON format, where the input data already has key names that serve as field names for Pipeline processors, the `text/plain` format passes the whole line as input to the Pipeline engine. In this case, `line` is used as the field name to refer to the input line, for example:

```yaml
processors:
  - dissect:
      fields:
        # use `line` as the field name
        - line
      patterns:
        - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"'
      ignore_missing: true

# rest of the file is ignored
```

It is recommended to use the `dissect` or `regex` processor to split the input line into fields first, and then process the fields accordingly.
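
For example, after the `dissect` pattern above has produced fields such as `ip_address`, `timestamp`, and `status_code`, a transform section along these lines (a sketch; the exact types and indexes are choices, not requirements) could map them to typed columns:

```yaml
# a `date` processor would typically parse `timestamp` first,
# e.g. with the format "%d/%b/%Y:%H:%M:%S %z"
transform:
  - fields:
      - ip_address
      - http_method
    type: string
    index: tag
  - fields:
      - status_code
      - response_size
    type: int32
  - fields:
      - timestamp
    type: time
    index: timestamp
```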

## Example

24 changes: 12 additions & 12 deletions docs/nightly/zh/user-guide/logs/pipeline-config.md
@@ -1,6 +1,6 @@
# Pipeline Configuration

Pipeline is a mechanism in GreptimeDB for transforming log data. It consists of a unique name and a set of configuration rules that define how log data is formatted, split, and transformed. Currently, we support JSON (`application/json`) and plain text (`text/plain`) formats as input for log data.
Pipeline is a mechanism in GreptimeDB for parsing and transforming log data. It consists of a unique name and a set of configuration rules that define how log data is formatted, split, and transformed. Currently, we support JSON (`application/json`) and plain text (`text/plain`) formats as input for log data.

These configurations are provided in YAML format, allowing the Pipeline to process data during the log writing process according to the defined rules and store the processed data in the database for subsequent structured queries.

@@ -9,7 +9,7 @@ Pipeline is a mechanism in GreptimeDB for transforming log data, consisting of a
Pipeline consists of two parts: Processors and Transform, both of which are in array format. A Pipeline configuration can contain multiple Processors and multiple Transforms. The data type described by Transform determines the table structure when storing log data in the database.

- Processors are used for preprocessing log data, such as parsing time fields and replacing fields.
- Transform is used for converting log data formats, such as converting string types to numeric types.
- Transform is used for converting data formats, such as converting string types to numeric types.

Here is an example of a simple configuration that includes Processors and Transform:

@@ -42,15 +42,15 @@ A Processor consists of a name and multiple configurations, and different types of Processors

We currently provide the following built-in Processors:

- `date`: Used to parse formatted time string fields, such as `2024-07-12T16:18:53.048`.
- `epoch`: Used to parse numeric timestamp fields, such as `1720772378893`.
- `dissect`: Used to split log data fields.
- `gsub`: Used to replace log data fields.
- `join`: Used to merge array-type fields in logs.
- `letter`: Used to convert log data fields to letters.
- `regex`: Used to perform regular expression matching on log data fields.
- `urlencoding`: Used to perform URL encoding/decoding on log data fields.
- `csv`: Used to parse CSV data fields in logs.
- `date`: parses formatted time string fields, such as `2024-07-12T16:18:53.048`.
- `epoch`: parses numeric timestamp fields, such as `1720772378893`.
- `dissect`: splits log data fields.
- `gsub`: replaces log data fields.
- `join`: merges array-type fields in logs.
- `letter`: converts log data fields to letters.
- `regex`: performs regular expression matching on log data fields.
- `urlencoding`: performs URL encoding/decoding on log data fields.
- `csv`: parses CSV data fields in logs.

### `date`

@@ -70,7 +70,7 @@ processors:
As shown above, the configuration of the `date` processor includes the following fields:

- `fields`: A list of time field names to be parsed.
- `formats`: Time format strings, supporting multiple format strings. Parsing is attempted in the order provided until successful.
- `formats`: Time format strings, supporting multiple format strings. Parsing is attempted in the order provided until successful. Refer to [the chrono strftime documentation](https://docs.rs/chrono/latest/chrono/format/strftime/index.html) for the formatting syntax.
- `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.
- `timezone`: Time zone. Use the time zone identifiers from the [tz_database](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) to specify the time zone. Defaults to `UTC`.

6 changes: 3 additions & 3 deletions docs/nightly/zh/user-guide/logs/quick-start.md
@@ -99,7 +99,7 @@ curl -X "POST" "http://localhost:4000/v1/events/pipelines/nginx_pipeline" -F "fi

After successfully executing this command, a pipeline named `nginx_pipeline` will be created, and the result will be returned as:

```shell
```json
{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"}.
```

@@ -124,7 +124,7 @@ curl -X "POST" "http://localhost:4000/v1/events/logs?db=public&table=pipeline_lo

You will see the following output if the command is successful:

```shell
```json
{"output":[{"affectedrows":4}],"execution_time_ms":79}
```

@@ -179,7 +179,7 @@ DESC pipeline_logs;

## Query logs

Take `pipeline_logs` as an example to query logs.
We use the `pipeline_logs` table as an example to query logs.

### Query logs by tags

82 changes: 79 additions & 3 deletions docs/nightly/zh/user-guide/logs/write-logs.md
@@ -15,7 +15,7 @@ curl -X "POST" "http://localhost:4000/v1/events/logs?db=<db-name>&table=<table-n
```


## Query parameters
## Request parameters

This interface accepts the following parameters:

@@ -24,10 +24,86 @@ curl -X "POST" "http://localhost:4000/v1/events/logs?db=<db-name>&table=<table-n
- `pipeline_name`: The name of the [pipeline](./pipeline-config.md).
- `version`: The version of the pipeline. Optional; the latest version is used by default.

## Body data format
## `Content-Type` and body format

The request body supports NDJSON and JSON Array formats, where each JSON object represents a log entry.
GreptimeDB uses the `Content-Type` header to decide how to decode the request body. Currently, the following two formats are supported:
- `application/json`: includes normal JSON format and NDJSON format.
- `text/plain`: multiple log lines separated by line breaks.

### `application/json` format

Here is an example of a JSON format body payload:

```JSON
[
{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""},
{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""},
{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""},
{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""}
]
```

Note that the whole JSON is an array of log lines. Each JSON object represents one line to be processed by the Pipeline engine.

The name of the key in the JSON objects, which is `message` here, is used as the field name in Pipeline processors. For example:

```yaml
processors:
  - dissect:
      fields:
        # `message` is the key in JSON object
        - message
      patterns:
        - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"'
      ignore_missing: true

# rest of the pipeline file is omitted here
```

We can also rewrite the payload in NDJSON format as follows:

```JSON
{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""}
{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""}
{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""}
{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""}
```

Note that the outer array brackets are eliminated; each JSON object is now separated by line breaks instead of `,`.

### `text/plain` format

Plain-text logs are widely used throughout the ecosystem. GreptimeDB also supports the `text/plain` format as log data input, making it possible to ingest logs directly from log producers.

The equivalent body payload of the previous example is as follows:

```plain
127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"
```

To send a plain-text request to GreptimeDB, simply set the `Content-Type` header to `text/plain`.

Please note that, unlike the JSON format, where the key names can be recognized and processed as field names by Pipeline processors, the `text/plain` format passes the whole line as input to the Pipeline engine. In this case, `line` is used to refer to the whole input line, for example:

```yaml
processors:
  - dissect:
      fields:
        # use `line` as the field name
        - line
      patterns:
        - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"'
      ignore_missing: true

# rest of the pipeline file is omitted here
```

For `text/plain` input, it is recommended to first use the `dissect` or `regex` processor to split the whole line into fields, and then process the fields accordingly.

## Example
