chore: add json path doc #1279

Merged
merged 5 commits into from
Nov 13, 2024
Changes from 3 commits
140 changes: 138 additions & 2 deletions docs/user-guide/logs/pipeline-config.md
@@ -50,6 +50,23 @@ We currently provide the following built-in Processors:
- `urlencoding`: performs URL encoding/decoding on log data fields.
- `csv`: parses CSV data fields in logs.

Most processors accept a `field` or `fields` option that specifies which fields to process. Most processors overwrite the original field after processing; to leave the corresponding field in the original data untouched, write the result to a different field instead.

When a field name contains `,`, the field is renamed: the part before the comma is the source field and the part after it is the target. For example, `reqTimeSec, req_time_sec` renames the `reqTimeSec` field to `req_time_sec`; the processed value is written to the `req_time_sec` key in the intermediate state, and the original `reqTimeSec` field is unaffected. Processors that do not support field renaming ignore the renamed name, and this is noted in their documentation.

For example:

```yaml
processors:
- letter:
fields:
- message, message_upper
method: upper
ignore_missing: true
```

The `message` field will be converted to uppercase and stored in the `message_upper` field.
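The renaming convention above can be modeled in a few lines of Python. This is a hypothetical sketch of the behavior (`parse_field_spec` and `apply_letter_upper` are illustrative names, not part of the pipeline engine):

```python
# Hypothetical sketch of the "source, target" field-spec convention used by
# processors such as `letter` with `method: upper`.
def parse_field_spec(spec: str) -> tuple[str, str]:
    """Split 'reqTimeSec, req_time_sec' into (source, target).

    With no comma, the source field is overwritten in place."""
    parts = [p.strip() for p in spec.split(",", 1)]
    source = parts[0]
    target = parts[1] if len(parts) > 1 else source
    return source, target

def apply_letter_upper(record: dict, spec: str) -> dict:
    source, target = parse_field_spec(spec)
    out = dict(record)  # the original record is left untouched
    if source in out:
        out[target] = out[source].upper()
    return out

print(apply_letter_upper({"message": "hello"}, "message, message_upper"))
# → {'message': 'hello', 'message_upper': 'HELLO'}
```

Note that when no target is given, `parse_field_spec("message")` yields `("message", "message")`, which reproduces the default overwrite-in-place behavior.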

### `date`

The `date` processor is used to parse time fields. Here's an example configuration:
@@ -108,7 +125,7 @@ processors:

In the above example, the configuration of the `dissect` processor includes the following fields:

- `fields`: A list of field names to be split.
- `fields`: A list of field names to be split. Field renaming is not supported.
- `patterns`: The dissect pattern for splitting.
- `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.
- `append_separator`: Specifies the separator for concatenating multiple fields with the same field name. Defaults to an empty string. See the `+` modifier below.
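The append behavior can be illustrated with a small Python sketch. This is a hypothetical model, not the dissect implementation: it assumes the pattern matcher has already produced `(field_name, value)` pairs, and shows only how values captured under the same appended field name are joined with `append_separator`:

```python
from collections import defaultdict

def dissect_append(tokens: list[tuple[str, str]], append_separator: str = "") -> dict:
    """tokens: (field_name, captured_value) pairs in match order.

    Values captured under the same field name (the `+` modifier) are
    concatenated with append_separator."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for name, value in tokens:
        buckets[name].append(value)
    return {name: append_separator.join(vals) for name, vals in buckets.items()}

# e.g. a pattern like "%{+ts} %{+ts} %{level}" matched against
# "2024-11-13 10:23:45 INFO" would capture these tokens:
print(dissect_append(
    [("ts", "2024-11-13"), ("ts", "10:23:45"), ("level", "INFO")],
    append_separator=" ",
))
# → {'ts': '2024-11-13 10:23:45', 'level': 'INFO'}
```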
@@ -260,7 +277,7 @@ processors:

In the above example, the configuration of the `regex` processor includes the following fields:

- `fields`: A list of field names to be matched.
- `fields`: A list of field names to be matched. If a field is renamed, the renamed field name is combined with the named capture groups in `pattern` to generate the result field names.
- `pattern`: The regular expression pattern to match. Named capture groups are required to extract corresponding data from the respective field.
- `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.
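A rough Python sketch of how a renamed field could combine with named capture groups. This is an assumption for illustration only: it takes the result key to be `<target>_<group>`, and the field names (`status_line`, `http`) and pattern are made up, not taken from the pipeline engine:

```python
import re

# Hypothetical model: the renamed target field name is joined with each
# named capture group to form the result keys (assumed "<target>_<group>").
def regex_process(record: dict, field_spec: str, pattern: str) -> dict:
    source, _, target = (s.strip() for s in field_spec.partition(","))
    target = target or source
    out = dict(record)
    m = re.search(pattern, record.get(source, ""))
    if m:
        for group, value in m.groupdict().items():
            out[f"{target}_{group}"] = value
    return out

rec = regex_process(
    {"status_line": "HTTP/1.1 200 OK"},
    "status_line, http",                 # rename status_line -> http
    r"(?P<code>\d{3}) (?P<reason>\w+)",  # named capture groups
)
print(rec)
# → {'status_line': 'HTTP/1.1 200 OK', 'http_code': '200', 'http_reason': 'OK'}
```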

@@ -326,6 +343,125 @@ In the above example, the configuration of the `csv` processor includes the following fields:
- `trim`: Whether to trim whitespace. Defaults to `false`.
- `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.

### `json_path` (experimental)

Note: The `json_path` processor is currently in the experimental stage and may be subject to change.

The `json_path` processor is used to extract fields from JSON data. Here's an example configuration:

```yaml
processors:
- json_path:
fields:
- complex_object
json_path: "$.shop.orders[?(@.active)].id"
ignore_missing: true
result_index: 1
```

In the above example, the configuration of the `json_path` processor includes the following fields:

- `fields`: A list of field names to be extracted.
- `json_path`: The JSON path to extract.
- `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.
- `result_index`: The index into the extracted array to use as the result value. The processor's extracted value is an array containing all values matched by the path; by default the whole array is kept. If an index is specified, the value at that index is used as the final result.

#### JSON path syntax

The JSON path syntax is based on the [jsonpath-rust](https://github.com/besok/jsonpath-rust) library.

At this stage, we recommend using only simple field extraction operations, mainly to lift nested fields to the top level.
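To make the filter-path semantics concrete, here is a plain-Python equivalent of `$.shop.orders[?(@.active)].id` followed by `result_index` selection. This is a sketch of what the processor computes, not the jsonpath-rust implementation, and `extract_active_order_ids` is an illustrative name:

```python
# Plain-Python equivalent of "$.shop.orders[?(@.active)].id":
# collect the id of every order whose "active" member is truthy,
# then optionally pick one element via result_index.
def extract_active_order_ids(doc, result_index=None):
    matched = [o["id"] for o in doc["shop"]["orders"] if o.get("active")]
    return matched if result_index is None else matched[result_index]

doc = {"shop": {"orders": [
    {"id": 1, "active": True},
    {"id": 2},
    {"id": 3},
    {"id": 4, "active": True},
]}}

print(extract_active_order_ids(doc))                  # all matches: [1, 4]
print(extract_active_order_ids(doc, result_index=1))  # second match: 4
```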

#### `json_path` example

For example, given the following log data:

```json
{
"product_object": {
"hello": "world"
},
"product_array": [
"hello",
"world"
],
"complex_object": {
"shop": {
"orders": [
{
"id": 1,
"active": true
},
{
"id": 2
},
{
"id": 3
},
{
"id": 4,
"active": true
}
]
}
}
}
```

Using the following configuration:

```yaml
processors:
- json_path:
fields:
- product_object, object_target
json_path: "$.hello"
result_index: 0
- json_path:
fields:
- product_array, array_target
json_path: "$.[1]"
result_index: 0
- json_path:
fields:
- complex_object, complex_target_1
json_path: "$.shop.orders[?(@.active)].id"
- json_path:
fields:
- complex_target_1, complex_target_2
json_path: "$.[1]"
result_index: 0
- json_path:
fields:
- complex_object, complex_target_3
json_path: "$.shop.orders[?(@.active)].id"
result_index: 1
transform:
- fields:
- object_target
- array_target
type: string
- fields:
- complex_target_3
- complex_target_2
type: uint32
- fields:
- complex_target_1
type: json
```

The result will be:

```json
{
"object_target": "world",
"array_target": "world",
"complex_target_3": 4,
"complex_target_2": 4,
"complex_target_1": [1, 4]
}
```


## Transform

@@ -52,6 +52,23 @@ A Processor consists of a name and multiple configurations; different types of Processors have different configurations.
- `urlencoding`: performs URL encoding/decoding on log data fields.
- `csv`: parses CSV data fields in logs.

Most processors accept a `field` or `fields` option that specifies which fields to process. Most processors overwrite the original field after processing; to leave the corresponding field in the original data untouched, write the result to a different field instead.

When a field name contains `,`, the field is renamed. For example, `reqTimeSec, req_time_sec` renames the `reqTimeSec` field to `req_time_sec`; the processed value is written to the `req_time_sec` key in the intermediate state, and the original `reqTimeSec` field is unaffected. Processors that do not support field renaming ignore the renamed name, and this is noted in their documentation.

For example:

```yaml
processors:
- letter:
fields:
- message, message_upper
method: upper
ignore_missing: true
```

The `message` field will be converted to uppercase and stored in the `message_upper` field.

### `date`

The `date` processor is used to parse time fields. Here's an example configuration:
@@ -110,7 +127,7 @@ processors:

As shown above, the configuration of the `dissect` processor includes the following fields:

- `fields`: A list of field names to be split.
- `fields`: A list of field names to be split. Field renaming is not supported.
- `patterns`: The dissect pattern for splitting.
- `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.
- `append_separator`: Specifies the separator for concatenating multiple appended fields together. Defaults to an empty string.
@@ -262,7 +279,7 @@ processors:

As shown above, the configuration of the `regex` processor includes the following fields:

- `fields`: A list of field names to be matched.
- `fields`: A list of field names to be matched. If a field is renamed, the renamed field name is combined with the named capture groups in `pattern` to generate the result field names.
- `pattern`: The regular expression pattern to match. Named capture groups are required to extract corresponding data from the respective field.
- `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.

@@ -329,6 +346,124 @@ processors:
- `trim`: Whether to trim whitespace. Defaults to `false`.
- `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.

### `json_path` (experimental)

Note: The `json_path` processor is currently in the experimental stage and may be subject to change.

The `json_path` processor is used to extract fields from JSON data. Here's an example configuration:

```yaml
processors:
- json_path:
fields:
- complex_object
json_path: "$.shop.orders[?(@.active)].id"
ignore_missing: true
result_index: 1
```

In the above example, the configuration of the `json_path` processor includes the following fields:

- `fields`: A list of field names to be extracted.
- `json_path`: The JSON path to extract.
- `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.
- `result_index`: The index into the extracted array to use as the result value. The processor's extracted value is an array containing all values matched by the path; by default the whole array is kept. If an index is specified, the value at that index is used as the final result.

#### JSON path syntax

The JSON path syntax is based on the [jsonpath-rust](https://github.com/besok/jsonpath-rust) library.

At this stage, we recommend using only simple field extraction operations, mainly to lift nested fields to the top level.

#### `json_path` example

For example, given the following log data:

```json
{
"product_object": {
"hello": "world"
},
"product_array": [
"hello",
"world"
],
"complex_object": {
"shop": {
"orders": [
{
"id": 1,
"active": true
},
{
"id": 2
},
{
"id": 3
},
{
"id": 4,
"active": true
}
]
}
}
}
```

Using the following configuration:

```yaml
processors:
- json_path:
fields:
- product_object, object_target
json_path: "$.hello"
result_index: 0
- json_path:
fields:
- product_array, array_target
json_path: "$.[1]"
result_index: 0
- json_path:
fields:
- complex_object, complex_target_1
json_path: "$.shop.orders[?(@.active)].id"
- json_path:
fields:
- complex_target_1, complex_target_2
json_path: "$.[1]"
result_index: 0
- json_path:
fields:
- complex_object, complex_target_3
json_path: "$.shop.orders[?(@.active)].id"
result_index: 1
transform:
- fields:
- object_target
- array_target
type: string
- fields:
- complex_target_3
- complex_target_2
type: uint32
- fields:
- complex_target_1
type: json
```

The result will be:

```json
{
"object_target": "world",
"array_target": "world",
"complex_target_3": 4,
"complex_target_2": 4,
"complex_target_1": [1, 4]
}
```

## Transform
