Merged
60 changes: 34 additions & 26 deletions _data-prepper/pipelines/configuration/sources/s3.md
@@ -83,9 +83,9 @@

## Configuration

You can use the following options to configure the `s3` source.
You can use the following parameters to configure the `s3` source.

Option | Required | Type | Description
Parameter | Required | Type | Description
:--- | :--- | :--- | :---
`notification_type` | Yes | String | Must be `sqs`.
`notification_source` | No | String | Determines how notifications are received by SQS. Must be `s3` or `eventbridge`. `s3` represents notifications that are directly sent from Amazon S3 to Amazon SQS or fanout notifications from Amazon S3 to Amazon Simple Notification Service (Amazon SNS) to Amazon SQS. `eventbridge` represents notifications from [Amazon EventBridge](https://aws.amazon.com/eventbridge/) and [Amazon Security Lake](https://aws.amazon.com/security-lake/). Default is `s3`.
@@ -112,7 +112,7 @@

The following parameters allow you to configure usage for Amazon SQS in the `s3` source plugin.

Option | Required | Type | Description
Parameter | Required | Type | Description
:--- | :--- | :--- | :---
`queue_url` | Yes | String | The URL of the Amazon SQS queue from which messages are received.
`maximum_messages` | No | Integer | The maximum number of messages to receive from the Amazon SQS queue in any single request. Default is `10`.
@@ -125,7 +125,7 @@

## aws

Option | Required | Type | Description
Parameter | Required | Type | Description
:--- | :--- | :--- | :---
`region` | No | String | The AWS Region to use for credentials. Defaults to [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html).
`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon SQS and Amazon S3. Defaults to `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).
@@ -140,22 +140,30 @@

The `newline` codec parses each line as a single log event. This is ideal for most application logs because each event occupies a single line. It is also suitable for S3 objects that have individual JSON objects on each line, in which case you can pair it with the [parse_json]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/parse-json/) processor to parse each line.

Use the following options to configure the `newline` codec.
Use the following parameters to configure the `newline` codec.

Option | Required | Type | Description
Parameter | Required | Type | Description
:--- | :--- |:--------| :---
`skip_lines` | No | Integer | The number of lines to skip before creating events. You can use this configuration to skip common header rows. Default is `0`.
`header_destination` | No | String | A key value to assign to the header line of the S3 object. If this option is specified, then each event will contain a `header_destination` field.
`header_destination` | No | String | A key value to assign to the header line of the S3 object. If this parameter is specified, then each event will contain a `header_destination` field.
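
As a minimal sketch of how the two `newline` parameters above might appear in a pipeline, the following fragment nests them under the `codec` of an `s3` source. The values `1` and `header` are illustrative assumptions, and the surrounding pipeline definition is omitted.

```yaml
source:
  s3:
    codec:
      newline:
        # Skip one leading line before creating events (illustrative value).
        skip_lines: 1
        # Hypothetical key name under which the header line is stored.
        header_destination: header
```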

### json codec

The `json` codec parses each S3 object as a single JSON object from a JSON array and then creates a Data Prepper log event for each object in the array.
The `json` codec parses each S3 object as a single JSON object from a JSON array and then creates a Data Prepper log event for each object in the array.

Use the following parameters to configure the `json` codec.

Parameter | Required | Type | Description
:--- | :--- |:--------| :---
`key_name` | No | String | The name of the input field from which to extract the JSON array and create events.
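
For illustration, assume an S3 object whose JSON array sits under a top-level field, for example `{"logEvents": [{...}, {...}]}`. The field name `logEvents` is hypothetical. A `json` codec fragment for that case might look like the following sketch.

```yaml
source:
  s3:
    codec:
      json:
        # Hypothetical field name that holds the JSON array in each object.
        key_name: logEvents
```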

### csv codec

The `csv` codec parses objects in comma-separated value (CSV) format, with each row producing a Data Prepper log event. Use the following options to configure the `csv` codec.
The `csv` codec parses objects in comma-separated value (CSV) format, with each row producing a Data Prepper log event.

Use the following parameters to configure the `csv` codec.

Option | Required | Type | Description
Parameter | Required | Type | Description
:--- |:---------|:------------| :---
`delimiter` | Yes | String | The delimiter separating columns. Default is `,`.
`quote_character` | Yes | String | The character used as a text qualifier for CSV data. Default is `"`.
@@ -164,9 +172,9 @@

## Using `s3_select` with the `s3` source<a name="s3_select"></a>

When configuring `s3_select` to parse Amazon S3 objects, use the following options:
When configuring `s3_select` to parse S3 objects, use the following parameters.

Option | Required | Type | Description
Parameter | Required | Type | Description
:--- |:-----------------------|:------------| :---
`expression` | Yes, when using `s3_select` | String | The expression used to query the object. Maps directly to the [expression](https://docs.aws.amazon.com/AmazonS3/latest/API/API_SelectObjectContent.html#AmazonS3-SelectObjectContent-request-Expression) property.
`expression_type` | No | String | The type of the provided expression. Default value is `SQL`. Maps directly to the [ExpressionType](https://docs.aws.amazon.com/AmazonS3/latest/API/API_SelectObjectContent.html#AmazonS3-SelectObjectContent-request-ExpressionType).
@@ -177,28 +185,28 @@

### csv<a name="s3_select_csv"></a>

Use the following options in conjunction with the `csv` configuration for `s3_select` to determine how your parsed CSV file should be formatted.
Use the following parameters in conjunction with the `csv` configuration for `s3_select` to determine how your parsed CSV file should be formatted.

These options map directly to options available in the S3 Select [CSVInput](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CSVInput.html) data type.
These parameters map directly to inputs available in the S3 Select [CSVInput](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CSVInput.html) data type.

Check failure on line 190 in _data-prepper/pipelines/configuration/sources/s3.md (GitHub Actions / style-job)

[vale] reported by reviewdog 🐶 [OpenSearch.Spelling] Error: CSVInput. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks. (line 190, column 69, severity: ERROR)

Option | Required | Type | Description
Parameter | Required | Type | Description
:--- |:---------|:------------| :---
`file_header_info` | No | String | Describes the first line of input. Maps directly to the [FileHeaderInfo](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CSVInput.html#AmazonS3-Type-CSVInput-FileHeaderInfo) property.
`quote_escape` | No | String | A single character used for escaping the quotation mark character inside an already escaped value. Maps directly to the [QuoteEscapeCharacter](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CSVInput.html#AmazonS3-Type-CSVInput-QuoteEscapeCharacter) property.
`comments` | No | String | A single character used to indicate that a row should be ignored when the character is present at the start of that row. Maps directly to the [Comments](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CSVInput.html#AmazonS3-Type-CSVInput-Comments) property.

#### json<a name="s3_select_json"></a>

Use the following option in conjunction with `json` for `s3_select` to determine how S3 Select processes the JSON file.
Use the following parameters in conjunction with `json` for `s3_select` to determine how S3 Select processes the JSON file.

Option | Required | Type | Description
Parameter | Required | Type | Description
:--- | :--- | :--- | :---
`type` | No | String | The type of JSON array. May be either `DOCUMENT` or `LINES`. Maps directly to the [Type](https://docs.aws.amazon.com/AmazonS3/latest/API/API_JSONInput.html#AmazonS3-Type-JSONInput-Type) property.
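
The following fragment is a sketch of how the `json` setting might be combined with the `expression` and `expression_type` parameters listed earlier. The query string is an illustrative assumption, and other `s3_select` parameters that are not visible in this diff may also be required.

```yaml
s3_select:
  # Illustrative query; replace with your own expression.
  expression: "select * from s3object s"
  expression_type: SQL
  json:
    # One of DOCUMENT or LINES, per the table above.
    type: LINES
```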

## Using `scan` with the `s3` source<a name="scan"></a>
The following parameters allow you to scan S3 objects. All options can be configured at the bucket level.
The following parameters allow you to scan S3 objects. All parameters can be configured at the bucket level.

Option | Required | Type | Description
Parameter | Required | Type | Description
:--- | :--- | :--- | :---
`start_time` | No | String | The time from which to start scanning. Only objects modified after the given `start_time` are scanned. This should follow [ISO LocalDateTime](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_LOCAL_DATE_TIME) format, for example, `2023-01-23T10:00:00`. If `end_time` is configured along with `start_time`, all objects after `start_time` and before `end_time` will be processed. `start_time` and `range` cannot be used together.
`end_time` | No | String | The time at which to stop scanning. Objects modified after the given `end_time` are not scanned. This should follow [ISO LocalDateTime](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_LOCAL_DATE_TIME) format, for example, `2023-01-23T10:00:00`. If `start_time` is configured along with `end_time`, all objects after `start_time` and before `end_time` will be processed. `end_time` and `range` cannot be used together.
@@ -210,13 +218,13 @@
### scan bucket
<!-- vale on -->

Option | Required | Type | Description
Parameter | Required | Type | Description
:--- | :--- |:-----| :---
`bucket` | Yes | Map | Provides options for each bucket.
`bucket` | Yes | Map | Provides parameters for each bucket.

You can configure the following options in the `bucket` setting map.
You can configure the following parameters in the `bucket` setting map.

Option | Required | Type | Description
Parameter | Required | Type | Description
:--- | :--- | :--- | :---
`name` | Yes | String | The string representing the S3 bucket name to scan.
`filter` | No | [Filter](#filter) | Provides the filter configuration.
@@ -226,16 +234,16 @@

### filter

Use the following options inside the `filter` configuration.
Use the following parameters in the `filter` configuration.

Option | Required | Type | Description
Parameter | Required | Type | Description
:--- | :--- | :--- | :---
`include_prefix` | No | List | A list of S3 key prefix strings included in the scan. By default, all the objects in a bucket are included.
`exclude_suffix` | No | List | A list of S3 key suffix strings excluded from the scan. By default, no objects in a bucket are excluded.
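
As a sketch of the `filter` parameters in context, the following fragment shows a single `bucket` map that scans only keys under one prefix and skips temporary files. The bucket name, prefix, and suffix are hypothetical, and the placement of this map inside the `scan` configuration is not shown in this diff.

```yaml
bucket:
  # Hypothetical bucket name.
  name: my-example-bucket
  filter:
    # Only keys that start with one of these prefixes are scanned.
    include_prefix:
      - logs/2023/
    # Keys that end with one of these suffixes are skipped.
    exclude_suffix:
      - .tmp
```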

### scheduling

Option | Required | Type | Description
Parameter | Required | Type | Description
:--- | :--- | :--- | :---
`interval` | Yes | String | Indicates the minimum interval between scans. The next scan starts only after the interval duration has elapsed since the end of the previous scan and all objects from the previous scan have been processed. Supports ISO 8601 notation strings, such as `PT20.345S` or `PT15M`, and notation strings for seconds (`60s`) and milliseconds (`1600ms`).
`count` | No | Integer | Specifies how many times a bucket will be scanned. Defaults to `Integer.MAX_VALUE`.