[8.17](backport #42306) x-pack/filebeat/docs - document gzip S3 object handling #42342

Merged 2 commits on Jan 21, 2025
Fix merge conflicts
andrewkroh committed Jan 17, 2025
commit b354c42ace25f12589b2284078226e8f0cde75d1
76 changes: 0 additions & 76 deletions x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc
@@ -82,81 +82,6 @@ Please see <<aws-credentials-config,Configuration parameters>> for alternate AWS
expand_event_list_from_field: Records
----

<<<<<<< HEAD
The `aws-s3` input supports the following configuration options plus the
<<{beatname_lc}-input-{type}-common-options>> described later.

=======
[float]
=== Document ID Generation

This `aws-s3` input feature prevents duplicate events in Elasticsearch by
generating a custom document `_id` for each event, rather than relying on
Elasticsearch to automatically generate one. Each document in an Elasticsearch
index must have a unique `_id`, and {beatname_uc} uses this property to avoid
ingesting duplicate events.

The custom `_id` is based on several pieces of information from the S3 object:
the Last-Modified timestamp, the bucket ARN, the object key, and the byte
offset of the data in the event.
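
As an illustration (a hypothetical sketch, not Filebeat's actual encoding), any
stable hash over those four fields yields the same `_id` whenever the same
event is reprocessed:

```python
import hashlib

def s3_event_id(last_modified: str, bucket_arn: str, object_key: str, offset: int) -> str:
    """Hypothetical: derive a deterministic document _id from S3 object
    metadata plus the byte offset of the event within the object."""
    raw = f"{last_modified}-{bucket_arn}-{object_key}-{offset}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Retrying the same event reproduces the same _id, so Elasticsearch
# treats the second write as an overwrite instead of a new document.
a = s3_event_id("2025-01-17T10:00:00Z", "arn:aws:s3:::my-bucket", "logs/app.log.gz", 0)
b = s3_event_id("2025-01-17T10:00:00Z", "arn:aws:s3:::my-bucket", "logs/app.log.gz", 0)
assert a == b
```

Because the byte offset is part of the input, two different events read from
the same object still receive distinct IDs.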

Duplicate prevention is particularly useful in scenarios where {beatname_uc}
needs to retry an operation. {beatname_uc} guarantees at-least-once delivery,
meaning it will retry any failed or incomplete operations. These retries may be
triggered by issues with the host, {beatname_uc}, network connectivity, or
services such as Elasticsearch, SQS, or S3.

[float]
==== Limitations of `_id`-Based Deduplication

There are some limitations to consider when using `_id`-based deduplication in
Elasticsearch:

* Deduplication works only within a single index. The same `_id` can exist in
different indices, which is important if you're using data streams or index
aliases. When the backing index rolls over, a duplicate may be ingested.

* Indexing operations in Elasticsearch may take longer when an `_id` is
specified. Elasticsearch needs to check if the ID already exists before
writing, which can increase the time required for indexing.

[float]
==== Disabling Duplicate Prevention

If you want to disable the `_id`-based deduplication, you can remove the
document `_id` using the <<drop-fields,`drop_fields`>> processor in
{beatname_uc}.

["source","yaml",subs="attributes"]
----
{beatname_lc}.inputs:
- type: aws-s3
queue_url: https://queue.amazonaws.com/80398EXAMPLE/MyQueue
processors:
- drop_fields:
fields:
- '@metadata._id'
ignore_missing: true
----
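
The effect of that processor can be mimicked in a few lines (a hypothetical
sketch of the behavior, not Beats code): delete `_id` from the event's
`@metadata` map when present, so Elasticsearch assigns an auto-generated ID
instead.

```python
def drop_metadata_id(event: dict) -> None:
    """Mimic drop_fields with ignore_missing: true for '@metadata._id':
    remove the key when present, do nothing when it is absent."""
    event.get("@metadata", {}).pop("_id", None)

evt = {"@metadata": {"_id": "abc123"}, "message": "GET /index.html 200"}
drop_metadata_id(evt)
assert "_id" not in evt["@metadata"]

# ignore_missing: calling it again, or on an event without @metadata,
# is a harmless no-op.
drop_metadata_id(evt)
drop_metadata_id({"message": "no metadata"})
```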

Alternatively, you can remove the `_id` field using an Elasticsearch Ingest
Node pipeline.

["source","json",subs="attributes"]
----
{
"processors": [
{
"remove": {
"if": "ctx.input?.type == \"aws-s3\"",
"field": "_id",
"ignore_missing": true
}
}
]
}
----

[float]
=== Handling Compressed Objects

@@ -174,7 +99,6 @@ The `aws-s3` input supports the following configuration options plus the
NOTE: For time durations, valid time units are "ns", "us" (or "µs"), "ms",
"s", "m", and "h". For example, "2h".
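
For illustration only (the input relies on Go-style duration parsing; this
sketch is not the actual implementation), such duration strings can be
interpreted like this:

```python
import re

# Seconds per unit; "us" and "µs" are synonyms, matching the note above.
UNITS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}

def parse_duration(text: str) -> float:
    """Parse strings like "2h", "90s", or "1h30m" into seconds.
    Illustrative sketch; not Go's time.ParseDuration."""
    matches = re.findall(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)", text)
    if not matches:
        raise ValueError(f"invalid duration: {text!r}")
    return sum(float(value) * UNITS[unit] for value, unit in matches)

assert parse_duration("2h") == 7200.0
```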

>>>>>>> 7fd2d46de (x-pack/filebeat/docs/ - document gzip S3 object handling (#42306))
[float]
==== `api_timeout`
