[filebeat][GCS] - Improved documentation #41143

Merged 4 commits on Oct 8, 2024

Changes from 1 commit
updated and improved documentation
ShourieG committed Oct 6, 2024
commit b956eb780fa7acee15377a800e439fab53da4f4b
14 changes: 11 additions & 3 deletions x-pack/filebeat/docs/inputs/input-gcs.asciidoc
@@ -216,14 +216,16 @@ It can be defined in the following formats: `{{x}}s`, `{{x}}m`, `{{x}}h`, here
If no value is specified, it defaults to `50 seconds`. This attribute can be specified both at the root level of the configuration as well as at the bucket level.
The bucket level values will always take priority and override the root level values if both are specified.

NOTE: The `bucket_timeout` value should depend on the size of the files and the network speed. If the timeout is too low, the input will not be able to read the files completely and `context_deadline_exceeded` errors will appear in the logs. If the timeout is too high, the input will wait a long time for a file to be read, which can make the input slow. The ratio between `bucket_timeout` and `poll_interval` should be considered when setting both values: a low `poll_interval` combined with a very high `bucket_timeout` can cause resource utilization issues, since schedule ops are spawned every poll iteration, and if previous poll ops are still running this can become a bottleneck over time.
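As a rough illustration, here is a minimal sketch of setting `bucket_timeout` at both levels; the project id, credentials path, bucket names and timeout values are hypothetical and should be tuned to your file sizes and network speed:

[source, yaml]
----
filebeat.inputs:
- type: gcs
  project_id: my_project_id                  # hypothetical project id
  auth.credentials_file.path: "/path/to/creds.json"
  bucket_timeout: 60s                        # root level value, inherited by all buckets
  buckets:
  - name: small_files_bucket                 # inherits the root level 60s timeout
  - name: large_files_bucket
    bucket_timeout: 300s                     # bucket level override for larger files
----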

[id="attrib-max_workers-gcs"]
[float]
==== `max_workers`

This attribute defines the maximum number of workers (goroutines / lightweight threads) that are allocated in the worker pool (thread pool) for processing jobs
which read contents of file. More number of workers equals a greater amount of concurrency achieved. There is an upper cap of `5000` workers per bucket that
can be defined due to internal sdk constraints. This attribute can be specified both at the root level of the configuration as well at the bucket level.
The bucket level values will always take priority and override the root level values if both are specified.
which read the contents of files. This attribute can be specified both at the root level of the configuration as well as at the bucket level. The bucket level values will always take priority and override the root level values if both are specified. A higher number of workers does not always translate to a greater amount of concurrency; this should be carefully tuned based on the number of files, the size of the files being processed and the resources available. Increasing `max_workers` to very high values can cause resource utilization issues and can lead to a bottleneck in processing. A maximum cap of `2000` workers is usually recommended.

NOTE: The value of `max_workers` is currently tied internally to the `batch_size`. The `batch_size` determines how many objects are fetched in a single call. The `max_workers` value should be set based on the number of files to be processed and the network speed. A very low `max_workers` count will drastically increase the number of network calls required to fetch the objects, which can cause a bottleneck in processing. The `max_workers` size is tied to the `batch_size` to ensure an even distribution of workloads across all goroutines, which ensures that the input is able to process the files efficiently.
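A minimal sketch of tuning `max_workers` per bucket follows; the names and values are hypothetical and should be tuned to the number of files, their sizes and the available resources:

[source, yaml]
----
filebeat.inputs:
- type: gcs
  project_id: my_project_id                  # hypothetical project id
  auth.credentials_file.path: "/path/to/creds.json"
  max_workers: 10                            # root level default for all buckets
  buckets:
  - name: high_volume_bucket
    max_workers: 100                         # bucket level override for a busier bucket
  - name: low_volume_bucket                  # inherits the root level value of 10
----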

[id="attrib-poll-gcs"]
[float]
@@ -243,6 +245,8 @@ Example: `10s` means that polling will occur every 10 seconds.
This attribute can be specified both at the root level of the configuration as well as at the bucket level. The bucket level values will always take priority
and override the root level values if both are specified.

NOTE: In an ideal scenario, the `poll_interval` should be set to a value equal to the `bucket_timeout`. This ensures that another schedule op is not started before all the current buckets have been processed. If the `poll_interval` is set to a value less than the `bucket_timeout`, the input will start another schedule op before the current one has finished, which can cause a bottleneck over time. A lower `poll_interval` can make the input faster, at the cost of increased resource utilization.
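For example, a minimal sketch (with hypothetical names and values) that matches the `poll_interval` to the `bucket_timeout` as suggested above:

[source, yaml]
----
filebeat.inputs:
- type: gcs
  project_id: my_project_id                  # hypothetical project id
  auth.credentials_file.path: "/path/to/creds.json"
  buckets:
  - name: logs_bucket                        # hypothetical bucket
    poll: true
    poll_interval: 300s                      # equal to bucket_timeout, so a new schedule op
    bucket_timeout: 300s                     # only starts once the current one can complete
----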

[id="attrib-parse_json"]
[float]
==== `parse_json`
@@ -276,6 +280,8 @@ filebeat.inputs:
- regex: '/Security-Logs/'
----

NOTE: The `file_selectors` op is performed locally within the agent, which scales vertically, so using this option causes the agent to download all the files and then filter them. This can cause a bottleneck in processing if the number of files is very high. It is recommended to use this attribute only when the number of files is limited or ample resources are available.
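To make the truncated snippet above concrete, here is a sketch of a complete `file_selectors` block; the project id, credentials path, bucket name and the second pattern are hypothetical:

[source, yaml]
----
filebeat.inputs:
- type: gcs
  project_id: my_project_id                  # hypothetical project id
  auth.credentials_file.path: "/path/to/creds.json"
  buckets:
  - name: obs-bucket                         # hypothetical bucket
    poll: true
    poll_interval: 15s
    file_selectors:
    - regex: '/Security-Logs/'               # only objects matching a regex are processed
    - regex: '/Audit-Logs/'                  # hypothetical additional pattern
----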

[id="attrib-expand_event_list_from_field-gcs"]
[float]
==== `expand_event_list_from_field`
@@ -341,6 +347,8 @@ filebeat.inputs:
timestamp_epoch: 1630444800
----

NOTE: The GCS APIs don't provide a direct way to filter files based on timestamp, so the input downloads all the files and then filters them based on the timestamp. This can cause a bottleneck in processing if the number of files is very high. It is recommended to use this attribute only when the number of files is limited or ample resources are available. This option scales vertically, not horizontally.

[id="bucket-overrides"]
*The sample configs below further explain bucket level overriding of attributes:*
