Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[filebeat][azure-blob-storage] - Added support for new features and removed partial save mechanism #36690

Merged
merged 16 commits into from
Oct 2, 2023

Conversation

ShourieG
Copy link
Contributor

@ShourieG ShourieG commented Sep 27, 2023

Type of change

  • Enhancement
  • Docs

Proposed commit message

Added support for the following features as part of a planned enhancement update:

  1. Root level splitting for JSON arrays, similar to GCS input.
  2. File selectors support similar to aws-s3 input, but only supports regex matching atm.
  3. A date filter for blobs based on unix epoch (seconds) format.
  4. Expand event list from field similar to aws-s3 input.
  5. Removed partial save mechanism for now due to concurrency issues.

These additions help make the input more robust for further use. Also added a couple of debug logs for better troubleshooting.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
    - [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

@ShourieG ShourieG requested a review from a team as a code owner September 27, 2023 14:16
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Sep 27, 2023
@mergify
Copy link
Contributor

mergify bot commented Sep 27, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @ShourieG? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Collaborator

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Sep 27, 2023
@ShourieG ShourieG added needs_team Indicates that the issue/PR needs a Team:* label 8.11-candidate labels Sep 27, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Sep 27, 2023
@ShourieG ShourieG added Filebeat Filebeat needs_team Indicates that the issue/PR needs a Team:* label labels Sep 27, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Sep 27, 2023
@ShourieG ShourieG requested a review from efd6 September 27, 2023 14:21
@ShourieG ShourieG added the backport-skip Skip notification from the automated backport with mergify label Sep 27, 2023
@elasticmachine
Copy link
Collaborator

elasticmachine commented Sep 27, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-09-29T08:34:05.170+0000

  • Duration: 75 min 37 sec

Test stats 🧪

Test Results
Failed 0
Passed 3193
Skipped 174
Total 3367

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Copy link
Contributor

@efd6 efd6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial look. I will have a deeper read tomorrow.

x-pack/filebeat/input/azureblobstorage/input.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/azureblobstorage/input.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/azureblobstorage/job.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/azureblobstorage/job.go Outdated Show resolved Hide resolved
if j.offset == 0 {
r, j.isRootArray, err = evaluateJSON(bufio.NewReader(r))
if err != nil {
return fmt.Errorf("failed to evaluate json for blob: %s, with error: %w", *j.blob.Name, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is j.blob.Name always non-nil?

Copy link
Contributor Author

@ShourieG ShourieG Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, a file/blob name will always exist, unless there is some unexpected issue from azure's end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check that at the site that the value becomes ours?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we fetch the blob pager, this value becomes localized on the machine.

x-pack/filebeat/input/azureblobstorage/job.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/azureblobstorage/job.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/azureblobstorage/job.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/azureblobstorage/mock/data_random.go Outdated Show resolved Hide resolved
Copy link
Contributor

@bhapas bhapas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added few nits..

Overall the unit testing seems to be limited. How is this tested otherwise?

@mergify
Copy link
Contributor

mergify bot commented Sep 28, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature/azure-blob-storage upstream/feature/azure-blob-storage
git merge upstream/main
git push upstream feature/azure-blob-storage

Copy link
Contributor

@bhapas bhapas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. But please wait for @efd6 approval

@ShourieG
Copy link
Contributor Author

@efd6 @bhapas addressed all of the suggestions

x-pack/filebeat/input/azureblobstorage/input_stateless.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/azureblobstorage/job.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/azureblobstorage/job.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/azureblobstorage/input.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/azureblobstorage/job.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/azureblobstorage/job.go Outdated Show resolved Hide resolved
[float]
==== `timestamp_epoch`

This attribute can be used to filter out files/blobs which have a timestamp greater than the specified value. The value of this attribute should be in unix `epoch` (seconds) format. The timestamp value is compared with the `LastModified Timestamp` obtained from the blob metadata. This attribute can be specified both at the root level of the configuration as well at the container level. The container level values will always take priority and override the root level values if both are specified.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v.Properties.LastModified.Unix() < *s.src.TimeStampEpoch { continue

The description doesn't match up with the code I saw. It seems like it filters blobs that are older than the value.

What is the use case for filtering based on a static value? I've seen inputs offer a relative time based filter when user specifies a duration since now like ignore_older: 168h or since: -168h.

Copy link
Contributor Author

@ShourieG ShourieG Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have updated the description. The use case here is that some customers in the past specially asked for the ability to filter out based on a specific date time, hence added this

// isValidUnixTimestamp checks if the timestamp is a valid Unix timestamp
func isValidUnixTimestamp(timestamp int64) bool {
// checks if the timestamp is within the valid range
return minTimestamp <= timestamp && timestamp <= maxTimestamp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What determines the valid range? I can understand validating that the value is non-negative. But why is there an upper range?

The only reason I can think of for checking the upper value is as a sanity check. It probably does not make sense to configure any value that is in the future.

The bound checking can be added to the struct tag and ucfg will enforce it. Like validate:"min=1" for >=1. https://pkg.go.dev/github.com/elastic/go-ucfg#Config.Unpack

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to this being an optional pointer value, struct tags validation causes a nil pointer deference. And yes this is used as a sanity check for permissible timestamp values.

x-pack/filebeat/input/azureblobstorage/config.go Outdated Show resolved Hide resolved
@ShourieG ShourieG changed the title [filebeat][azure-blob-storage] - Added support for new features. [filebeat][azure-blob-storage] - Added support for new features and removed partial save mechanism Sep 29, 2023
@ShourieG
Copy link
Contributor Author

@efd6 updated the PR with the removal of the partial saves

@ShourieG ShourieG merged commit ee55f1d into elastic:main Oct 2, 2023
@ShourieG ShourieG deleted the feature/azure-blob-storage branch October 2, 2023 04:10
Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
…emoved partial save mechanism (elastic#36690)

* added new debug logs and updated deprecated library methods

* added support for path prefix, date filter and root level array splitting

* added support for fileselectors, timestamp filtering and expand event field

* updated asciidoc, updated comments

* updated changelog

* addressed PR suggestions

* addressed linting errors

* addressed further PR suggestions

* updated FileSelectors assignment condition

* partial save mechanism removed for now due to concurrency issues

* updated changelog
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
8.11-candidate backport-skip Skip notification from the automated backport with mergify enhancement Filebeat Filebeat
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Meta] Improve Azure blob storage input
5 participants