Lambda-promtail: Add support for processing SQS messages, add promtailClient Type, add logger, upgrade dependencies and fix unexpected flushing behaviors #8231
Conversation
./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+ ingester           0%
+ distributor        0%
+ querier            0%
+ querier/queryrange 0%
+ iter               0%
+ storage            0%
+ chunkenc           0%
+ logql              0%
+ loki               0%
A few minor comments.

Reviewing this, I also started thinking that it might make sense to implement a `client` type within lambda-promtail that we could attach functions to, and it could store the `promtailClient` reference rather than creating more package-level globals.
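As a rough sketch of that suggestion (the names `newPromtailClient` and `sendToPromtail` are illustrative, not necessarily what the PR ended up with), such a type could hold the write address and a shared HTTP client instead of package-level globals:

```go
package lambdapromtail // hypothetical package name, for illustration only

import (
	"bytes"
	"context"
	"net/http"
	"time"
)

// promtailClient bundles the configuration and the shared HTTP client so that
// send-related functions can hang off it instead of reading package-level globals.
type promtailClient struct {
	writeAddress string
	http         *http.Client
}

func newPromtailClient(writeAddress string, timeout time.Duration) *promtailClient {
	return &promtailClient{
		writeAddress: writeAddress,
		http:         &http.Client{Timeout: timeout},
	}
}

// sendToPromtail pushes an already-encoded payload to the configured endpoint.
func (c *promtailClient) sendToPromtail(ctx context.Context, payload []byte) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, c.writeAddress, bytes.NewReader(payload))
	if err != nil {
		return err
	}
	resp, err := c.http.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}
```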
@@ -51,7 +52,7 @@ func getS3Object(ctx context.Context, labels map[string]string) (io.ReadCloser,
		s3Client = s3.NewFromConfig(cfg)
		s3Clients[labels["bucket_region"]] = s3Client
	}

	fmt.Println("fetching", labels["key"])
did you mean to include this println?
I used it for debugging purposes and ended up leaving it there, since it's a useful piece of information (it immediately tells you which event/file is being processed) and it only prints one line per Lambda invocation (so it's not going to cost a lot in CloudWatch Logs).

As the code grows, maybe we should have a more structured approach. Maybe we could remove this line and open another MR to use the standard logger (and set the LOG_LEVEL or DEBUG using an environment variable)? WDYT?
I'm fine with keeping it as is or implementing something more structured.
Okay, I implemented the logger in my latest commit. I went with go-kit since it's the one used in the main project.

Thanks! I will update this PR with your comments and the new client type ✌️
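For reference, a minimal sketch of what a go-kit logger driven by a `LOG_LEVEL` environment variable could look like (illustrative only; the actual wiring in the PR may differ):

```go
package main

import (
	"os"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
)

// newLogger builds a go-kit logfmt logger whose verbosity is controlled by
// the LOG_LEVEL environment variable (e.g. "debug" or anything else for info).
func newLogger() log.Logger {
	logger := log.NewLogfmtLogger(log.NewSyncWriter(os.Stderr))
	logger = log.With(logger, "ts", log.DefaultTimestampUTC, "caller", log.DefaultCaller)

	switch os.Getenv("LOG_LEVEL") {
	case "debug":
		logger = level.NewFilter(logger, level.AllowDebug())
	default:
		logger = level.NewFilter(logger, level.AllowInfo())
	}
	return logger
}

func main() {
	logger := newLogger()
	// One line per invocation, e.g. which S3 object is being fetched.
	level.Info(logger).Log("msg", "fetching s3 object", "key", "example-key")
}
```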
Looks like there's a conflict with your CHANGELOG.md changes. It might be easiest to remove that commit, merge upstream main into your branch, and then make your changelog entry again. I'll have a look at the client and logger changes in the morning.
Force-pushed from a0b2d22 to ffdb7a4.
Force-pushed from ffdb7a4 to 06bf6a1.
		s3Clients[labels["bucket_region"]] = s3Client
	}

	fmt.Println("fetching", labels["key"])
	obj, err := s3Client.GetObject(ctx,
		&s3.GetObjectInput{
			Bucket:              aws.String(labels["bucket"]),
			Key:                 aws.String(labels["key"]),
			ExpectedBucketOwner: aws.String(labels["bucketOwner"]),
		})
	if err != nil {
		fmt.Printf("Failed to get object %s from bucket %s on account %s\n", labels["key"], labels["bucket"], labels["bucketOwner"])
		return nil, err
why are we removing the object fetching code here and moving it?
This `getS3Object` function does two different things: it gets an `*s3.Client` from the client map (and updates the map when creating a new client), and then it fetches the S3 object.

I thought it made more sense to move the client part to a dedicated `getS3Client` function (besides, one of the bugs fixed with this MR is actually linked to a "do_something_with_a_client" function where the said client was created in the same function).

Doing so left the `getS3Object` function as a wrapper that just called `s3.GetObject` without any other action, so it made little sense to keep it, which is why I moved that call back into `processS3Event`.

If it does not make sense and/or it is not a desired change, I can just set it back the way it was. What do you think?
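For illustration, the refactor described above could look roughly like this sketch, with `getS3Client` owning the per-region client cache and the caller doing the `GetObject` itself (simplified names, not the exact diff from the PR):

```go
package lambdapromtail // hypothetical package name, for illustration only

import (
	"context"
	"io"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// One S3 client per bucket region, cached across invocations.
var s3Clients = map[string]*s3.Client{}

// getS3Client returns a cached *s3.Client for the given region, creating and
// caching one on first use.
func getS3Client(ctx context.Context, region string) (*s3.Client, error) {
	if client, ok := s3Clients[region]; ok {
		return client, nil
	}
	cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion(region))
	if err != nil {
		return nil, err
	}
	client := s3.NewFromConfig(cfg)
	s3Clients[region] = client
	return client, nil
}

// fetchS3Object stands in for the fetching that now happens directly in
// processS3Event rather than in a thin getS3Object wrapper.
func fetchS3Object(ctx context.Context, labels map[string]string) (io.ReadCloser, error) {
	client, err := getS3Client(ctx, labels["bucket_region"])
	if err != nil {
		return nil, err
	}
	obj, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket:              aws.String(labels["bucket"]),
		Key:                 aws.String(labels["key"]),
		ExpectedBucketOwner: aws.String(labels["bucketOwner"]),
	})
	if err != nil {
		return nil, err
	}
	return obj.Body, nil
}
```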
👍 got it
Force-pushed from 16ee004 to 48a597b.
[Docs squad] A couple of small suggestions.
[Docs squad] Docs look good; I'll leave it to engineering to approve the code.
I appreciate the bug fixes 👍 Having never used SQS myself, I can't speak to the validity of the SQS-specific code or how you're planning to use it, but I trust that you or other users will exercise it enough to point out any issues that need fixing going forward. I'll leave this open for a day to have one last read-through and see if anyone else has comments, and then merge.
Glad I could help. I've been running this code for days/weeks now and it runs smoothly. Thanks for the review. ✌️
Force-pushed from eabe6df to 6c54cd5.
Hey @CCOLLOT, my bad for leaving this unmerged. Do you mind just resolving the conflicts? I will merge after that.
There you go. I had to fix
…8700)

## Add Terraform sample code to process AWS S3 log files through an SQS queue

Lambda-promtail supports processing AWS S3 log files through an SQS queue since #8231. As explained in the [documentation section of lambda-promtail](https://github.com/grafana/loki/blob/main/docs/sources/clients/lambda-promtail/_index.md#triggering-lambda-promtail-via-sqs:~:text=Triggering%20Lambda%2DPromtail,DLQ%20redrive%20feature.), this can be leveraged to re-process logs that lambda-promtail failed to process (or send to Loki), using a main SQS queue and a secondary SQS dead-letter queue. AWS has a feature called `SQS redrive`, which enables routing messages pending in the DLQ back to the main (source) queue. This PR demonstrates how to spin up this architecture using Terraform.
It looks like grafana#8231 clashed with grafana#8750 in such a way that applying the module fails.
**What this PR does / why we need it**: Currently, the Terraform module for lambda-promtail is not working due to a misnamed reference:

```
╷
│ Error: Reference to undeclared resource
│
│   on .terraform/modules/lambda_promtail/tools/lambda-promtail/sqs.tf line 66, in resource "aws_iam_role_policy_attachment" "lambda_sqs_execution":
│   66: role = aws_iam_role.iam_for_lambda.name
│
│ A managed resource "aws_iam_role" "iam_for_lambda" has not been declared in
│ module.lambda_promtail.
╵
```

It looks like this was the result of a conflict with #8231 and #8750. While I was in there I also refactored the SQS work to be more consistent with #8750. I also made the queue name prefix configurable with the `sqs_queue_name_prefix` variable, as it was hardcoded before, which could cause issues if multiple instances of this module are initialized in the same AWS account and region.

**Which issue(s) this PR fixes**: Fixes #<issue number>

**Special notes for your reviewer**:

**Checklist**
- [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**)
- [ ] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] If the change is worth mentioning in the release notes, add `add-to-release-notes` label
- [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)
What this PR does / why we need it:

Context

I want to be able to use the AWS `SQS redrive` feature to process logs after fixing an outage in the ingestion flow, and not be limited by Lambda giving up after 2 (default) retries.

I faced several issues:

- `BATCH_SIZE`: the current implementation does not reset the size of the batch, leading to a flush to loki/promtail for every single added line after reaching `BATCH_SIZE` (and making the file descriptor issue described below much worse, because the `http.Client` is currently created in the `send` function).

Added Feature

Lambda-promtail can now process AWS S3 log files delivered through an SQS queue (SQS messages carrying S3 event notifications).
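The SQS handling itself is not shown in the excerpts above; as a rough sketch of the general pattern using the standard `aws-lambda-go` event types (function names are illustrative, and `processS3Event` is a stand-in for the existing S3 path):

```go
package lambdapromtail // hypothetical package name, for illustration only

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/aws/aws-lambda-go/events"
)

// processSQSEvent unwraps the S3 event notification carried in each SQS
// message body and hands it to the existing S3 processing path.
func processSQSEvent(ctx context.Context, evt *events.SQSEvent) error {
	for _, record := range evt.Records {
		var s3Event events.S3Event
		if err := json.Unmarshal([]byte(record.Body), &s3Event); err != nil {
			return fmt.Errorf("failed to unmarshal S3 event from SQS message %s: %w", record.MessageId, err)
		}
		if err := processS3Event(ctx, &s3Event); err != nil {
			return err
		}
	}
	return nil
}

// processS3Event stands in for the existing S3 handling code path
// (fetching the object and forwarding its lines to promtail/Loki).
func processS3Event(ctx context.Context, evt *events.S3Event) error {
	return nil
}
```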
Fixed bugs

- The `flushBatch` function would reset the stream after flushing, but it wouldn't reset the size of the batch, leading to lambda-promtail flushing every single following line to loki/promtail. In my case it led to logs being processed very slowly by Lambda when the total size is way above the default `batchSize`; I would reach the 15 min timeout on AWS Lambda for almost every single file.
- The `http.Client` used to flush logs to loki/promtail was being recreated thousands of times, which made the process quickly run out of file descriptors (limited to 1024 in AWS Lambda). I set it as a global variable to avoid impacting other parts of the code for now, but if required we could do it a little cleaner via dependency injection.

After fixing those bugs, my average Lambda execution time for VPC flow log files (between 10 MB and 100 MB each) went from timing out at 15 min to ~3 seconds.
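For illustration, the two fixes could be sketched as follows: a package-level `http.Client` that is reused across sends, and a batch that resets its size counter when flushed (simplified names, not the exact code from the PR):

```go
package lambdapromtail // hypothetical package name, for illustration only

import (
	"bytes"
	"context"
	"net/http"
	"strings"
	"time"
)

// A single shared HTTP client: recreating a client for every send quickly
// exhausts the ~1024 file descriptors available to a Lambda function.
var promtailHTTPClient = &http.Client{Timeout: 5 * time.Second}

type batch struct {
	lines []string
	size  int // running byte size of the batch, compared against BATCH_SIZE
}

func (b *batch) add(line string) {
	b.lines = append(b.lines, line)
	b.size += len(line)
}

// flush sends the batch and resets both the entries and the size counter.
// Forgetting the size reset means that once BATCH_SIZE has been reached,
// every subsequent line triggers another flush.
func (b *batch) flush(ctx context.Context, url string) error {
	// In the real code this would be an encoded Loki push request.
	payload := bytes.NewBufferString(strings.Join(b.lines, "\n"))

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, payload)
	if err != nil {
		return err
	}
	resp, err := promtailHTTPClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	b.lines = b.lines[:0]
	b.size = 0
	return nil
}
```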
Upgrades:

- Go `1.19`
- Dependencies, via `go get -u ./...`

Other

Special notes for your reviewer:

Checklist

- Reviewed the `CONTRIBUTING.md` guide (required)
- `CHANGELOG.md` updated
- Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`