Design and implement an end-to-end serverless data processing pipeline using AWS services provisioned through Terraform. The pipeline should ingest new order files delivered to an S3 bucket, process them to generate analytics reports, and write the results to S3 as tables registered in AWS Glue for querying.
You are tasked with building a data processing pipeline using the following AWS services:
- S3: Acts as the storage layer for incoming order files and output analytics reports.
- AWS Lambda (Dockerized): Processes new order files from S3 and generates analytics reports.
- AWS Glue: Manages the metadata for input and output data and enables querying of the analytics reports.
Terraform Resources:
Write Terraform code to provision the following resources:
- S3 Buckets:
- One bucket to store incoming order files.
- One bucket to store the processed analytics reports.
- AWS Lambda Function:
- Use a Docker container for the Lambda function runtime.
- The function should process new files uploaded to the input S3 bucket and generate the analytics reports.
- AWS Glue Resources:
- Create a Glue database and tables in Terraform to store the metadata for the processed analytics reports (a minimal sketch follows this list).
- The tables should correspond to the following analytics reports:
- Most Profitable Region: A table containing the region and its total profit.
- Most Common Shipping Method by Category: A table mapping each product category to its most common shipping method.
- Order Counts by Category and Sub-Category: A table showing the number of orders for each category and sub-category.
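As a rough illustration of the Glue portion, here is a minimal Terraform sketch; the database name, the `aws_s3_bucket.reports` reference, the S3 prefix, and the column names are all assumptions rather than part of the spec:

```hcl
# Sketch only: a Glue database plus one of the three report tables.
# Names, S3 paths, and columns are illustrative assumptions.
resource "aws_glue_catalog_database" "analytics" {
  name = "orders_analytics"
}

resource "aws_glue_catalog_table" "most_profitable_region" {
  name          = "most_profitable_region"
  database_name = aws_glue_catalog_database.analytics.name
  table_type    = "EXTERNAL_TABLE"

  storage_descriptor {
    # Points at the CSV output written by the Lambda function.
    location      = "s3://${aws_s3_bucket.reports.bucket}/most_profitable_region/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
      parameters            = { "field.delim" = "," }
    }

    columns {
      name = "region"
      type = "string"
    }
    columns {
      name = "total_profit"
      type = "double"
    }
  }
}
```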
Lambda Function Logic:
Write the Python code for the Lambda function to:
- Read the order data from newly uploaded files in the input S3 bucket (assume CSV format).
- Compute the required analytics reports:
- Most profitable region.
- Most common shipping method for each product category.
- Number of orders by category and sub-category.
- Output the results as CSV files to the output S3 bucket.
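A minimal sketch of the handler is shown below; the column names (`Region`, `Profit`, `Category`, `Sub-Category`, `Ship Mode`), the `OUTPUT_BUCKET` environment variable, and the output key prefixes are assumptions to be adapted to the actual CSV schema:

```python
"""Sketch of the Lambda handler; column names and env vars are assumptions."""
import io
import os

import boto3
import pandas as pd

s3 = boto3.client("s3")
OUTPUT_BUCKET = os.environ["OUTPUT_BUCKET"]  # assumed to be set by Terraform


def _write_csv(df: pd.DataFrame, key: str) -> None:
    """Serialize a report DataFrame to CSV and upload it to the output bucket."""
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    s3.put_object(Bucket=OUTPUT_BUCKET, Key=key, Body=buf.getvalue())


def lambda_handler(event, context):
    # The S3 trigger delivers one or more records per invocation.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        orders = pd.read_csv(body)

        # Report 1: region with the highest total profit.
        profit = orders.groupby("Region", as_index=False)["Profit"].sum()
        top_region = profit.sort_values("Profit", ascending=False).head(1)
        _write_csv(top_region, "most_profitable_region/report.csv")

        # Report 2: most common shipping method per category.
        ship = (
            orders.groupby("Category")["Ship Mode"]
            .agg(lambda s: s.mode().iloc[0])
            .reset_index()
        )
        _write_csv(ship, "shipping_method_by_category/report.csv")

        # Report 3: order counts by category and sub-category.
        counts = (
            orders.groupby(["Category", "Sub-Category"])
            .size()
            .reset_index(name="order_count")
        )
        _write_csv(counts, "order_counts_by_category/report.csv")

    return {"statusCode": 200}
```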
Deliverables:
- Terraform Code:
- A complete set of Terraform scripts to provision all required AWS resources.
- Clear comments and modular design (use of modules is a plus).
- Lambda Code:
- The Python code for the Lambda function, with clear documentation and structure.
- Include a Dockerfile to package the Lambda function as a Docker image (a sample is sketched after this list).
- Testing and Deployment Instructions:
- Provide instructions to deploy and test the infrastructure and code.
- Include commands for running Terraform, building the Docker image, and deploying the Lambda function.
- README updates describing how to deploy the application.
- Bonus:
- Provide SQL queries that can be used in Athena to query the Glue tables for the three analytics reports (an example query is sketched after this list).
- Add IAM policies that follow the principle of least privilege.
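For the Dockerfile deliverable, a typical image for a Python Lambda looks like the sketch below; the handler file name (`handler.py`) and the presence of a `requirements.txt` are assumptions:

```dockerfile
# Sketch: package the handler with the official AWS Lambda Python base image.
FROM public.ecr.aws/lambda/python:3.12

# Install third-party dependencies into the Lambda task root.
COPY requirements.txt .
RUN pip install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

# Copy the handler code and set the function entrypoint.
COPY handler.py ${LAMBDA_TASK_ROOT}
CMD ["handler.lambda_handler"]
```

For the Athena bonus, an example query against the hypothetical `orders_analytics` database sketched earlier (table and column names are assumptions):

```sql
-- Region with the highest total profit.
SELECT region, total_profit
FROM orders_analytics.most_profitable_region
ORDER BY total_profit DESC
LIMIT 1;
```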
Evaluation Criteria:
- Correctness: Does the pipeline meet the requirements and generate the expected outputs?
- Code Quality: Is the Terraform code modular and well-structured? Is the Lambda function code readable and efficient?
- Documentation: Are the deployment and testing instructions clear? Are resources and configurations well-documented?
- Best Practices: Are AWS resources secured and configured according to best practices (e.g., IAM roles, S3 bucket policies)?
- Discussion: quality of your responses to the discussion questions.
Good luck, and feel free to ask any clarifying questions!
To deploy the assignment infrastructure, run the following:
cd terraform/assignment
terraform init -backend-config="key=nmd-assignment-<candidate-name>.tfstate"
terraform plan -var 'aws_profile=<YOUR_PROFILE>'
terraform apply -var 'aws_profile=<YOUR_PROFILE>'
To test the Lambda function locally, build and run the container image:
docker build --platform linux/x86_64 -t docker-image:test .
docker run --platform linux/x86_64 -p 9000:8080 docker-image:test
This starts a local endpoint that you can post test events to. Run the following curl command to invoke the function running in the container, replacing '{}' with your event payload.
curl "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'