New-Math-Data/nmd-data-engineer-test-tf

Objective

Design and implement an end-to-end serverless data processing pipeline using AWS resources, provisioned through Terraform. This pipeline should ingest new order files delivered to an S3 bucket, process them to generate analytics reports, and output the results to Glue tables stored in S3 for querying.

Assignment Description

You are tasked with building a data processing pipeline using the following AWS services:

  • S3: Acts as the storage layer for incoming order files and output analytics reports.
  • AWS Lambda (Dockerized): Processes new order files from S3 and generates analytics reports.
  • AWS Glue: Manages the metadata for input and output data and enables querying of the analytics reports.

Requirements:

  1. Terraform Resources

Write Terraform code to provision the following resources:

  • S3 Buckets:
    • One bucket to store incoming order files.
    • One bucket to store the processed analytics reports.
  • AWS Lambda Function:
    • Use a Docker container for the Lambda function runtime.
    • The function should process new files uploaded to the input S3 bucket and generate the analytics reports.
  • AWS Glue Resources:
    • Create a Glue database and tables to store the metadata for the processed analytics reports (the Glue tables are created here).
    • The tables should correspond to the following analytics reports:
      1. Most Profitable Region: A table containing the region and its total profit.
      2. Most Common Shipping Method by Category: A table mapping each product category to its most common shipping method.
      3. Order Counts by Category and Sub-Category: A table showing the number of orders for each category and sub-category.
  2. Lambda Function Logic

Write the Python code for the Lambda function to:

  • Read the order data from newly uploaded files in the input S3 bucket (assume CSV format).
  • Compute the required analytics reports:
    • Most profitable region.
    • Most common shipping method for each product category.
    • Number of orders by category and sub-category.
  • Output the results as CSV files to the output S3 bucket (a minimal example handler follows this list).
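
Below is a minimal sketch of such a handler. It assumes Superstore-style column names (Region, Profit, Category, Sub-Category, Ship Mode), an OUTPUT_BUCKET environment variable set on the function, and fixed output key names; none of these are mandated by the assignment, so adjust them to the actual input schema and your Terraform configuration.

# Minimal sketch of the Lambda handler. Column names (Region, Profit, Category,
# Sub-Category, Ship Mode) and the OUTPUT_BUCKET environment variable are
# assumptions, not requirements of the assignment.
import csv
import io
import os
from collections import Counter, defaultdict

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    # The S3 put event identifies the uploaded order file.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = list(csv.DictReader(io.StringIO(body)))

    # Report 1: total profit per region, most profitable first.
    profit_by_region = defaultdict(float)
    for row in rows:
        profit_by_region[row["Region"]] += float(row["Profit"])

    # Report 2: most common shipping method per product category.
    ship_modes = defaultdict(Counter)
    for row in rows:
        ship_modes[row["Category"]][row["Ship Mode"]] += 1

    # Report 3: order counts per category and sub-category
    # (each CSV row is treated as one order line).
    order_counts = Counter((row["Category"], row["Sub-Category"]) for row in rows)

    output_bucket = os.environ["OUTPUT_BUCKET"]
    _write_csv(output_bucket, "most_profitable_region/report.csv",
               ["region", "total_profit"],
               sorted(profit_by_region.items(), key=lambda kv: kv[1], reverse=True))
    _write_csv(output_bucket, "common_shipping_method_by_category/report.csv",
               ["category", "ship_mode"],
               [(cat, counts.most_common(1)[0][0]) for cat, counts in ship_modes.items()])
    _write_csv(output_bucket, "order_counts_by_category/report.csv",
               ["category", "sub_category", "order_count"],
               [(cat, sub, n) for (cat, sub), n in order_counts.items()])

    return {"status": "ok", "rows_processed": len(rows)}


def _write_csv(bucket, key, header, rows):
    # Serialize a report in memory and upload it to the output bucket.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    s3.put_object(Bucket=bucket, Key=key, Body=buf.getvalue().encode("utf-8"))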

Deliverables:

  1. Terraform Code:
  • A complete set of Terraform scripts to provision all required AWS resources.
  • Clear comments and modular design (use of modules is a plus).
  2. Lambda Code:
  • The Python code for the Lambda function, with clear documentation and structure.
  • Include a Dockerfile to package the Lambda function as a Docker image.
  3. Testing and Deployment Instructions:
  • Provide instructions to deploy and test the infrastructure and code.
  • Include commands for running Terraform, building the Docker image, and deploying the Lambda function.
  4. README updates describing how to deploy the application.
  5. Bonus:
    • Provide SQL queries that can be used in Athena to query the Glue tables for the three analytics reports (an example of running one such query is sketched after this list).
    • Add IAM policies that follow the principle of least privilege.
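
As an illustration of the Athena bonus item, the sketch below runs a query for the most profitable region through boto3. The database name (orders_analytics), table name (most_profitable_region), column names, and result location are placeholders that would come from your own Glue and S3 configuration.

# Hypothetical example of querying one of the Glue tables in Athena.
# Database, table, column, and output-location names are placeholders.
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT region, total_profit
FROM most_profitable_region
ORDER BY total_profit DESC
LIMIT 1
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "orders_analytics"},
    ResultConfiguration={"OutputLocation": "s3://<output-bucket>/athena-results/"},
)
print(response["QueryExecutionId"])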

Evaluation Criteria:

  • Correctness: Does the pipeline meet the requirements and generate the expected outputs?
  • Code Quality: Is the Terraform code modular and well-structured? Is the Lambda function code readable and efficient?
  • Documentation: Are the deployment and testing instructions clear? Are resources and configurations well-documented?
  • Best Practices: Are AWS resources secured and configured according to best practices (e.g., IAM roles, S3 bucket policies)?
  • Discussion Questions: Are the responses to the discussion questions thoughtful and complete?

Good luck, and feel free to ask any clarifying questions!

Terraform

Deploying the assignment

To deploy the assignment, run:

cd terraform/assignment
terraform init -backend-config="key=nmd-assignment-<candidate-name>.tfstate"
terraform plan -var 'aws_profile=<YOUR_PROFILE>'
terraform apply -var 'aws_profile=<YOUR_PROFILE>'

Dockerfile - Test Locally

Test locally by running the following:

docker build --platform linux/x86_64 -t docker-image:test .
docker run --platform linux/x86_64 -p 9000:8080 docker-image:test

This starts a local endpoint that you can send test events to in order to see how the function handles them. Run the following curl command to invoke the Lambda running in the container, where '{}' contains your event payload (a scripted alternative is sketched after the command).

curl "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'
