aws-emr
Here are 60 public repositories matching this topic...
This project offers a ready-to-use AWS EMR template built on Spot Fleet and On-Demand Instances, so you can focus on writing PySpark code.
Updated Jun 13, 2022 - Python
An AWS-based solution that uses AWS CloudWatch and Python-based AWS Lambda to automatically terminate AWS EMR clusters that have been idle for a specified period of time.
Updated Jun 5, 2024 - Python
A batch processing data pipeline, using AWS resources (S3, EMR, Redshift, EC2, IAM), provisioned via Terraform, and orchestrated from locally hosted Airflow containers. The end product is a Superset dashboard and a Postgres database, hosted on an EC2 instance at this address (powered down):
Updated May 14, 2022 - Python
Cloud-based AI / ML workflow and data application development framework
Updated Aug 20, 2024 - Python
A Grafana-based application to assist Big Data infrastructure optimization initiatives where Spark applications are a dominant cost driver
Updated Jun 12, 2024 - Python
A collection of sample Airflow workflows for data processing on AWS
Updated Dec 1, 2017 - Python
Creates a data lake on AWS S3 that stores dimensional tables produced by processing data with Spark on an AWS EMR cluster
Updated Oct 10, 2019 - Python
A cookiecutter template for working with PySpark on AWS EMR
Updated Aug 30, 2020 - Python
My AWS Playground
Updated Jun 18, 2024 - Python
An EMR + Hadoop to Redshift ELT workflow that uses the Spark steps API and is orchestrated by Apache Airflow. It ingests disparate datasets centered on 7 GB of I94 arrivals information to produce a simple star schema in Redshift
Updated Feb 25, 2021 - Python
A daily incremental-load ETL pipeline for an e-commerce company, using AWS Lambda and an AWS EMR cluster and deployed with Apache Airflow in a Docker container.
Updated Mar 17, 2023 - Python
An ETL pipeline built with Airflow that downloads data from an AWS S3 bucket, runs a Spark/Spark SQL job on it to produce a cleaned-up dataset of orders that missed their delivery deadline, and uploads that dataset back to the same S3 bucket in a folder primed for higher-level analytics
Updated Feb 25, 2023 - Python
A generic Python library for provisioning EMR clusters from YAML config files (Configuration as Code)
Updated Dec 8, 2022 - Python
A Lambda function that starts an EMR cluster and runs a MapReduce job
Updated Aug 16, 2019 - Python
ETL pipeline with PySpark on EMR orchestrated with Airflow
Updated Mar 16, 2021 - Python