An end-to-end GoodReads data pipeline for building a data lake, data warehouse, and analytics platform.
Terraform module to provision an Elastic MapReduce (EMR) cluster on AWS
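For orientation, here is a minimal boto3 sketch of the kind of cluster such a module provisions; the cluster name, instance types, IAM roles, and log bucket are placeholder assumptions, and the Terraform module itself expresses these resources declaratively rather than through this API.

```python
import boto3

# Minimal sketch only: names, instance types, roles, and buckets are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="demo-cluster",                # placeholder cluster name
    ReleaseLabel="emr-6.7.0",           # assumed EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR instance profile
    ServiceRole="EMR_DefaultRole",      # default EMR service role
    LogUri="s3://my-bucket/emr-logs/",  # placeholder log bucket
)
print(response["JobFlowId"])
```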
BERT, AWS RDS, AWS Forecast, EMR Spark cluster, Hive, Serverless, Google Assistant + Raspberry Pi, infrared, Google Cloud Platform Natural Language, anomaly detection, TensorFlow, mathematics
Classwork projects and homework completed for the Udacity Data Engineering Nanodegree
Reference architectures for data lakes on AWS
Bits of code I use during live demos
The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly, so you can focus on writing PySpark code.
This project demonstrates the use of Amazon Elastic MapReduce (EMR) for processing large datasets with Apache Spark. It includes a Spark script for ETL (Extract, Transform, Load) operations, AWS command line instructions for setting up and managing the EMR cluster, and a dataset for testing and demonstration.
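As a rough illustration of what a Spark ETL script for EMR looks like, here is a minimal PySpark sketch; the bucket paths and column names are assumptions for illustration, not the repository's actual script.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("emr-etl-demo").getOrCreate()

# Extract: read raw CSV data from S3 (placeholder bucket and columns)
raw = spark.read.csv("s3://my-bucket/raw/events/", header=True, inferSchema=True)

# Transform: drop incomplete rows and aggregate amounts per day
daily = (
    raw.dropna(subset=["event_date", "amount"])
       .withColumn("event_date", F.to_date("event_date"))
       .groupBy("event_date")
       .agg(F.sum("amount").alias("total_amount"))
)

# Load: write the result back to S3 as partitioned Parquet
daily.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://my-bucket/curated/daily/"
)

spark.stop()
```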
A large-scale data framework for storing and analyzing financial market data and driving future investment predictions.
Apache Spark TPC-DS benchmark setup, including EMR cluster launch
A Cassandra architecture for the GDELT database 🌍
A boilerplate for Spark projects, with Docker support for local development and scripts for EMR support.
An end-to-end data pipeline for building a data lake and supporting reports using Apache Spark.
Uses EMR clusters to export DynamoDB tables to S3 and generates import steps
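A hedged sketch of how such an export step might be submitted with boto3 follows; the jar location and tool class are assumptions based on the emr-dynamodb-connector tooling shipped on EMR, and the cluster id, table, and bucket are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Sketch only: the jar path and tool class are assumed from
# emr-dynamodb-connector and vary by EMR release; the table name,
# bucket, and cluster id are placeholders.
export_step = {
    "Name": "export-my-table",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "bash", "-c",
            "hadoop jar /usr/share/aws/emr/ddb/lib/emr-ddb-tools-*.jar "
            "org.apache.hadoop.dynamodb.tools.DynamoDBExport "
            "s3://my-bucket/exports/my-table/ my-table",
        ],
    },
}

response = emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[export_step])
print(response["StepIds"])
```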
Creates a data pipeline on AWS to execute batch processing in a Spark cluster provisioned by Amazon EMR. ETL using managed Airflow: extract data from S3, transform it with Spark, and load the transformed data back to S3.
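As an illustration of that orchestration pattern, a minimal Airflow DAG using the Amazon provider's EMR operators might look like the sketch below; the DAG id, cluster configuration, and script location are placeholder assumptions, and it presumes Airflow 2.x with apache-airflow-providers-amazon installed.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)

# Placeholder cluster config and Spark step; bucket paths are assumptions.
JOB_FLOW_OVERRIDES = {
    "Name": "batch-etl",
    "ReleaseLabel": "emr-6.7.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

SPARK_STEPS = [{
    "Name": "transform",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://my-bucket/scripts/etl_job.py"],
    },
}]

with DAG("emr_batch_pipeline", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_cluster", job_flow_overrides=JOB_FLOW_OVERRIDES,
    )
    add_steps = EmrAddStepsOperator(
        task_id="add_steps", job_flow_id=create_cluster.output, steps=SPARK_STEPS,
    )
    terminate = EmrTerminateJobFlowOperator(
        task_id="terminate_cluster", job_flow_id=create_cluster.output,
        trigger_rule="all_done",  # tear down the cluster even if a step fails
    )
    create_cluster >> add_steps >> terminate
```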
An ETL application on AWS using open sales and customer data, available here: https://github.com/camposvinicius/data/blob/main/AdventureWorks.zip. It is a zipped file containing several CSVs to which transformations are applied.
This project demonstrates the efficiency of distributed computing and distributed databases. Multiclass classification algorithms in Spark's machine learning library were used to predict the Air Quality Index in California.
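As a sketch of that approach, a multiclass classifier in Spark MLlib could be assembled as follows; the dataset path, feature columns, and the choice of a random forest are illustrative assumptions rather than the project's actual pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("aqi-multiclass").getOrCreate()

# Placeholder dataset: pollutant readings plus an AQI category label.
df = spark.read.parquet("s3://my-bucket/air-quality/ca/")

indexer = StringIndexer(inputCol="aqi_category", outputCol="label")
assembler = VectorAssembler(inputCols=["pm25", "pm10", "o3", "no2"],
                            outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[indexer, assembler, rf]).fit(train)

# Evaluate multiclass accuracy on the held-out split
accuracy = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy"
).evaluate(model.transform(test))
print(f"test accuracy: {accuracy:.3f}")
```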
Event-driven EMR via Serverless
Hosting a data lake with bid-ask data in S3 using Spark and Airflow