The pipeline processes the data on an AWS EMR cluster with Spark
In this pipeline, I created a transient, Spark-based AWS EMR cluster. The cluster performs three steps (see the custom JAR script):
- The EMR cluster pulls the sparkProcessingScript.py script.
- The cluster reads the data from an S3 bucket (users_app_big_dataset.csv) and performs all the processing tasks on it.
- After processing, it saves the processed data back to the S3 bucket as a Parquet file.
If you want to experiment locally without an EMR cluster, check the sparkProcessingScript.ipynb notebook. You need Spark installed on your machine.