This project is a data processing application built with Apache Spark and Scala, designed to efficiently process, analyze, and transform large people-related datasets. It leverages Spark's distributed computing capabilities to handle CSV, JSON, and other structured data formats for scalable data ingestion, cleaning, and reporting. The codebase is modular, making it easy to extend with custom data pipelines or to integrate with additional data sources, and shell scripts are included for streamlined deployment and execution.
- Fast and scalable data processing using Apache Spark
- Written primarily in Scala for performance and maintainability
- Modular pipeline for data ingestion, transformation, and export
- Shell scripts for automation and ease of use
- Suitable for batch processing of large people-related datasets
- Data ingestion from CSV and other sources
- Data cleaning and transformation using the DataFrame, RDD, and Dataset APIs
- Example queries and aggregations (see the sketch after this list)
- Shell scripts for running Hadoop jobs
- Modular Scala code for reusability
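As a rough illustration of such a pipeline, here is a minimal, self-contained sketch that ingests the sample CSV, cleans it, and runs a simple aggregation with the DataFrame API. The object name and cleaning rules are illustrative, not taken from the repository's source:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Illustrative pipeline: ingest, clean, and aggregate the sample people data.
object PeopleJobSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("people-data-processor")
      .getOrCreate()

    // Ingest: read the sample CSV with a header row and inferred column types.
    val people = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/people.csv")

    // Clean: drop rows with missing values and keep plausible ages.
    val cleaned = people.na.drop().filter(col("age").between(0, 120))

    // Aggregate: a simple example query over the cleaned data.
    cleaned.agg(avg("age").as("average_age")).show()

    spark.stop()
  }
}
```

In a real deployment, logic like this would typically be packaged into a jar and launched through the shell scripts in `scripts/`.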
The repository is organized as follows:

```
.
├── src/
│   └── main/
│       └── scala/
│           └── [Scala source files]
├── data/
│   ├── people.csv
│   └── customers.csv
├── scripts/
│   └── [Shell scripts for running Hadoop jobs]
└── README.md
```
- src/main/scala/: Scala source code using Spark Structured APIs
- data/: Example datasets (e.g., people.csv, customers.csv)
- scripts/: Shell scripts to run Hadoop jobs
- Java 8 or above
- Scala 2.12.x or 2.13.x
- Apache Spark 3.x
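If the project is built with sbt (an assumption; check the repository for the actual build definition), a minimal `build.sbt` consistent with these prerequisites might look like:

```scala
// Hypothetical build.sbt; versions are examples matching the prerequisites
// above (Scala 2.12.x/2.13.x, Spark 3.x), not pinned by this repository.
ThisBuild / scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // "provided" because spark-submit supplies the Spark runtime on the cluster.
  "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided"
)
```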
Clone the repository:

```bash
git clone https://github.com/pavithra19/apache_spark_people_data_processor.git
cd apache_spark_people_data_processor
```
- Modify or replace `data/people.csv` with your own data.
- Adjust the code in `src/main/scala/` as needed for your use case.
- Use the provided shell scripts in `scripts/` to automate job execution on Hadoop.
The default `data/people.csv` should be in the following format:

```
name,age
Alice,19
Bob,20
Charlie,39
```
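For reference, a typed read of this format with the Dataset API might look like the following sketch. The `Person` case class and object name are hypothetical, not the repository's actual code:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical case class mirroring the name,age columns above.
case class Person(name: String, age: Int)

object PeopleDatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("people-dataset-example")
      .getOrCreate()
    import spark.implicits._

    val people = spark.read
      .option("header", "true")
      .schema(Encoders.product[Person].schema) // derive the schema from the case class
      .csv("data/people.csv")
      .as[Person]                              // typed Dataset[Person]

    // Example typed query: keep only rows with age >= 18.
    people.filter(_.age >= 18).show()

    spark.stop()
  }
}
```

Declaring an explicit schema this way avoids the extra pass over the file that `inferSchema` would otherwise trigger.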