This project is a data processing application built with Apache Spark and Scala, designed to efficiently process, analyze, and transform large people-related datasets. It leverages Spark's distributed computing capabilities to handle CSV, JSON, and other structured data formats for scalable data ingestion, cleaning, and reporting. The codebase is modular, making it easy to extend with custom data pipelines or to integrate with additional data sources, and shell scripts are included for streamlined deployment and execution.
- Fast and scalable data processing using Apache Spark
- Written primarily in Scala for performance and maintainability
- Modular pipeline for data ingestion, transformation, and export
- Shell scripts for automation and ease of use
- Suitable for batch processing of large people-related datasets
- Data ingestion from CSV and other sources
- Data cleaning and transformation using the DataFrame, RDD, and Dataset APIs
- Example queries and aggregations (see the sketch after this list)
- Shell scripts for running Hadoop jobs
- Modular Scala code for reusability
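As a rough illustration of such a pipeline, here is a minimal, self-contained sketch that ingests the sample CSV, cleans it, and runs a simple aggregation with the DataFrame API. The object name and cleaning rules are illustrative, not taken from the repository's source:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Illustrative pipeline: ingest, clean, and aggregate the sample people data.
object PeopleJobSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("people-data-processor")
      .getOrCreate()

    // Ingest: read the sample CSV with a header row and inferred column types.
    val people = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/people.csv")

    // Clean: drop rows with missing values and keep plausible ages.
    val cleaned = people.na.drop().filter(col("age").between(0, 120))

    // Aggregate: a simple example query over the cleaned data.
    cleaned.agg(avg("age").as("average_age")).show()

    spark.stop()
  }
}
```

In a real deployment, logic like this would typically be packaged into a jar and launched through the shell scripts in `scripts/`.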
The repository is organized as follows:

```
.
├── src/
│   └── main/
│       └── scala/
│           └── [Scala source files]
├── data/
│   ├── people.csv
│   └── customers.csv
├── scripts/
│   └── [Shell scripts for running Hadoop jobs]
└── README.md
```
- src/main/scala/: Scala source code using Spark Structured APIs
- data/: Example datasets (e.g., people.csv, customers.csv)
- scripts/: Shell scripts to run Hadoop jobs
- Java 8 or above
- Scala 2.12.x or 2.13.x
- Apache Spark 3.x
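If the project is built with sbt (an assumption; check the repository for the actual build definition), a minimal `build.sbt` consistent with these prerequisites might look like:

```scala
// Hypothetical build.sbt; versions are examples matching the prerequisites
// above (Scala 2.12.x/2.13.x, Spark 3.x), not pinned by this repository.
ThisBuild / scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // "provided" because spark-submit supplies the Spark runtime on the cluster.
  "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided"
)
```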
Clone the repository:

```bash
git clone https://github.com/pavithra19/apache_spark_people_data_processor.git
cd apache_spark_people_data_processor
```
- Modify or replace `data/people.csv` with your own data.
- Adjust the code in `src/main/scala/` as needed for your use case.
- Use the provided shell scripts in `scripts/` to automate job execution on Hadoop.
The default `data/people.csv` should be in the following format:

```
name,age
Alice,19
Bob,20
Charlie,39
```
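For reference, a typed read of this format with the Dataset API might look like the following sketch. The `Person` case class and object name are hypothetical, not the repository's actual code:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical case class mirroring the name,age columns above.
case class Person(name: String, age: Int)

object PeopleDatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("people-dataset-example")
      .getOrCreate()
    import spark.implicits._

    val people = spark.read
      .option("header", "true")
      .schema(Encoders.product[Person].schema) // derive the schema from the case class
      .csv("data/people.csv")
      .as[Person]                              // typed Dataset[Person]

    // Example typed query: keep only rows with age >= 18.
    people.filter(_.age >= 18).show()

    spark.stop()
  }
}
```

Declaring an explicit schema this way avoids the extra pass over the file that `inferSchema` would otherwise trigger.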